* [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory
@ 2026-05-29 17:26 Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 01/15] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau (Meta)
` (14 more replies)
0 siblings, 15 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
This series adds userfaultfd support for tracking the working set of
VM guest memory, so a VMM can identify cold pages and evict them to
tiered or remote storage.
v1: https://lore.kernel.org/all/20260427114607.4068647-1-kas@kernel.org/
v2: https://lore.kernel.org/all/cover.1778254670.git.kas@kernel.org/
v3: https://lore.kernel.org/all/20260522133857.552279-1-kirill@shutemov.name/
v4: https://lore.kernel.org/all/20260525113737.1942478-1-kas@kernel.org/
v5: https://lore.kernel.org/all/20260526130509.2748441-1-kirill@shutemov.name/
This series is based on the "userfaultfd/pagemap: pre-existing fixes"
series, posted separately; that series carries the pre-existing
Fixes:/Cc: stable@ patches that used to live at the front of v5 (the
four from v5 plus two more surfaced since).
= Changes since v5 =
- Split the pre-existing fixes out into a separate series; this is
rebased on top. (Lorenzo, Andrew)
- Rework the mk_vma_flags() OOB fix to config-gated per-mode masks
(mk_vma_flags_from_masks()); moved to the fixes series. (Lorenzo)
- New prep patch 04/15: convert the userfaultfd_*() helpers to
vma_test_any_mask(). (Lorenzo)
- 08/15: gup_can_follow_protnone() forces the RWP fault only on
accessible VMAs, fixing a FOLL_FORCE loop on a VM_UFFD_RWP VMA that
was mprotect(PROT_NONE)'d.
uffd-unit-tests 113/113, pagemap_ioctl 117/117.
= Problem =
A VMM managing guest memory needs to:
1. detect which pages are still being touched (working-set
tracking);
2. safely evict cold pages to slower tiered or remote storage;
3. fetch them back on demand when accessed again.
= Approach =
UFFDIO_REGISTER_MODE_RWP is a new userfaultfd registration mode, in
parallel with the existing MODE_MISSING / MODE_WP / MODE_MINOR. It
uses the same mechanism on every backing -- anon, shmem, hugetlbfs:
- PAGE_NONE on the PTE (the same primitive NUMA balancing uses)
makes the page inaccessible while keeping it resident;
- the uffd PTE bit (the one MODE_WP already owns) marks the entry
as "userfaultfd-tracked" so the protnone fault path can tell an
RWP fault apart from an mprotect(PROT_NONE) or NUMA hinting
fault.
VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA, so the
same PTE bit safely carries both meanings depending on the
registered VMA flag.
In sync mode, the kernel delivers a UFFD_PAGEFAULT_FLAG_RWP message
to the registered handler, and the handler resolves the fault with
UFFDIO_RWPROTECT clearing MODE_RWP. In async mode
(UFFD_FEATURE_RWP_ASYNC), the fault is auto-resolved in-place: the
kernel restores the original PTE permissions and the faulting thread
continues without a userfaultfd message ever being delivered.
Userspace then learns which pages were touched by reading
PAGE_IS_ACCESSED out of PAGEMAP_SCAN -- pages whose uffd bit is
still set were not re-accessed since the last RWP cycle.
UFFDIO_RWPROTECT is the protect/unprotect ioctl, mirroring
UFFDIO_WRITEPROTECT.
UFFDIO_SET_MODE flips RWP_ASYNC <-> sync at runtime under
mmap_write_lock() + vma_start_write(), so a VMM can run in async
mode for detection and switch to sync for race-free eviction without
re-registering the userfaultfd.
= Typical VMM workflow =
/* arm */
UFFDIO_API(features = RWP | RWP_ASYNC)
UFFDIO_REGISTER(MODE_RWP)
/* detection cycle (async) */
UFFDIO_RWPROTECT(range, RWP)
sleep(interval)
/* freeze the cold snapshot before scanning */
UFFDIO_SET_MODE(disable = RWP_ASYNC) /* sync */
PAGEMAP_SCAN(!PAGE_IS_ACCESSED) -> cold pages
/* eviction (sync mode traps races) */
pwrite(cold) + fallocate(FALLOC_FL_PUNCH_HOLE, cold)
UFFDIO_WAKE(cold)
UFFDIO_SET_MODE(enable = RWP_ASYNC) /* resume */
= Series layout =
Patches 1 to 4 are preparatory:
1: decouple protnone helpers from CONFIG_NUMA_BALANCING.
2-3: rename _PAGE_BIT_UFFD_WP, pte_uffd_wp() and friends to drop
the _WP suffix, since the bit now carries WP and RWP meaning
depending on the VMA flag. The SCAN_PTE_UFFD enum's ftrace
output string is intentionally kept as "pte_uffd_wp" so
trace-based tooling does not silently break.
4: convert the userfaultfd_*() flag helpers to vma_test_any_mask().
Patches 5 to 8 add the in-kernel mechanism:
5: VM_UFFD_RWP VMA flag (aliased to VM_NONE until 09/15 introduces
CONFIG_USERFAULTFD_RWP together with the UAPI).
6: MM_CP_UFFD_RWP change_protection() primitive (PAGE_NONE +
uffd bit, plus a RESOLVE counterpart).
7: marker preservation across swap, device-exclusive, migration,
fork, mremap, UFFDIO_MOVE, hugetlb copy, and mprotect().
8: handle VM_UFFD_RWP in khugepaged, rmap, and GUP.
Patches 9 to 13 wire the userspace surface:
9: UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing.
10: RWP fault delivery; turn the UAPI on.
11: PAGE_IS_ACCESSED in PAGEMAP_SCAN.
12: UFFD_FEATURE_RWP_ASYNC.
13: UFFDIO_SET_MODE runtime sync/async toggle.
Patches 14 to 15 are selftests and documentation.
Kiryl Shutsemau (Meta) (15):
mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
mm: rename uffd-wp PTE bit macros to uffd
mm: rename uffd-wp PTE accessors to uffd
userfaultfd: test uffd VMA flags through the vma_flags_t API
mm: add VM_UFFD_RWP VMA flag
mm: add MM_CP_UFFD_RWP change_protection() flag
mm: preserve RWP marker across PTE rewrites
mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT
plumbing
mm/userfaultfd: add RWP fault delivery and expose
UFFDIO_REGISTER_MODE_RWP
mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
selftests/mm: add userfaultfd RWP tests
Documentation/userfaultfd: document RWP working set tracking
Documentation/admin-guide/mm/pagemap.rst | 13 +-
Documentation/admin-guide/mm/userfaultfd.rst | 253 +++++-
Documentation/filesystems/proc.rst | 1 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable-prot.h | 8 +-
arch/arm64/include/asm/pgtable.h | 47 +-
arch/loongarch/Kconfig | 1 +
arch/loongarch/include/asm/pgtable.h | 4 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 8 +-
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/pgtable-bits.h | 12 +-
arch/riscv/include/asm/pgtable.h | 59 +-
arch/s390/Kconfig | 1 +
arch/s390/include/asm/hugetlb.h | 12 +-
arch/s390/include/asm/pgtable.h | 4 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 56 +-
arch/x86/include/asm/pgtable_types.h | 16 +-
fs/proc/task_mmu.c | 98 ++-
include/asm-generic/hugetlb.h | 18 +-
include/asm-generic/pgtable_uffd.h | 32 +-
include/linux/huge_mm.h | 7 +
include/linux/leafops.h | 4 +-
include/linux/mm.h | 65 +-
include/linux/mm_inline.h | 4 +-
include/linux/pgtable.h | 32 +-
include/linux/swapops.h | 4 +-
include/linux/userfaultfd_k.h | 89 ++-
include/trace/events/huge_memory.h | 2 +-
include/trace/events/mmflags.h | 7 +
include/uapi/linux/fs.h | 1 +
include/uapi/linux/userfaultfd.h | 54 +-
init/Kconfig | 8 +
mm/Kconfig | 9 +
mm/debug_vm_pgtable.c | 4 +-
mm/huge_memory.c | 159 ++--
mm/hugetlb.c | 158 +++-
mm/internal.h | 4 +-
mm/khugepaged.c | 40 +-
mm/memory.c | 135 +++-
mm/migrate.c | 20 +-
mm/migrate_device.c | 8 +-
mm/mprotect.c | 70 +-
mm/mremap.c | 17 +-
mm/page_table_check.c | 8 +-
mm/rmap.c | 18 +-
mm/swapfile.c | 9 +-
mm/userfaultfd.c | 387 +++++++++-
tools/include/uapi/linux/fs.h | 1 +
tools/testing/selftests/mm/uffd-unit-tests.c | 765 +++++++++++++++++++
51 files changed, 2300 insertions(+), 436 deletions(-)
base-commit: 9110948da327947a01830604020e0548c15b96f1
--
2.54.0
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v6 01/15] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 02/15] mm: rename uffd-wp PTE bit macros to uffd Kiryl Shutsemau (Meta)
` (13 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
pte_protnone() and pmd_protnone() detect present-but-inaccessible page
table entries. This capability is useful beyond NUMA balancing -- for
example, userfaultfd working set tracking uses protnone PTEs to track
page access without unmapping pages.
Introduce CONFIG_ARCH_HAS_PTE_PROTNONE to decouple the protnone PTE
infrastructure from CONFIG_NUMA_BALANCING. The six architectures that
support protnone PTEs (x86_64, arm64, powerpc, s390, riscv, loongarch)
now select this option, and CONFIG_NUMA_BALANCING depends on it.
No functional change -- the same set of architectures continues to have
working protnone support, but the infrastructure is now available
independently of NUMA balancing.
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 7 ++---
arch/loongarch/Kconfig | 1 +
arch/loongarch/include/asm/pgtable.h | 4 +--
arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++---
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/pgtable.h | 7 ++---
arch/s390/Kconfig | 1 +
arch/s390/include/asm/pgtable.h | 4 +--
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 8 ++---
include/linux/pgtable.h | 32 ++++++++++++++------
init/Kconfig | 8 +++++
mm/debug_vm_pgtable.c | 4 +--
15 files changed, 52 insertions(+), 36 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..319470b3b1bb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -78,6 +78,7 @@ config ARM64
select ARCH_SUPPORTS_CFI
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+ select ARCH_HAS_PTE_PROTNONE
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_SUPPORTS_PER_VMA_LOCK
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4dfa42b7d053..873f4ea2e288 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -553,10 +553,7 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * See the comment in include/linux/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
/*
@@ -575,7 +572,7 @@ static inline int pmd_protnone(pmd_t pmd)
{
return pte_protnone(pmd_pte(pmd));
}
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#define pmd_present(pmd) pte_present(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 606597da46b8..c085f5067b3b 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -67,6 +67,7 @@ config LOONGARCH
select ARCH_SUPPORTS_LTO_CLANG
select ARCH_SUPPORTS_LTO_CLANG_THIN
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+ select ARCH_HAS_PTE_PROTNONE if 64BIT
select ARCH_SUPPORTS_NUMA_BALANCING if NUMA
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_RT
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 2a0b63ae421f..d295447a2763 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -619,7 +619,7 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline long pte_protnone(pte_t pte)
{
return (pte_val(pte) & _PAGE_PROTNONE);
@@ -629,7 +629,7 @@ static inline long pmd_protnone(pmd_t pmd)
{
return (pmd_val(pmd) & _PAGE_PROTNONE);
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
#define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..53a0c5892548 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -490,13 +490,13 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte)
}
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) ==
cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
static inline bool pte_hw_valid(pte_t pte)
{
@@ -1067,12 +1067,12 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#endif
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pmd_protnone(pmd_t pmd)
{
return pte_protnone(pmd_pte(pmd));
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#define pmd_write(pmd) pte_write(pmd_pte(pmd))
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..36b64a24cf30 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -87,6 +87,7 @@ config PPC_BOOK3S_64
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_SPLIT_PMD_PTLOCK
select ARCH_SUPPORTS_HUGETLBFS
+ select ARCH_HAS_PTE_PROTNONE
select ARCH_SUPPORTS_NUMA_BALANCING
select HAVE_MOVE_PMD
select HAVE_MOVE_PUD
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c5754942cf85..e2c5776d18cf 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -71,6 +71,7 @@ config RISCV
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS if 64BIT && MMU
select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU
select ARCH_SUPPORTS_PER_VMA_LOCK if MMU
+ select ARCH_HAS_PTE_PROTNONE if MMU
select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_SCHED_MC if SMP
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index a1a7c6520a09..48a127323b21 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -524,10 +524,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
PAGE_SIZE)
#endif
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * See the comment in include/asm-generic/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE)) == _PAGE_PROT_NONE;
@@ -537,7 +534,7 @@ static inline int pmd_protnone(pmd_t pmd)
{
return pte_protnone(pmd_pte(pmd));
}
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
/* Modify page protection bits */
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ecbcbb781e40..bc5bef08454b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -151,6 +151,7 @@ config S390
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && CC_IS_CLANG
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+ select ARCH_HAS_PTE_PROTNONE
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_SUPPORTS_PER_VMA_LOCK
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c6cee8241e0..97241dea5573 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -842,7 +842,7 @@ static inline int pte_same(pte_t a, pte_t b)
return pte_val(a) == pte_val(b);
}
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return pte_present(pte) && !(pte_val(pte) & _PAGE_READ);
@@ -853,7 +853,7 @@ static inline int pmd_protnone(pmd_t pmd)
/* pmd_leaf(pmd) implies pmd_present(pmd) */
return pmd_leaf(pmd) && !(pmd_val(pmd) & _SEGMENT_ENTRY_READ);
}
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
static inline bool pte_swp_exclusive(pte_t pte)
{
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3f7cb01d69d..9da1119e8ff6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_PAGE_TABLE_CHECK if X86_64
+ select ARCH_HAS_PTE_PROTNONE if X86_64
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096
select ARCH_SUPPORTS_CFI if X86_64
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2187e9cfcefa..c7f014cbf0a9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -985,11 +985,7 @@ static inline int pmd_present(pmd_t pmd)
return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
}
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * These work without NUMA balancing but the kernel does not care. See the
- * comment in include/linux/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
@@ -1001,7 +997,7 @@ static inline int pmd_protnone(pmd_t pmd)
return (pmd_flags(pmd) & (_PAGE_PROTNONE | _PAGE_PRESENT))
== _PAGE_PROTNONE;
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
static inline int pmd_none(pmd_t pmd)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a..b6516a11adfa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -2052,18 +2052,26 @@ static inline int pud_trans_unstable(pud_t *pud)
return 0;
}
-#ifndef CONFIG_NUMA_BALANCING
+#ifndef CONFIG_ARCH_HAS_PTE_PROTNONE
/*
- * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
- * perfectly valid to indicate "no" in that case, which is why our default
- * implementation defaults to "always no".
+ * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It
+ * is perfectly valid to indicate "no" in that case, which is why our
+ * default implementation defaults to "always no".
*
- * In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE
- * page protection due to NUMA hinting. NUMA hinting faults only apply in
- * accessible VMAs.
+ * In an accessible VMA, pte_protnone() reliably indicates a present
+ * PROT_NONE page protection. Today the kernel uses such PTEs for two
+ * purposes: NUMA hinting faults, and userfaultfd RWP tracking on
+ * VM_UFFD_RWP VMAs. The two are distinguished by the uffd PTE bit and
+ * the VMA flag; see include/linux/userfaultfd_k.h.
*
- * So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault,
- * looking at the VMA accessibility is sufficient.
+ * So, to reliably identify PROT_NONE PTEs that require kernel handling,
+ * looking at the VMA accessibility (and the uffd bit on RWP VMAs) is
+ * sufficient.
+ *
+ * Architectures without CONFIG_ARCH_HAS_PTE_PROTNONE get the always-zero
+ * stubs below; PAGE_NONE references that survive to runtime fire the
+ * BUILD_BUG() fallback, since callers should have folded such paths to
+ * dead code via IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE).
*/
static inline int pte_protnone(pte_t pte)
{
@@ -2074,7 +2082,11 @@ static inline int pmd_protnone(pmd_t pmd)
{
return 0;
}
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifndef PAGE_NONE
+#define PAGE_NONE ({ BUILD_BUG(); (pgprot_t){0}; })
+#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#endif /* CONFIG_MMU */
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..58abb7f19206 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -944,6 +944,13 @@ config SCHED_PROXY_EXEC
endmenu
+#
+# For architectures that support present-but-inaccessible (PROT_NONE) page
+# table entries detectable via pte_protnone() / pmd_protnone():
+#
+config ARCH_HAS_PTE_PROTNONE
+ bool
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
@@ -1010,6 +1017,7 @@ config ARCH_WANT_NUMA_VARIABLE_LOCALITY
config NUMA_BALANCING
bool "Memory placement aware NUMA scheduler"
depends on ARCH_SUPPORTS_NUMA_BALANCING
+ depends on ARCH_HAS_PTE_PROTNONE
depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
depends on SMP && NUMA_MIGRATION && !PREEMPT_RT
help
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 23dc3ee09561..5e9f3a35f924 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -672,7 +672,7 @@ static void __init pte_protnone_tests(struct pgtable_debug_args *args)
{
pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot_none);
- if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE))
return;
pr_debug("Validating PTE protnone\n");
@@ -685,7 +685,7 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
{
pmd_t pmd;
- if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE))
return;
if (!has_transparent_hugepage())
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 02/15] mm: rename uffd-wp PTE bit macros to uffd
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 01/15] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 03/15] mm: rename uffd-wp PTE accessors " Kiryl Shutsemau (Meta)
` (12 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
The uffd-wp PTE bit is about to gain a second consumer: userfaultfd
RWP will use the same bit to mark access-tracking PTEs, distinct
from mprotect(PROT_NONE) or NUMA-hinting PTEs. WP vs RWP semantics
come from the VMA flag; the bit is just "uffd has claimed this
entry." Drop the "_wp" suffix from the arch-private bit macros so
they reflect that.
x86: _PAGE_BIT_UFFD_WP -> _PAGE_BIT_UFFD
_PAGE_UFFD_WP -> _PAGE_UFFD
_PAGE_SWP_UFFD_WP -> _PAGE_SWP_UFFD
arm64: PTE_UFFD_WP -> PTE_UFFD
PTE_SWP_UFFD_WP -> PTE_SWP_UFFD
riscv: _PAGE_UFFD_WP -> _PAGE_UFFD
_PAGE_SWP_UFFD_WP -> _PAGE_SWP_UFFD
Pure mechanical rename -- no behavior change.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
arch/arm64/include/asm/pgtable-prot.h | 8 ++++----
arch/arm64/include/asm/pgtable.h | 12 ++++++------
arch/riscv/include/asm/pgtable-bits.h | 12 ++++++------
arch/riscv/include/asm/pgtable.h | 14 +++++++-------
arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
arch/x86/include/asm/pgtable_types.h | 16 ++++++++--------
6 files changed, 43 insertions(+), 43 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 212ce1b02e15..09d7c00cf405 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -28,11 +28,11 @@
#define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
-#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
+#define PTE_UFFD (_AT(pteval_t, 1) << 58) /* userfaultfd tracking */
+#define PTE_SWP_UFFD (_AT(pteval_t, 1) << 3) /* only for swp ptes */
#else
-#define PTE_UFFD_WP (_AT(pteval_t, 0))
-#define PTE_SWP_UFFD_WP (_AT(pteval_t, 0))
+#define PTE_UFFD (_AT(pteval_t, 0))
+#define PTE_SWP_UFFD (_AT(pteval_t, 0))
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#define _PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 873f4ea2e288..3eecb2c17711 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -343,17 +343,17 @@ static inline pmd_t pmd_mknoncont(pmd_t pmd)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & PTE_UFFD_WP);
+ return !!(pte_val(pte) & PTE_UFFD);
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
- return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD_WP)));
+ return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD)));
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
- return clear_pte_bit(pte, __pgprot(PTE_UFFD_WP));
+ return clear_pte_bit(pte, __pgprot(PTE_UFFD));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -539,17 +539,17 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
- return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+ return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
static inline int pte_swp_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & PTE_SWP_UFFD_WP);
+ return !!(pte_val(pte) & PTE_SWP_UFFD);
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
- return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+ return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index b422d9691e60..d5a86b4df3ce 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -40,20 +40,20 @@
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-/* ext_svrsw60t59b: Bit(60) for uffd-wp tracking */
-#define _PAGE_UFFD_WP \
+/* ext_svrsw60t59b: Bit(60) for userfaultfd tracking */
+#define _PAGE_UFFD \
((riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)) ? \
(1UL << 60) : 0)
/*
* Bit 4 is not involved into swap entry computation, so we
- * can borrow it for swap page uffd-wp tracking.
+ * can borrow it for swap page userfaultfd tracking.
*/
-#define _PAGE_SWP_UFFD_WP \
+#define _PAGE_SWP_UFFD \
((riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)) ? \
_PAGE_USER : 0)
#else
-#define _PAGE_UFFD_WP 0
-#define _PAGE_SWP_UFFD_WP 0
+#define _PAGE_UFFD 0
+#define _PAGE_SWP_UFFD 0
#endif
#define _PAGE_TABLE _PAGE_PRESENT
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 48a127323b21..ca69948b3ed8 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -405,32 +405,32 @@ static inline pte_t pte_wrprotect(pte_t pte)
static inline bool pte_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & _PAGE_UFFD_WP);
+ return !!(pte_val(pte) & _PAGE_UFFD);
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
- return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD_WP));
+ return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD));
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_UFFD_WP));
+ return __pte(pte_val(pte) & ~(_PAGE_UFFD));
}
static inline bool pte_swp_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & _PAGE_SWP_UFFD_WP);
+ return !!(pte_val(pte) & _PAGE_SWP_UFFD);
}
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_SWP_UFFD_WP);
+ return __pte(pte_val(pte) | _PAGE_SWP_UFFD);
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD_WP));
+ return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -1157,7 +1157,7 @@ static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
* bit 0: _PAGE_PRESENT (zero)
* bit 1 to 2: (zero)
* bit 3: _PAGE_SWP_SOFT_DIRTY
- * bit 4: _PAGE_SWP_UFFD_WP
+ * bit 4: _PAGE_SWP_UFFD
* bit 5: _PAGE_PROT_NONE (zero)
* bit 6: exclusive marker
* bits 7 to 11: swap type
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c7f014cbf0a9..038c806b50a2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -413,17 +413,17 @@ static inline pte_t pte_wrprotect(pte_t pte)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
- return pte_flags(pte) & _PAGE_UFFD_WP;
+ return pte_flags(pte) & _PAGE_UFFD;
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
- return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
+ return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD));
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_UFFD_WP);
+ return pte_clear_flags(pte, _PAGE_UFFD);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -528,17 +528,17 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pmd_uffd_wp(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_UFFD_WP;
+ return pmd_flags(pmd) & _PAGE_UFFD;
}
static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
{
- return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD_WP));
+ return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD));
}
static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+ return pmd_clear_flags(pmd, _PAGE_UFFD);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -1550,32 +1550,32 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+ return pte_set_flags(pte, _PAGE_SWP_UFFD);
}
static inline int pte_swp_uffd_wp(pte_t pte)
{
- return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+ return pte_flags(pte) & _PAGE_SWP_UFFD;
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+ return pte_clear_flags(pte, _PAGE_SWP_UFFD);
}
static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+ return pmd_set_flags(pmd, _PAGE_SWP_UFFD);
}
static inline int pmd_swp_uffd_wp(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+ return pmd_flags(pmd) & _PAGE_SWP_UFFD;
}
static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+ return pmd_clear_flags(pmd, _PAGE_SWP_UFFD);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 2ec250ba467e..af08d98be930 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -31,7 +31,7 @@
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
-#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
+#define _PAGE_BIT_UFFD _PAGE_BIT_SOFTW2 /* userfaultfd tracking */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_KERNEL_4K _PAGE_BIT_SOFTW3 /* page must not be converted to large */
@@ -39,7 +39,7 @@
#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
#define _PAGE_BIT_NOPTISHADOW _PAGE_BIT_SOFTW5 /* No PTI shadow (root PGD) */
#else
-/* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */
+/* Shared with _PAGE_BIT_UFFD which is not supported on 32 bit */
#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW2 /* Saved Dirty bit (leaf) */
#define _PAGE_BIT_NOPTISHADOW _PAGE_BIT_SOFTW2 /* No PTI shadow (root PGD) */
#endif
@@ -111,11 +111,11 @@
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
-#define _PAGE_SWP_UFFD_WP _PAGE_USER
+#define _PAGE_UFFD (_AT(pteval_t, 1) << _PAGE_BIT_UFFD)
+#define _PAGE_SWP_UFFD _PAGE_USER
#else
-#define _PAGE_UFFD_WP (_AT(pteval_t, 0))
-#define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0))
+#define _PAGE_UFFD (_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD (_AT(pteval_t, 0))
#endif
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
@@ -129,7 +129,7 @@
/*
* The hardware requires shadow stack to be Write=0,Dirty=1. However,
* there are valid cases where the kernel might create read-only PTEs that
- * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In
+ * are dirty (e.g., fork(), mprotect(), userfaultfd, soft-dirty tracking). In
* this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit,
* to avoid creating a wrong "shadow stack" PTEs. Such PTEs have
* (Write=0,SavedDirty=1,Dirty=0) set.
@@ -151,7 +151,7 @@
#define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | \
_PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \
- _PAGE_CC | _PAGE_UFFD_WP)
+ _PAGE_CC | _PAGE_UFFD)
#define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
#define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 03/15] mm: rename uffd-wp PTE accessors to uffd
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 01/15] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 02/15] mm: rename uffd-wp PTE bit macros to uffd Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API Kiryl Shutsemau (Meta)
` (11 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Userfaultfd RWP will reuse the uffd-wp PTE bit to mark access-tracking
PTEs, alongside the write-protected ones it already marks. The bit's
meaning now depends on the VMA flag (WP or RWP), not on its name.
Rename the kernel-internal names that describe the bit:
- pte/pmd/huge_pte accessors (and swap variants)
- pgtable_supports_uffd() capability query
- SCAN_PTE_UFFD khugepaged enum
The ftrace string emitted by mm_khugepaged_scan_pmd for this enum is
kept as "pte_uffd_wp" so existing trace-based tooling keeps matching.
Pure mechanical rename -- no behavior change.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
arch/arm64/include/asm/pgtable.h | 28 +++++++--------
arch/riscv/include/asm/pgtable.h | 38 ++++++++++----------
arch/s390/include/asm/hugetlb.h | 12 +++----
arch/x86/include/asm/pgtable.h | 24 ++++++-------
fs/proc/task_mmu.c | 44 +++++++++++------------
include/asm-generic/hugetlb.h | 18 +++++-----
include/asm-generic/pgtable_uffd.h | 32 ++++++++---------
include/linux/leafops.h | 4 +--
include/linux/mm_inline.h | 4 +--
include/linux/swapops.h | 4 +--
include/linux/userfaultfd_k.h | 14 ++++----
include/trace/events/huge_memory.h | 2 +-
mm/huge_memory.c | 56 +++++++++++++++---------------
mm/hugetlb.c | 46 ++++++++++++------------
mm/internal.h | 4 +--
mm/khugepaged.c | 22 ++++++------
mm/memory.c | 34 +++++++++---------
mm/migrate.c | 12 +++----
mm/migrate_device.c | 8 ++---
mm/mprotect.c | 12 +++----
mm/mremap.c | 4 +--
mm/page_table_check.c | 8 ++---
mm/rmap.c | 16 ++++-----
mm/swapfile.c | 4 +--
mm/userfaultfd.c | 6 ++--
25 files changed, 228 insertions(+), 228 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3eecb2c17711..c41e4d59dc9f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,17 +341,17 @@ static inline pmd_t pmd_mknoncont(pmd_t pmd)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pte_uffd_wp(pte_t pte)
+static inline int pte_uffd(pte_t pte)
{
return !!(pte_val(pte) & PTE_UFFD);
}
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
{
return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD)));
}
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
{
return clear_pte_bit(pte, __pgprot(PTE_UFFD));
}
@@ -537,17 +537,17 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
{
return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
-static inline int pte_swp_uffd_wp(pte_t pte)
+static inline int pte_swp_uffd(pte_t pte)
{
return !!(pte_val(pte) & PTE_SWP_UFFD);
}
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
@@ -590,13 +590,13 @@ static inline int pmd_protnone(pmd_t pmd)
#define pmd_mkvalid_k(pmd) pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
#define pmd_mkinvalid(pmd) pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd))
-#define pmd_mkuffd_wp(pmd) pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)))
-#define pmd_clear_uffd_wp(pmd) pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)))
-#define pmd_swp_uffd_wp(pmd) pte_swp_uffd_wp(pmd_pte(pmd))
-#define pmd_swp_mkuffd_wp(pmd) pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)))
-#define pmd_swp_clear_uffd_wp(pmd) \
- pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
+#define pmd_uffd(pmd) pte_uffd(pmd_pte(pmd))
+#define pmd_mkuffd(pmd) pte_pmd(pte_mkuffd(pmd_pte(pmd)))
+#define pmd_clear_uffd(pmd) pte_pmd(pte_clear_uffd(pmd_pte(pmd)))
+#define pmd_swp_uffd(pmd) pte_swp_uffd(pmd_pte(pmd))
+#define pmd_swp_mkuffd(pmd) pte_pmd(pte_swp_mkuffd(pmd_pte(pmd)))
+#define pmd_swp_clear_uffd(pmd) \
+ pte_pmd(pte_swp_clear_uffd(pmd_pte(pmd)))
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#define pmd_write(pmd) pte_write(pmd_pte(pmd))
@@ -1512,7 +1512,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
* Encode and decode a swap entry:
* bits 0-1: present (must be zero)
* bits 2: remember PG_anon_exclusive
- * bit 3: remember uffd-wp state
+ * bit 3: remember uffd state
* bits 6-10: swap type
* bit 11: PTE_PRESENT_INVALID (must be zero)
* bits 12-61: swap offset
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ca69948b3ed8..b111e134795e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -400,35 +400,35 @@ static inline pte_t pte_wrprotect(pte_t pte)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define pgtable_supports_uffd_wp() \
+#define pgtable_supports_uffd() \
riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)
-static inline bool pte_uffd_wp(pte_t pte)
+static inline bool pte_uffd(pte_t pte)
{
return !!(pte_val(pte) & _PAGE_UFFD);
}
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
{
return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD));
}
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
{
return __pte(pte_val(pte) & ~(_PAGE_UFFD));
}
-static inline bool pte_swp_uffd_wp(pte_t pte)
+static inline bool pte_swp_uffd(pte_t pte)
{
return !!(pte_val(pte) & _PAGE_SWP_UFFD);
}
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_SWP_UFFD);
}
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD));
}
@@ -886,34 +886,34 @@ static inline pud_t pud_mkspecial(pud_t pud)
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline bool pmd_uffd_wp(pmd_t pmd)
+static inline bool pmd_uffd(pmd_t pmd)
{
- return pte_uffd_wp(pmd_pte(pmd));
+ return pte_uffd(pmd_pte(pmd));
}
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_mkuffd(pmd_t pmd)
{
- return pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_mkuffd(pmd_pte(pmd)));
}
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_clear_uffd(pmd_t pmd)
{
- return pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_clear_uffd(pmd_pte(pmd)));
}
-static inline bool pmd_swp_uffd_wp(pmd_t pmd)
+static inline bool pmd_swp_uffd(pmd_t pmd)
{
- return pte_swp_uffd_wp(pmd_pte(pmd));
+ return pte_swp_uffd(pmd_pte(pmd));
}
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
{
- return pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_swp_mkuffd(pmd_pte(pmd)));
}
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
{
- return pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_swp_clear_uffd(pmd_pte(pmd)));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 6983e52eaf81..cf8a176ff3d8 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -77,20 +77,20 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
__set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
}
-#define __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
-static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_MKUFFD
+static inline pte_t huge_pte_mkuffd(pte_t pte)
{
return pte;
}
-#define __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
-static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD
+static inline pte_t huge_pte_clear_uffd(pte_t pte)
{
return pte;
}
-#define __HAVE_ARCH_HUGE_PTE_UFFD_WP
-static inline int huge_pte_uffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_UFFD
+static inline int huge_pte_uffd(pte_t pte)
{
return 0;
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 038c806b50a2..d14c84b2a332 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -411,17 +411,17 @@ static inline pte_t pte_wrprotect(pte_t pte)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pte_uffd_wp(pte_t pte)
+static inline int pte_uffd(pte_t pte)
{
return pte_flags(pte) & _PAGE_UFFD;
}
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
{
return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD));
}
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_UFFD);
}
@@ -526,17 +526,17 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pmd_uffd_wp(pmd_t pmd)
+static inline int pmd_uffd(pmd_t pmd)
{
return pmd_flags(pmd) & _PAGE_UFFD;
}
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_mkuffd(pmd_t pmd)
{
return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD));
}
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_clear_uffd(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_UFFD);
}
@@ -1548,32 +1548,32 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_UFFD);
}
-static inline int pte_swp_uffd_wp(pte_t pte)
+static inline int pte_swp_uffd(pte_t pte)
{
return pte_flags(pte) & _PAGE_SWP_UFFD;
}
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_UFFD);
}
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_SWP_UFFD);
}
-static inline int pmd_swp_uffd_wp(pmd_t pmd)
+static inline int pmd_swp_uffd(pmd_t pmd)
{
return pmd_flags(pmd) & _PAGE_SWP_UFFD;
}
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_SWP_UFFD);
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 06fb94a965ff..939657aa334a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2035,14 +2035,14 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
- if (pte_uffd_wp(pte))
+ if (pte_uffd(pte))
flags |= PM_UFFD_WP;
} else {
softleaf_t entry;
if (pte_swp_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
- if (pte_swp_uffd_wp(pte))
+ if (pte_swp_uffd(pte))
flags |= PM_UFFD_WP;
entry = softleaf_from_pte(pte);
if (pm->show_pfn) {
@@ -2108,7 +2108,7 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_PRESENT;
if (pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- if (pmd_uffd_wp(pmd))
+ if (pmd_uffd(pmd))
flags |= PM_UFFD_WP;
if (pm->show_pfn)
frame = pmd_pfn(pmd) + idx;
@@ -2127,7 +2127,7 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_SWAP;
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- if (pmd_swp_uffd_wp(pmd))
+ if (pmd_swp_uffd(pmd))
flags |= PM_UFFD_WP;
VM_WARN_ON_ONCE(!pmd_is_migration_entry(pmd));
page = softleaf_to_page(entry);
@@ -2233,14 +2233,14 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
!hugetlb_pmd_shared(ptep))
flags |= PM_MMAP_EXCLUSIVE;
- if (huge_pte_uffd_wp(pte))
+ if (huge_pte_uffd(pte))
flags |= PM_UFFD_WP;
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pte_pfn(pte) +
((addr & ~hmask) >> PAGE_SHIFT);
- } else if (pte_swp_uffd_wp_any(pte)) {
+ } else if (pte_swp_uffd_any(pte)) {
flags |= PM_UFFD_WP;
}
@@ -2441,7 +2441,7 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_PRESENT;
- if (!pte_uffd_wp(pte))
+ if (!pte_uffd(pte))
categories |= PAGE_IS_WRITTEN;
if (p->masks_of_interest & PAGE_IS_FILE) {
@@ -2459,7 +2459,7 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_wp_any(pte))
+ if (!pte_swp_uffd_any(pte))
categories |= PAGE_IS_WRITTEN;
entry = softleaf_from_pte(pte);
@@ -2484,13 +2484,13 @@ static void make_uffd_wp_pte(struct vm_area_struct *vma,
pte_t old_pte;
old_pte = ptep_modify_prot_start(vma, addr, pte);
- ptent = pte_mkuffd_wp(old_pte);
+ ptent = pte_mkuffd(old_pte);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
} else if (pte_none(ptent)) {
set_pte_at(vma->vm_mm, addr, pte,
make_pte_marker(PTE_MARKER_UFFD_WP));
} else {
- ptent = pte_swp_mkuffd_wp(ptent);
+ ptent = pte_swp_mkuffd(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
}
}
@@ -2509,7 +2509,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
struct page *page;
categories |= PAGE_IS_PRESENT;
- if (!pmd_uffd_wp(pmd))
+ if (!pmd_uffd(pmd))
categories |= PAGE_IS_WRITTEN;
if (p->masks_of_interest & PAGE_IS_FILE) {
@@ -2524,7 +2524,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
categories |= PAGE_IS_SOFT_DIRTY;
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pmd_swp_uffd_wp(pmd))
+ if (!pmd_swp_uffd(pmd))
categories |= PAGE_IS_WRITTEN;
if (pmd_swp_soft_dirty(pmd))
categories |= PAGE_IS_SOFT_DIRTY;
@@ -2548,10 +2548,10 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
if (pmd_present(pmd)) {
old = pmdp_invalidate_ad(vma, addr, pmdp);
- pmd = pmd_mkuffd_wp(old);
+ pmd = pmd_mkuffd(old);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
} else if (pmd_is_migration_entry(pmd)) {
- pmd = pmd_swp_mkuffd_wp(pmd);
+ pmd = pmd_swp_mkuffd(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
}
@@ -2573,7 +2573,7 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
if (pte_present(pte)) {
categories |= PAGE_IS_PRESENT;
- if (!huge_pte_uffd_wp(pte))
+ if (!huge_pte_uffd(pte))
categories |= PAGE_IS_WRITTEN;
if (!PageAnon(pte_page(pte)))
categories |= PAGE_IS_FILE;
@@ -2584,7 +2584,7 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_wp_any(pte))
+ if (!pte_swp_uffd_any(pte))
categories |= PAGE_IS_WRITTEN;
if (pte_swp_soft_dirty(pte))
categories |= PAGE_IS_SOFT_DIRTY;
@@ -2612,12 +2612,12 @@ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
if (softleaf_is_migration(entry)) {
set_huge_pte_at(vma->vm_mm, addr, ptep,
- pte_swp_mkuffd_wp(ptent), psize);
+ pte_swp_mkuffd(ptent), psize);
} else {
pte_t old_pte, new_pte;
old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
- new_pte = huge_pte_mkuffd_wp(old_pte);
+ new_pte = huge_pte_mkuffd(old_pte);
huge_ptep_modify_prot_commit(vma, addr, ptep, old_pte, new_pte);
}
}
@@ -2850,8 +2850,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
pte_t ptent = ptep_get(pte);
- if ((pte_present(ptent) && pte_uffd_wp(ptent)) ||
- pte_swp_uffd_wp_any(ptent))
+ if ((pte_present(ptent) && pte_uffd(ptent)) ||
+ pte_swp_uffd_any(ptent))
continue;
make_uffd_wp_pte(vma, addr, pte, ptent);
if (!flush_end)
@@ -2868,8 +2868,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long next = addr + PAGE_SIZE;
pte_t ptent = ptep_get(pte);
- if ((pte_present(ptent) && pte_uffd_wp(ptent)) ||
- pte_swp_uffd_wp_any(ptent))
+ if ((pte_present(ptent) && pte_uffd(ptent)) ||
+ pte_swp_uffd_any(ptent))
continue;
ret = pagemap_scan_output(p->cur_vma_category | PAGE_IS_WRITTEN,
p, addr, &next);
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index e1a2e1b7c8e7..635c41cc3479 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -37,24 +37,24 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
}
-#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
-static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD
+static inline pte_t huge_pte_mkuffd(pte_t pte)
{
- return huge_pte_wrprotect(pte_mkuffd_wp(pte));
+ return huge_pte_wrprotect(pte_mkuffd(pte));
}
#endif
-#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
-static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD
+static inline pte_t huge_pte_clear_uffd(pte_t pte)
{
- return pte_clear_uffd_wp(pte);
+ return pte_clear_uffd(pte);
}
#endif
-#ifndef __HAVE_ARCH_HUGE_PTE_UFFD_WP
-static inline int huge_pte_uffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_UFFD
+static inline int huge_pte_uffd(pte_t pte)
{
- return pte_uffd_wp(pte);
+ return pte_uffd(pte);
}
#endif
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 0d85791efdf7..30e88fc1de2f 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,79 +2,79 @@
#define _ASM_GENERIC_PGTABLE_UFFD_H
/*
- * Some platforms can customize the uffd-wp bit, making it unavailable
+ * Some platforms can customize the uffd PTE bit, making it unavailable
* even if the architecture provides the resource.
* Adding this API allows architectures to add their own checks for the
* devices on which the kernel is running.
* Note: When overriding it, please make sure the
* CONFIG_HAVE_ARCH_USERFAULTFD_WP is part of this macro.
*/
-#ifndef pgtable_supports_uffd_wp
-#define pgtable_supports_uffd_wp() IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP)
+#ifndef pgtable_supports_uffd
+#define pgtable_supports_uffd() IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP)
#endif
static inline bool uffd_supports_wp_marker(void)
{
- return pgtable_supports_uffd_wp() && IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP);
+ return pgtable_supports_uffd() && IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP);
}
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static __always_inline int pte_uffd_wp(pte_t pte)
+static __always_inline int pte_uffd(pte_t pte)
{
return 0;
}
-static __always_inline int pmd_uffd_wp(pmd_t pmd)
+static __always_inline int pmd_uffd(pmd_t pmd)
{
return 0;
}
-static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+static __always_inline pte_t pte_mkuffd(pte_t pte)
{
return pte;
}
-static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static __always_inline pmd_t pmd_mkuffd(pmd_t pmd)
{
return pmd;
}
-static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+static __always_inline pte_t pte_clear_uffd(pte_t pte)
{
return pte;
}
-static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static __always_inline pmd_t pmd_clear_uffd(pmd_t pmd)
{
return pmd;
}
-static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static __always_inline pte_t pte_swp_mkuffd(pte_t pte)
{
return pte;
}
-static __always_inline int pte_swp_uffd_wp(pte_t pte)
+static __always_inline int pte_swp_uffd(pte_t pte)
{
return 0;
}
-static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static __always_inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return pte;
}
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
{
return pmd;
}
-static inline int pmd_swp_uffd_wp(pmd_t pmd)
+static inline int pmd_swp_uffd(pmd_t pmd)
{
return 0;
}
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
{
return pmd;
}
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 992cd8bd8ed0..2ce2f37ac883 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -100,8 +100,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
if (pmd_swp_soft_dirty(pmd))
pmd = pmd_swp_clear_soft_dirty(pmd);
- if (pmd_swp_uffd_wp(pmd))
- pmd = pmd_swp_clear_uffd_wp(pmd);
+ if (pmd_swp_uffd(pmd))
+ pmd = pmd_swp_clear_uffd(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
/* Temporary until swp_entry_t eliminated. */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f0..2811caf4188d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -600,14 +600,14 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
return false;
/* A uffd-wp wr-protected normal pte */
- if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ if (unlikely(pte_present(pteval) && pte_uffd(pteval)))
arm_uffd_pte = true;
/*
* A uffd-wp wr-protected swap pte. Note: this should even cover an
* existing pte marker with uffd-wp bit set.
*/
- if (unlikely(pte_swp_uffd_wp_any(pteval)))
+ if (unlikely(pte_swp_uffd_any(pteval)))
arm_uffd_pte = true;
if (unlikely(arm_uffd_pte)) {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8cfc966eae48..15c6440e38dd 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -73,8 +73,8 @@ static inline pte_t pte_swp_clear_flags(pte_t pte)
pte = pte_swp_clear_exclusive(pte);
if (pte_swp_soft_dirty(pte))
pte = pte_swp_clear_soft_dirty(pte);
- if (pte_swp_uffd_wp(pte))
- pte = pte_swp_clear_uffd_wp(pte);
+ if (pte_swp_uffd(pte))
+ pte = pte_swp_clear_uffd(pte);
return pte;
}
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 68edac4dcd78..658740df2978 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -211,13 +211,13 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
- return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+ return userfaultfd_wp(vma) && pte_uffd(pte);
}
static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
pmd_t pmd)
{
- return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+ return userfaultfd_wp(vma) && pmd_uffd(pmd);
}
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
@@ -272,10 +272,10 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
}
/*
- * Returns true if this is a swap pte and was uffd-wp wr-protected in either
- * forms (pte marker or a normal swap pte), false otherwise.
+ * Returns true if this swap pte carries uffd-tracked state in either
+ * form (pte marker or a normal swap pte), false otherwise.
*/
-static inline bool pte_swp_uffd_wp_any(pte_t pte)
+static inline bool pte_swp_uffd_any(pte_t pte)
{
if (!uffd_supports_wp_marker())
return false;
@@ -283,7 +283,7 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte)
if (pte_present(pte))
return false;
- if (pte_swp_uffd_wp(pte))
+ if (pte_swp_uffd(pte))
return true;
if (pte_is_uffd_wp_marker(pte))
@@ -424,7 +424,7 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
* Returns true if this is a swap pte and was uffd-wp wr-protected in either
* forms (pte marker or a normal swap pte), false otherwise.
*/
-static inline bool pte_swp_uffd_wp_any(pte_t pte)
+static inline bool pte_swp_uffd_any(pte_t pte)
{
return false;
}
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 291fae364c62..5a48c5406cce 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -16,7 +16,7 @@
EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
- EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
+ EM( SCAN_PTE_UFFD, "pte_uffd_wp") \
EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
EM( SCAN_PAGE_NULL, "page_null") \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7c895b1d366..d43c2255f47d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1909,8 +1909,8 @@ static void copy_huge_non_present_pmd(
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
- if (pmd_swp_uffd_wp(*src_pmd))
- pmd = pmd_swp_mkuffd_wp(pmd);
+ if (pmd_swp_uffd(*src_pmd))
+ pmd = pmd_swp_mkuffd(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
} else if (softleaf_is_device_private(entry)) {
/*
@@ -1923,8 +1923,8 @@ static void copy_huge_non_present_pmd(
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
- if (pmd_swp_uffd_wp(*src_pmd))
- pmd = pmd_swp_mkuffd_wp(pmd);
+ if (pmd_swp_uffd(*src_pmd))
+ pmd = pmd_swp_mkuffd(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
@@ -1944,7 +1944,7 @@ static void copy_huge_non_present_pmd(
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
if (!userfaultfd_wp(dst_vma))
- pmd = pmd_swp_clear_uffd_wp(pmd);
+ pmd = pmd_swp_clear_uffd(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
}
@@ -2040,7 +2040,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
pmdp_set_wrprotect(src_mm, addr, src_pmd);
if (!userfaultfd_wp(dst_vma))
- pmd = pmd_clear_uffd_wp(pmd);
+ pmd = pmd_clear_uffd(pmd);
pmd = pmd_wrprotect(pmd);
set_pmd:
pmd = pmd_mkold(pmd);
@@ -2581,9 +2581,9 @@ static pmd_t clear_uffd_wp_pmd(pmd_t pmd)
if (pmd_none(pmd))
return pmd;
if (pmd_present(pmd))
- pmd = pmd_clear_uffd_wp(pmd);
+ pmd = pmd_clear_uffd(pmd);
else
- pmd = pmd_swp_clear_uffd_wp(pmd);
+ pmd = pmd_swp_clear_uffd(pmd);
return pmd;
}
@@ -2663,16 +2663,16 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
} else if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
newpmd = swp_entry_to_pmd(entry);
- if (pmd_swp_uffd_wp(*pmd))
- newpmd = pmd_swp_mkuffd_wp(newpmd);
+ if (pmd_swp_uffd(*pmd))
+ newpmd = pmd_swp_mkuffd(newpmd);
} else {
newpmd = *pmd;
}
if (uffd_wp)
- newpmd = pmd_swp_mkuffd_wp(newpmd);
+ newpmd = pmd_swp_mkuffd(newpmd);
else if (uffd_wp_resolve)
- newpmd = pmd_swp_clear_uffd_wp(newpmd);
+ newpmd = pmd_swp_clear_uffd(newpmd);
if (!pmd_same(*pmd, newpmd))
set_pmd_at(mm, addr, pmd, newpmd);
}
@@ -2753,14 +2753,14 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
entry = pmd_modify(oldpmd, newprot);
if (uffd_wp)
- entry = pmd_mkuffd_wp(entry);
+ entry = pmd_mkuffd(entry);
else if (uffd_wp_resolve)
/*
* Leave the write bit to be handled by PF interrupt
* handler, then things like COW could be properly
* handled.
*/
- entry = pmd_clear_uffd_wp(entry);
+ entry = pmd_clear_uffd(entry);
/* See change_pte_range(). */
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
@@ -3103,8 +3103,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
entry = pfn_pte(zero_pfn(addr), vma->vm_page_prot);
entry = pte_mkspecial(entry);
- if (pmd_uffd_wp(old_pmd))
- entry = pte_mkuffd_wp(entry);
+ if (pmd_uffd(old_pmd))
+ entry = pte_mkuffd(entry);
VM_BUG_ON(!pte_none(ptep_get(pte)));
set_pte_at(mm, addr, pte, entry);
pte++;
@@ -3188,7 +3188,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
folio = page_folio(page);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
- uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ uffd_wp = pmd_swp_uffd(old_pmd);
write = softleaf_is_migration_write(entry);
if (PageAnon(page))
@@ -3204,7 +3204,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
folio = page_folio(page);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
- uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ uffd_wp = pmd_swp_uffd(old_pmd);
write = softleaf_is_device_private_write(entry);
anon_exclusive = PageAnonExclusive(page);
@@ -3261,7 +3261,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
- uffd_wp = pmd_uffd_wp(old_pmd);
+ uffd_wp = pmd_uffd(old_pmd);
VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
@@ -3332,7 +3332,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
+ entry = pte_swp_mkuffd(entry);
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3359,7 +3359,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
+ entry = pte_swp_mkuffd(entry);
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3377,7 +3377,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
for (i = 0; i < HPAGE_PMD_NR; i++)
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
@@ -5018,8 +5018,8 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
- if (pmd_uffd_wp(pmdval))
- pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ if (pmd_uffd(pmdval))
+ pmdswp = pmd_swp_mkuffd(pmdswp);
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
folio_remove_rmap_pmd(folio, page, vma);
folio_put(folio);
@@ -5049,8 +5049,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mksoft_dirty(pmde);
if (softleaf_is_migration_write(entry))
pmde = pmd_mkwrite(pmde, vma);
- if (pmd_swp_uffd_wp(*pvmw->pmd))
- pmde = pmd_mkuffd_wp(pmde);
+ if (pmd_swp_uffd(*pvmw->pmd))
+ pmde = pmd_mkuffd(pmde);
if (!softleaf_is_migration_young(entry))
pmde = pmd_mkold(pmde);
/* NOTE: this may contain setting soft-dirty on some archs */
@@ -5070,8 +5070,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_swp_mksoft_dirty(pmde);
- if (pmd_swp_uffd_wp(*pvmw->pmd))
- pmde = pmd_swp_mkuffd_wp(pmde);
+ if (pmd_swp_uffd(*pvmw->pmd))
+ pmde = pmd_swp_mkuffd(pmde);
}
if (folio_test_anon(folio)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 571212b80835..d0c81a056ae2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4843,8 +4843,8 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long add
__folio_mark_uptodate(new_folio);
hugetlb_add_new_anon_rmap(new_folio, vma, addr);
- if (userfaultfd_wp(vma) && huge_pte_uffd_wp(old))
- newpte = huge_pte_mkuffd_wp(newpte);
+ if (userfaultfd_wp(vma) && huge_pte_uffd(old))
+ newpte = huge_pte_mkuffd(newpte);
set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz);
hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
folio_set_hugetlb_migratable(new_folio);
@@ -4918,10 +4918,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = softleaf_from_pte(entry);
if (unlikely(softleaf_is_hwpoison(softleaf))) {
if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd_wp(entry);
+ entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(softleaf_is_migration(softleaf))) {
- bool uffd_wp = pte_swp_uffd_wp(entry);
+ bool uffd = pte_swp_uffd(entry);
if (!softleaf_is_migration_read(softleaf) && cow) {
/*
@@ -4931,12 +4931,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = make_readable_migration_entry(
swp_offset(softleaf));
entry = swp_entry_to_pte(softleaf);
- if (userfaultfd_wp(src_vma) && uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
+ if (userfaultfd_wp(src_vma) && uffd)
+ entry = pte_swp_mkuffd(entry);
set_huge_pte_at(src, addr, src_pte, entry, sz);
}
if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd_wp(entry);
+ entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(pte_is_marker(entry))) {
const pte_marker marker = copy_pte_marker(softleaf, dst_vma);
@@ -5013,7 +5013,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd_wp(entry);
+ entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
hugetlb_count_add(npages, dst);
@@ -5061,9 +5061,9 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
} else {
if (need_clear_uffd_wp) {
if (pte_present(pte))
- pte = huge_pte_clear_uffd_wp(pte);
+ pte = huge_pte_clear_uffd(pte);
else
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_uffd(pte);
}
set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
}
@@ -5197,7 +5197,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
* drop the uffd-wp bit in this zap, then replace the
* pte with a marker.
*/
- if (pte_swp_uffd_wp_any(pte) &&
+ if (pte_swp_uffd_any(pte) &&
!(zap_flags & ZAP_FLAG_DROP_MARKER))
set_huge_pte_at(mm, address, ptep,
make_pte_marker(PTE_MARKER_UFFD_WP),
@@ -5233,7 +5233,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
folio_mark_dirty(folio);
/* Leave a uffd-wp pte marker if needed */
- if (huge_pte_uffd_wp(pte) &&
+ if (huge_pte_uffd(pte) &&
!(zap_flags & ZAP_FLAG_DROP_MARKER))
set_huge_pte_at(mm, address, ptep,
make_pte_marker(PTE_MARKER_UFFD_WP),
@@ -5437,7 +5437,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
* can trigger this, because hugetlb_fault() will always resolve
* uffd-wp bit first.
*/
- if (!unshare && huge_pte_uffd_wp(pte))
+ if (!unshare && huge_pte_uffd(pte))
return 0;
/* Let's take out MAP_SHARED mappings first. */
@@ -5581,8 +5581,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
huge_ptep_clear_flush(vma, vmf->address, vmf->pte);
hugetlb_remove_rmap(old_folio);
hugetlb_add_new_anon_rmap(new_folio, vma, vmf->address);
- if (huge_pte_uffd_wp(pte))
- newpte = huge_pte_mkuffd_wp(newpte);
+ if (huge_pte_uffd(pte))
+ newpte = huge_pte_mkuffd(newpte);
set_huge_pte_at(mm, vmf->address, vmf->pte, newpte,
huge_page_size(h));
folio_set_hugetlb_migratable(new_folio);
@@ -5860,7 +5860,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
* if populated.
*/
if (unlikely(pte_is_uffd_wp_marker(vmf->orig_pte)))
- new_pte = huge_pte_mkuffd_wp(new_pte);
+ new_pte = huge_pte_mkuffd(new_pte);
set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h));
hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -6058,7 +6058,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_ptl;
/* Handle userfault-wp first, before trying to lock more pages */
- if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(mm, vmf.address, vmf.pte)) &&
+ if (userfaultfd_wp(vma) && huge_pte_uffd(huge_ptep_get(mm, vmf.address, vmf.pte)) &&
(flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
if (!userfaultfd_wp_async(vma)) {
spin_unlock(vmf.ptl);
@@ -6067,7 +6067,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_userfault(&vmf, VM_UFFD_WP);
}
- vmf.orig_pte = huge_pte_clear_uffd_wp(vmf.orig_pte);
+ vmf.orig_pte = huge_pte_clear_uffd(vmf.orig_pte);
set_huge_pte_at(mm, vmf.address, vmf.pte, vmf.orig_pte,
huge_page_size(hstate_vma(vma)));
/* Fallthrough to CoW */
@@ -6352,7 +6352,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
_dst_pte = pte_mkyoung(_dst_pte);
if (wp_enabled)
- _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+ _dst_pte = huge_pte_mkuffd(_dst_pte);
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, size);
@@ -6476,9 +6476,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
}
if (uffd_wp)
- newpte = pte_swp_mkuffd_wp(newpte);
+ newpte = pte_swp_mkuffd(newpte);
else if (uffd_wp_resolve)
- newpte = pte_swp_clear_uffd_wp(newpte);
+ newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(pte, newpte))
set_huge_pte_at(mm, address, ptep, newpte, psize);
} else if (unlikely(pte_is_marker(pte))) {
@@ -6499,9 +6499,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_modify(old_pte, newprot);
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (uffd_wp)
- pte = huge_pte_mkuffd_wp(pte);
+ pte = huge_pte_mkuffd(pte);
else if (uffd_wp_resolve)
- pte = huge_pte_clear_uffd_wp(pte);
+ pte = huge_pte_clear_uffd(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
diff --git a/mm/internal.h b/mm/internal.h
index 5602393054f3..9325eefbea6a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,8 +412,8 @@ static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
new = pte_swp_mksoft_dirty(new);
if (pte_swp_exclusive(pte))
new = pte_swp_mkexclusive(new);
- if (pte_swp_uffd_wp(pte))
- new = pte_swp_mkuffd_wp(new);
+ if (pte_swp_uffd(pte))
+ new = pte_swp_mkuffd(new);
return new;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4549a020bf73..afa218be15de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -37,7 +37,7 @@ enum scan_result {
SCAN_EXCEED_SWAP_PTE,
SCAN_EXCEED_SHARED_PTE,
SCAN_PTE_NON_PRESENT,
- SCAN_PTE_UFFD_WP,
+ SCAN_PTE_UFFD,
SCAN_PTE_MAPPED_HUGEPAGE,
SCAN_LACK_REFERENCED_PAGE,
SCAN_PAGE_NULL,
@@ -712,8 +712,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_PTE_NON_PRESENT;
goto out;
}
- if (pte_uffd_wp(pteval)) {
- result = SCAN_PTE_UFFD_WP;
+ if (pte_uffd(pteval)) {
+ result = SCAN_PTE_UFFD;
goto out;
}
page = vm_normal_page(vma, addr, pteval);
@@ -1566,7 +1566,7 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
case SCAN_PAGE_NULL:
case SCAN_DEL_PAGE_LRU:
case SCAN_PTE_NON_PRESENT:
- case SCAN_PTE_UFFD_WP:
+ case SCAN_PTE_UFFD:
case SCAN_ALLOC_HUGE_PAGE_FAIL:
case SCAN_PAGE_LAZYFREE:
goto next_order;
@@ -1666,15 +1666,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
/*
* Always be strict with uffd-wp
* enabled swap entries. Please see
- * comment below for pte_uffd_wp().
+ * comment below for pte_uffd().
*/
- if (pte_swp_uffd_wp_any(pteval)) {
- result = SCAN_PTE_UFFD_WP;
+ if (pte_swp_uffd_any(pteval)) {
+ result = SCAN_PTE_UFFD;
goto out_unmap;
}
continue;
}
- if (pte_uffd_wp(pteval)) {
+ if (pte_uffd(pteval)) {
/*
* Don't collapse the page if any of the small
* PTEs are armed with uffd write protection.
@@ -1684,7 +1684,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
* userfault messages that falls outside of
* the registered range. So, just be simple.
*/
- result = SCAN_PTE_UFFD_WP;
+ result = SCAN_PTE_UFFD;
goto out_unmap;
}
@@ -1897,7 +1897,7 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
if (userfaultfd_wp(vma))
- return SCAN_PTE_UFFD_WP;
+ return SCAN_PTE_UFFD;
folio = filemap_lock_folio(vma->vm_file->f_mapping,
linear_page_index(vma, haddr));
@@ -3244,7 +3244,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
/* Whitelisted set of results where continuing OK */
case SCAN_NO_PTE_TABLE:
case SCAN_PTE_NON_PRESENT:
- case SCAN_PTE_UFFD_WP:
+ case SCAN_PTE_UFFD:
case SCAN_LACK_REFERENCED_PAGE:
case SCAN_PAGE_NULL:
case SCAN_PAGE_COUNT:
diff --git a/mm/memory.c b/mm/memory.c
index 7c020995eafc..c4fd5cb4a08f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -893,8 +893,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(orig_pte))
- pte = pte_mkuffd_wp(pte);
+ if (pte_swp_uffd(orig_pte))
+ pte = pte_mkuffd(pte);
if ((vma->vm_flags & VM_WRITE) &&
can_change_pte_writable(vma, address, pte)) {
@@ -984,8 +984,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = softleaf_to_pte(entry);
if (pte_swp_soft_dirty(orig_pte))
pte = pte_swp_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(orig_pte))
- pte = pte_swp_mkuffd_wp(pte);
+ if (pte_swp_uffd(orig_pte))
+ pte = pte_swp_mkuffd(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (softleaf_is_device_private(entry)) {
@@ -1018,8 +1018,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
entry = make_readable_device_private_entry(
swp_offset(entry));
pte = swp_entry_to_pte(entry);
- if (pte_swp_uffd_wp(orig_pte))
- pte = pte_swp_mkuffd_wp(pte);
+ if (pte_swp_uffd(orig_pte))
+ pte = pte_swp_mkuffd(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (softleaf_is_device_exclusive(entry)) {
@@ -1042,7 +1042,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return 0;
}
if (!userfaultfd_wp(dst_vma))
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_uffd(pte);
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
}
@@ -1090,7 +1090,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte)))
/* Uffd-wp needs to be delivered to dest pte as well */
- pte = pte_mkuffd_wp(pte);
+ pte = pte_mkuffd(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
return 0;
}
@@ -1113,7 +1113,7 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
pte = pte_mkold(pte);
if (!userfaultfd_wp(dst_vma))
- pte = pte_clear_uffd_wp(pte);
+ pte = pte_clear_uffd(pte);
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
}
@@ -3925,8 +3925,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (unlikely(unshare)) {
if (pte_soft_dirty(vmf->orig_pte))
entry = pte_mksoft_dirty(entry);
- if (pte_uffd_wp(vmf->orig_pte))
- entry = pte_mkuffd_wp(entry);
+ if (pte_uffd(vmf->orig_pte))
+ entry = pte_mkuffd(entry);
} else {
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
@@ -4261,7 +4261,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
* etc.) because we're only removing the uffd-wp bit,
* which is completely invisible to the user.
*/
- pte = pte_clear_uffd_wp(ptep_get(vmf->pte));
+ pte = pte_clear_uffd(ptep_get(vmf->pte));
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
/*
@@ -5038,8 +5038,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte = mk_pte(page, vma->vm_page_prot);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(vmf->orig_pte))
- pte = pte_mkuffd_wp(pte);
+ if (pte_swp_uffd(vmf->orig_pte))
+ pte = pte_mkuffd(pte);
/*
* Same logic as in do_wp_page(); however, optimize for pages that are
@@ -5255,7 +5255,7 @@ void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry), vma);
if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
folio_ref_add(folio, nr_pages - 1);
folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
@@ -5322,7 +5322,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_MISSING);
}
if (vmf_orig_pte_uffd_wp(vmf))
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
/* No need to invalidate - it was non-present before */
@@ -5572,7 +5572,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
else if (pte_write(entry) && folio_test_dirty(folio))
entry = pte_mkdirty(entry);
if (unlikely(vmf_orig_pte_uffd_wp(vmf)))
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
VM_BUG_ON_FOLIO(nr != 1, folio);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0c6a0ab6ecce..4bdb5be7afbf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -326,8 +326,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
if (pte_swp_soft_dirty(old_pte))
newpte = pte_mksoft_dirty(newpte);
- if (pte_swp_uffd_wp(old_pte))
- newpte = pte_mkuffd_wp(newpte);
+ if (pte_swp_uffd(old_pte))
+ newpte = pte_mkuffd(newpte);
set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
@@ -391,8 +391,8 @@ static bool remove_migration_pte(struct folio *folio,
if (softleaf_is_migration_write(entry))
pte = pte_mkwrite(pte, vma);
- else if (pte_swp_uffd_wp(old_pte))
- pte = pte_mkuffd_wp(pte);
+ else if (pte_swp_uffd(old_pte))
+ pte = pte_mkuffd(pte);
if (folio_test_anon(folio) && !softleaf_is_migration_read(entry))
rmap_flags |= RMAP_EXCLUSIVE;
@@ -407,8 +407,8 @@ static bool remove_migration_pte(struct folio *folio,
pte = softleaf_to_pte(entry);
if (pte_swp_soft_dirty(old_pte))
pte = pte_swp_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(old_pte))
- pte = pte_swp_mkuffd_wp(pte);
+ if (pte_swp_uffd(old_pte))
+ pte = pte_swp_mkuffd(pte);
}
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 554754eb26ff..17da1bab0248 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -445,13 +445,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pte))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_uffd(pte))
+ swp_pte = pte_swp_mkuffd(swp_pte);
} else {
if (pte_swp_soft_dirty(pte))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pte))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_swp_uffd(pte))
+ swp_pte = pte_swp_mkuffd(swp_pte);
}
set_pte_at(mm, addr, ptep, swp_pte);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9cbf932b028c..8340c8b228c6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -240,8 +240,8 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
*/
entry = make_readable_device_private_entry(swp_offset(entry));
newpte = swp_entry_to_pte(entry);
- if (pte_swp_uffd_wp(oldpte))
- newpte = pte_swp_mkuffd_wp(newpte);
+ if (pte_swp_uffd(oldpte))
+ newpte = pte_swp_mkuffd(newpte);
} else if (softleaf_is_marker(entry)) {
/*
* Ignore error swap entries unconditionally,
@@ -266,9 +266,9 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
}
if (uffd_wp)
- newpte = pte_swp_mkuffd_wp(newpte);
+ newpte = pte_swp_mkuffd(newpte);
else if (uffd_wp_resolve)
- newpte = pte_swp_clear_uffd_wp(newpte);
+ newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(oldpte, newpte)) {
set_pte_at(vma->vm_mm, addr, pte, newpte);
@@ -290,9 +290,9 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
ptent = pte_modify(oldpte, newprot);
if (uffd_wp)
- ptent = pte_mkuffd_wp(ptent);
+ ptent = pte_mkuffd(ptent);
else if (uffd_wp_resolve)
- ptent = pte_clear_uffd_wp(ptent);
+ ptent = pte_clear_uffd(ptent);
/*
* In some writable, shared mappings, we might want
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832..12732a5c547e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -297,9 +297,9 @@ static int move_ptes(struct pagetable_move_control *pmc,
else {
if (need_clear_uffd_wp) {
if (pte_present(pte))
- pte = pte_clear_uffd_wp(pte);
+ pte = pte_clear_uffd(pte);
else
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_uffd(pte);
}
set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
}
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 53a8997ec043..3fb995e5d40d 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -188,8 +188,8 @@ static inline bool softleaf_cached_writable(softleaf_t entry)
static void page_table_check_pte_flags(pte_t pte)
{
if (pte_present(pte)) {
- WARN_ON_ONCE(pte_uffd_wp(pte) && pte_write(pte));
- } else if (pte_swp_uffd_wp(pte)) {
+ WARN_ON_ONCE(pte_uffd(pte) && pte_write(pte));
+ } else if (pte_swp_uffd(pte)) {
const softleaf_t entry = softleaf_from_pte(pte);
WARN_ON_ONCE(softleaf_cached_writable(entry));
@@ -216,9 +216,9 @@ EXPORT_SYMBOL(__page_table_check_ptes_set);
static inline void page_table_check_pmd_flags(pmd_t pmd)
{
if (pmd_present(pmd)) {
- if (pmd_uffd_wp(pmd))
+ if (pmd_uffd(pmd))
WARN_ON_ONCE(pmd_write(pmd));
- } else if (pmd_swp_uffd_wp(pmd)) {
+ } else if (pmd_swp_uffd(pmd)) {
const softleaf_t entry = softleaf_from_pmd(pmd);
WARN_ON_ONCE(softleaf_cached_writable(entry));
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c77d5dc06e9..546bc1cf9391 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2318,13 +2318,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (likely(pte_present(pteval))) {
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
} else {
if (pte_swp_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_swp_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
}
set_pte_at(mm, address, pvmw.pte, swp_pte);
} else {
@@ -2692,14 +2692,14 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
} else {
swp_pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_swp_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
}
if (folio_test_hugetlb(folio))
set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3d126602a1e..15fdca2da1f7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2557,8 +2557,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
if (pte_swp_soft_dirty(old_pte))
new_pte = pte_mksoft_dirty(new_pte);
- if (pte_swp_uffd_wp(old_pte))
- new_pte = pte_mkuffd_wp(new_pte);
+ if (pte_swp_uffd(old_pte))
+ new_pte = pte_mkuffd(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6d2a1c67019..9d74be69873a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -394,7 +394,7 @@ static int mfill_atomic_install_pte(pmd_t *dst_pmd,
if (writable)
_dst_pte = pte_mkwrite(_dst_pte, dst_vma);
if (flags & MFILL_ATOMIC_WP)
- _dst_pte = pte_mkuffd_wp(_dst_pte);
+ _dst_pte = pte_mkuffd(_dst_pte);
ret = -EAGAIN;
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
@@ -3591,7 +3591,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
vm_flags |= VM_UFFD_MISSING;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
- if (!pgtable_supports_uffd_wp())
+ if (!pgtable_supports_uffd())
goto out;
vm_flags |= VM_UFFD_WP;
@@ -4301,7 +4301,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
uffdio_api.features &=
~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
#endif
- if (!pgtable_supports_uffd_wp())
+ if (!pgtable_supports_uffd())
uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
if (!uffd_supports_wp_marker()) {
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (2 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 03/15] mm: rename uffd-wp PTE accessors " Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-06-02 10:07 ` Mike Rapoport
2026-06-03 12:54 ` Lorenzo Stoakes
2026-05-29 17:26 ` [PATCH v6 05/15] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau (Meta)
` (10 subsequent siblings)
14 siblings, 2 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
The uffd VMA-flag helpers read vma->vm_flags directly. Now that
config-gated per-mode masks exist, switch them to the vma_flags_t
accessor vma_test_any_mask(), which is the going-forward API and keeps a
single place (the VMA_UFFD_* masks) that knows which modes are available
on the current build.
No functional change: vma_flags_t is in union with vm_flags, so the same
bits are read, and the masks fold to the same code the open-coded
vm_flags tests produced -- verified identical on gcc and clang, 32- and
64-bit.
Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-8
---
include/linux/userfaultfd_k.h | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 658740df2978..c4f2cc6dfcf0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -178,7 +178,8 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
*/
static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
{
- return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+ return vma_test_any_mask(vma,
+ mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
}
/*
@@ -190,22 +191,23 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
*/
static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
{
- return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+ return vma_test_any_mask(vma,
+ mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
}
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
- return vma->vm_flags & VM_UFFD_MISSING;
+ return vma_test_any_mask(vma, VMA_UFFD_MISSING);
}
static inline bool userfaultfd_wp(struct vm_area_struct *vma)
{
- return vma->vm_flags & VM_UFFD_WP;
+ return vma_test_any_mask(vma, VMA_UFFD_WP);
}
static inline bool userfaultfd_minor(struct vm_area_struct *vma)
{
- return vma->vm_flags & VM_UFFD_MINOR;
+ return vma_test_any_mask(vma, VMA_UFFD_MINOR);
}
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
@@ -222,7 +224,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
- return vma->vm_flags & __VM_UFFD_FLAGS;
+ return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
}
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 05/15] mm: add VM_UFFD_RWP VMA flag
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (3 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-06-03 12:52 ` Lorenzo Stoakes
2026-05-29 17:26 ` [PATCH v6 06/15] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau (Meta)
` (9 subsequent siblings)
14 siblings, 1 reply; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Preparatory patch for userfaultfd read-write protection (RWP). RWP
extends userfaultfd protection from plain write-protection (WP) to
full read-write protection: accesses to an RWP-protected range --
reads as well as writes -- trap through userfaultfd.
Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and
userfaultfd_protected() helpers, and wire up the smaps "ur" entry and
the trace-flag table the rest of the series will use. The flag is
gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the
UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and
every downstream check folds to dead code.
Nothing sets or queries the flag yet.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
Documentation/filesystems/proc.rst | 1 +
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 41 ++++++++++++++++++++----------
include/linux/userfaultfd_k.h | 32 +++++++++++++++++++----
include/trace/events/mmflags.h | 7 +++++
5 files changed, 65 insertions(+), 19 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index db6167befb7b..db28207c5290 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -607,6 +607,7 @@ encoded manner. The codes are the following:
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
ui userfaultfd minor fault
+ ur userfaultfd read-write-protect tracking
ss shadow/guarded control stack page
sl sealed
lf lock on fault pages
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 939657aa334a..ca0f69b347e8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1237,6 +1237,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_USERFAULTFD_RWP
+ [ilog2(VM_UFFD_RWP)] = "ur",
+#endif
#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
[ilog2(VM_SHADOW_STACK)] = "ss",
#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbdd..5ac31fbadeef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -353,6 +353,7 @@ enum {
#endif
DECLARE_VMA_BIT(UFFD_MINOR, 41),
DECLARE_VMA_BIT(SEALED, 42),
+ DECLARE_VMA_BIT(UFFD_RWP, 43),
/* Flags that reuse flags above. */
DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0),
DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1),
@@ -496,12 +497,17 @@ enum {
#else
#define VM_UFFD_MINOR VM_NONE
#endif
+#ifdef CONFIG_USERFAULTFD_RWP
+#define VM_UFFD_RWP INIT_VM_FLAG(UFFD_RWP)
+#else
+#define VM_UFFD_RWP VM_NONE
+#endif
/*
- * vma_flags_t masks for the userfaultfd VMA flags. VMA_UFFD_MINOR is gated on
- * the same config as VM_UFFD_MINOR -- which implies 64BIT, where the bit fits
- * -- so an out-of-range bit is never fed to mk_vma_flags() on a build whose
- * bitmap cannot hold it.
+ * vma_flags_t masks for the userfaultfd VMA flags. The two high-bit modes are
+ * gated on the same configs as their VM_* flags above -- both of which imply
+ * 64BIT -- so an out-of-range bit is never fed to mk_vma_flags() on a build
+ * whose bitmap cannot hold it.
*/
#define VMA_UFFD_MISSING mk_vma_flags(VMA_UFFD_MISSING_BIT)
#define VMA_UFFD_WP mk_vma_flags(VMA_UFFD_WP_BIT)
@@ -510,6 +516,11 @@ enum {
#else
#define VMA_UFFD_MINOR EMPTY_VMA_FLAGS
#endif
+#ifdef CONFIG_USERFAULTFD_RWP
+#define VMA_UFFD_RWP mk_vma_flags(VMA_UFFD_RWP_BIT)
+#else
+#define VMA_UFFD_RWP EMPTY_VMA_FLAGS
+#endif
#ifdef CONFIG_64BIT
#define VM_ALLOW_ANY_UNCACHED INIT_VM_FLAG(ALLOW_ANY_UNCACHED)
@@ -648,22 +659,24 @@ enum {
* reconsistuted upon page fault, so necessitate page table copying upon fork.
*
* Note that these flags should be compared with the DESTINATION VMA not the
- * source, as VM_UFFD_WP may not be propagated to destination, while all other
- * flags will be.
+ * source: VM_UFFD_WP and VM_UFFD_RWP may be cleared on the destination
+ * (dup_userfaultfd() -> userfaultfd_reset_ctx() when the parent context did
+ * not negotiate UFFD_FEATURE_EVENT_FORK), while all other flags propagate.
*
* VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot be
* reasonably reconstructed on page fault.
*
* VM_UFFD_WP - Encodes metadata about an installed uffd
- * write protect handler, which cannot be
- * reconstructed on page fault.
+ * VM_UFFD_RWP write- or read-write-protect handler, which
+ * cannot be reconstructed on page fault.
*
- * We always copy pgtables when dst_vma has uffd-wp
- * enabled even if it's file-backed
- * (e.g. shmem). Because when uffd-wp is enabled,
- * pgtable contains uffd-wp protection information,
- * that's something we can't retrieve from page cache,
- * and skip copying will lose those info.
+ * We always copy pgtables when dst_vma has the
+ * uffd PTE bit in use even if it's file-backed
+ * (e.g. shmem). Because when the uffd bit is
+ * in use, the pgtable contains the protection
+ * information, that's something we can't
+ * retrieve from page cache, and skip copying
+ * will lose those info.
*
* VM_MAYBE_GUARD - Could contain page guard region markers which
* by design are a property of the page tables
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c4f2cc6dfcf0..f3b2db27989b 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -21,10 +21,11 @@
#include <linux/hugetlb_inline.h>
/* The set of all possible UFFD-related VM flags. */
-#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
+#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \
+ VM_UFFD_WP | VM_UFFD_RWP)
#define __VMA_UFFD_FLAGS mk_vma_flags_from_masks(VMA_UFFD_MISSING, VMA_UFFD_WP, \
- VMA_UFFD_MINOR)
+ VMA_UFFD_MINOR, VMA_UFFD_RWP)
/*
* CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -179,7 +180,8 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
{
return vma_test_any_mask(vma,
- mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
+ mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR,
+ VMA_UFFD_RWP));
}
/*
@@ -210,6 +212,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
return vma_test_any_mask(vma, VMA_UFFD_MINOR);
}
+static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
+{
+ return vma_test_any_mask(vma, VMA_UFFD_RWP);
+}
+
+static inline bool userfaultfd_protected(struct vm_area_struct *vma)
+{
+ return userfaultfd_wp(vma) || userfaultfd_rwp(vma);
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
@@ -330,6 +342,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
+{
+ return false;
+}
+
+static inline bool userfaultfd_protected(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
@@ -423,8 +445,8 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
}
/*
- * Returns true if this is a swap pte and was uffd-wp wr-protected in either
- * forms (pte marker or a normal swap pte), false otherwise.
+ * Returns true if this swap pte carries uffd-tracked state in either
+ * form (pte marker or a normal swap pte), false otherwise.
*/
static inline bool pte_swp_uffd_any(pte_t pte)
{
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a6e5a44c9b42..bfface3d0203 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -194,6 +194,12 @@ IF_HAVE_PG_ARCH_3(arch_3)
# define IF_HAVE_UFFD_MINOR(flag, name)
#endif
+#ifdef CONFIG_USERFAULTFD_RWP
+# define IF_HAVE_UFFD_RWP(flag, name) {flag, name},
+#else
+# define IF_HAVE_UFFD_RWP(flag, name)
+#endif
+
#if defined(CONFIG_64BIT) || defined(CONFIG_PPC32)
# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
#else
@@ -215,6 +221,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \
{VM_PFNMAP, "pfnmap" }, \
{VM_MAYBE_GUARD, "maybe_guard" }, \
{VM_UFFD_WP, "uffd_wp" }, \
+IF_HAVE_UFFD_RWP(VM_UFFD_RWP, "uffd_rwp" ) \
{VM_LOCKED, "locked" }, \
{VM_IO, "io" }, \
{VM_SEQ_READ, "seqread" }, \
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 06/15] mm: add MM_CP_UFFD_RWP change_protection() flag
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (4 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 05/15] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 07/15] mm: preserve RWP marker across PTE rewrites Kiryl Shutsemau (Meta)
` (8 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Preparatory patch. Add the change_protection() primitive that
userfaultfd RWP will use.
An RWP-protected PTE is PAGE_NONE with the uffd PTE bit set. The
PROT_NONE half makes the CPU fault on any access; the uffd bit
distinguishes an RWP fault from a plain mprotect(PROT_NONE) or NUMA
hinting fault. MM_CP_UFFD_WP and MM_CP_UFFD_RWP share the same PTE
bit, so the two cannot be used together on the same range.
Two new change_protection() flags:
MM_CP_UFFD_RWP install PAGE_NONE and set the uffd bit
MM_CP_UFFD_RWP_RESOLVE restore vma->vm_page_prot, clear the uffd bit
Both are wired through change_pte_range(), change_huge_pmd(), and
hugetlb_change_protection() so anon, shmem, THP, and hugetlb all
share the same semantics.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
include/linux/mm.h | 5 ++++
include/linux/userfaultfd_k.h | 1 -
mm/huge_memory.c | 30 ++++++++++++----------
mm/hugetlb.c | 25 +++++++++++++-----
mm/mprotect.c | 48 +++++++++++++++++++++++++++--------
5 files changed, 78 insertions(+), 31 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ac31fbadeef..87b2fb1e3f23 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3330,6 +3330,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+/* Whether this change is for uffd RWP */
+#define MM_CP_UFFD_RWP (1UL << 4) /* do rwp */
+#define MM_CP_UFFD_RWP_RESOLVE (1UL << 5) /* resolve rwp */
+#define MM_CP_UFFD_RWP_ALL (MM_CP_UFFD_RWP | \
+ MM_CP_UFFD_RWP_RESOLVE)
bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index f3b2db27989b..5115827981a2 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -364,7 +364,6 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
return false;
}
-
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d43c2255f47d..40c65bf2d6dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2640,8 +2640,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
}
static void change_non_present_huge_pmd(struct mm_struct *mm,
- unsigned long addr, pmd_t *pmd, bool uffd_wp,
- bool uffd_wp_resolve)
+ unsigned long addr, pmd_t *pmd, bool uffd_prot,
+ bool uffd_prot_resolve)
{
softleaf_t entry = softleaf_from_pmd(*pmd);
const struct folio *folio = softleaf_to_folio(entry);
@@ -2669,9 +2669,9 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
newpmd = *pmd;
}
- if (uffd_wp)
+ if (uffd_prot)
newpmd = pmd_swp_mkuffd(newpmd);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
newpmd = pmd_swp_clear_uffd(newpmd);
if (!pmd_same(*pmd, newpmd))
set_pmd_at(mm, addr, pmd, newpmd);
@@ -2692,8 +2692,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spinlock_t *ptl;
pmd_t oldpmd, entry;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
- bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
- bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ bool uffd_prot = cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP);
+ bool uffd_prot_resolve = cp_flags &
+ (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE);
int ret = 1;
tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
@@ -2706,11 +2707,17 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 0;
if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) {
- change_non_present_huge_pmd(mm, addr, pmd, uffd_wp,
- uffd_wp_resolve);
+ change_non_present_huge_pmd(mm, addr, pmd, uffd_prot,
+ uffd_prot_resolve);
goto unlock;
}
+ /* Already in the desired state */
+ if (prot_numa && pmd_protnone(*pmd))
+ goto unlock;
+ if ((cp_flags & MM_CP_UFFD_RWP) && pmd_protnone(*pmd) && pmd_uffd(*pmd))
+ goto unlock;
+
if (prot_numa) {
/*
@@ -2721,9 +2728,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(*pmd))
goto unlock;
- if (pmd_protnone(*pmd))
- goto unlock;
-
if (!folio_can_map_prot_numa(pmd_folio(*pmd), vma,
vma_is_single_threaded_private(vma)))
goto unlock;
@@ -2752,9 +2756,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
entry = pmd_modify(oldpmd, newprot);
- if (uffd_wp)
+ if (uffd_prot)
entry = pmd_mkuffd(entry);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
/*
* Leave the write bit to be handled by PF interrupt
* handler, then things like COW could be properly
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0c81a056ae2..4d75b69d4272 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6395,6 +6395,8 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long last_addr_mask;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ bool uffd_rwp = cp_flags & MM_CP_UFFD_RWP;
+ bool uffd_rwp_resolve = cp_flags & MM_CP_UFFD_RWP_RESOLVE;
struct mmu_gather tlb;
/*
@@ -6420,6 +6422,11 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
ptep = hugetlb_walk(vma, address, psize);
if (!ptep) {
+ /*
+ * uffd_wp installs a pte marker on the unpopulated
+ * entry; uffd_rwp does not install markers so the
+ * allocation is unnecessary for it.
+ */
if (!uffd_wp) {
address |= last_addr_mask;
continue;
@@ -6441,7 +6448,8 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
* shouldn't happen at all. Warn about it if it
* happened due to some reason.
*/
- WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve ||
+ uffd_rwp || uffd_rwp_resolve);
pages++;
spin_unlock(ptl);
address |= last_addr_mask;
@@ -6475,9 +6483,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pages++;
}
- if (uffd_wp)
+ if (uffd_wp || uffd_rwp)
newpte = pte_swp_mkuffd(newpte);
- else if (uffd_wp_resolve)
+ else if (uffd_wp_resolve || uffd_rwp_resolve)
newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(pte, newpte))
set_huge_pte_at(mm, address, ptep, newpte, psize);
@@ -6488,19 +6496,24 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
* pte_marker_uffd_wp()==true implies !poison
* because they're mutual exclusive.
*/
- if (pte_is_uffd_wp_marker(pte) && uffd_wp_resolve)
+ if (pte_is_uffd_wp_marker(pte) &&
+ (uffd_wp_resolve || uffd_rwp_resolve))
/* Safe to modify directly (non-present->none). */
huge_pte_clear(mm, address, ptep, psize);
} else {
pte_t old_pte;
unsigned int shift = huge_page_shift(hstate_vma(vma));
+ /* Already protnone with uffd bit set? Nothing to do. */
+ if (uffd_rwp && pte_protnone(pte) && huge_pte_uffd(pte))
+ goto next;
+
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = huge_pte_modify(old_pte, newprot);
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
- if (uffd_wp)
+ if (uffd_wp || uffd_rwp)
pte = huge_pte_mkuffd(pte);
- else if (uffd_wp_resolve)
+ else if (uffd_wp_resolve || uffd_rwp_resolve)
pte = huge_pte_clear_uffd(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8340c8b228c6..7dcc94e7bfd6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -214,8 +214,9 @@ static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_stru
static long change_softleaf_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags)
{
- const bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
- const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ const bool uffd_prot = cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP);
+ const bool uffd_prot_resolve = cp_flags &
+ (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE);
softleaf_t entry = softleaf_from_pte(oldpte);
pte_t newpte;
@@ -256,7 +257,7 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
* to unprotect it, drop it; the next page
* fault will trigger without uffd trapping.
*/
- if (uffd_wp_resolve) {
+ if (uffd_prot_resolve) {
pte_clear(vma->vm_mm, addr, pte);
return 1;
}
@@ -265,9 +266,9 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
newpte = oldpte;
}
- if (uffd_wp)
+ if (uffd_prot)
newpte = pte_swp_mkuffd(newpte);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(oldpte, newpte)) {
@@ -282,16 +283,17 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
int nr_ptes, unsigned long end, pgprot_t newprot,
struct folio *folio, struct page *page, unsigned long cp_flags)
{
- const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
- const bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ const bool uffd_prot = cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP);
+ const bool uffd_prot_resolve = cp_flags &
+ (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE);
pte_t ptent, oldpte;
oldpte = modify_prot_start_ptes(vma, addr, ptep, nr_ptes);
ptent = pte_modify(oldpte, newprot);
- if (uffd_wp)
+ if (uffd_prot)
ptent = pte_mkuffd(ptent);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
ptent = pte_clear_uffd(ptent);
/*
@@ -325,6 +327,7 @@ static long change_pte_range(struct mmu_gather *tlb,
long pages = 0;
bool is_private_single_threaded;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+ bool uffd_rwp = cp_flags & MM_CP_UFFD_RWP;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
int nr_ptes;
@@ -350,6 +353,14 @@ static long change_pte_range(struct mmu_gather *tlb,
/* Already in the desired state. */
if (prot_numa && pte_protnone(oldpte))
continue;
+ /*
+ * RWP-protected PTEs carry _PAGE_UFFD as a marker on
+ * top of PROT_NONE. Skip only entries already in that
+ * exact state; plain PROT_NONE from mprotect() still needs
+ * to be promoted so future faults can be distinguished.
+ */
+ if (uffd_rwp && pte_protnone(oldpte) && pte_uffd(oldpte))
+ continue;
page = vm_normal_page(vma, addr, oldpte);
if (page)
@@ -358,6 +369,8 @@ static long change_pte_range(struct mmu_gather *tlb,
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
+ * Skip this filter for uffd RWP which
+ * must set protnone regardless of NUMA placement.
*/
if (prot_numa &&
!folio_can_map_prot_numa(folio, vma,
@@ -428,7 +441,7 @@ pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags)
* (e.g. 2M shmem) because file thp is handled differently when
* split by erasing the pmd so far.
*/
- return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+ return (cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP)) && !vma_is_anonymous(vma);
}
/*
@@ -667,7 +680,16 @@ long change_protection(struct mmu_gather *tlb,
pgprot_t newprot = vma->vm_page_prot;
long pages;
- BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
+ /*
+ * MM_CP_UFFD_{WP,RWP} and _RESOLVE are mutually exclusive within one
+ * change, and WP and RWP cannot mix. Miswired callers get a warn and
+ * a no-op; userspace cannot reach this state.
+ */
+ if (WARN_ON_ONCE((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL ||
+ (cp_flags & MM_CP_UFFD_RWP_ALL) == MM_CP_UFFD_RWP_ALL ||
+ ((cp_flags & MM_CP_UFFD_WP_ALL) &&
+ (cp_flags & MM_CP_UFFD_RWP_ALL))))
+ return 0;
#ifdef CONFIG_NUMA_BALANCING
/*
@@ -681,6 +703,10 @@ long change_protection(struct mmu_gather *tlb,
WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
#endif
+ if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE) &&
+ (cp_flags & MM_CP_UFFD_RWP))
+ newprot = PAGE_NONE;
+
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot,
cp_flags);
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 07/15] mm: preserve RWP marker across PTE rewrites
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (5 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 06/15] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 08/15] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Kiryl Shutsemau (Meta)
` (7 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
The uffd PTE bit must survive any kernel path that rewrites a PTE
on a VM_UFFD_RWP VMA, otherwise the marker that carries PAGE_NONE
semantics is silently dropped and the next access leaks past RWP
tracking. Wire the preservation through every path that rewrites a
VM_UFFD_RWP PTE.
Swap and device-exclusive: do_swap_page(), restore_exclusive_pte(),
and unuse_pte() (swapoff()) re-apply PAGE_NONE when the swap PTE
carries the uffd bit and the VMA has VM_UFFD_RWP.
Migration: remove_migration_pte() and remove_migration_pmd() do the
same after the migration entry is replaced with a real PTE/PMD.
Fork: __copy_present_ptes(), copy_present_page(), copy_nonpresent_pte(),
copy_huge_pmd(), copy_huge_non_present_pmd(), and
copy_hugetlb_page_range() keep the uffd bit on the child when the
destination VMA has VM_UFFD_RWP, matching the existing VM_UFFD_WP
handling. Add VM_UFFD_RWP to VM_COPY_ON_FORK so the flag itself
propagates.
mprotect(): change_pte_range() and change_huge_pmd() restore PAGE_NONE
after pte_modify()/pmd_modify() have recomputed the base protection
from a (possibly user-changed) vm_page_prot. pte_modify() preserves
_PAGE_UFFD, so the bit stays; we just have to force PAGE_NONE back
on top.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 47 +++++++++++++++++++++++++++++++++++++----
mm/hugetlb.c | 52 ++++++++++++++++++++++++++++++++++++++--------
mm/memory.c | 49 ++++++++++++++++++++++++++++++++++++-------
mm/migrate.c | 8 +++++++
mm/mprotect.c | 10 +++++++++
mm/mremap.c | 13 ++++++++++--
mm/swapfile.c | 5 +++++
mm/userfaultfd.c | 17 +++++++++++++++
9 files changed, 181 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 87b2fb1e3f23..3d4d5f9a6f1b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -683,7 +683,8 @@ enum {
* only and thus cannot be reconstructed on page
* fault.
*/
-#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_MAYBE_GUARD)
+#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_UFFD_RWP | \
+ VM_MAYBE_GUARD)
/*
* mapping from the currently active vm_flags protection bits (the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40c65bf2d6dc..6417d883d2e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1943,7 +1943,7 @@ static void copy_huge_non_present_pmd(
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
pmd = pmd_swp_clear_uffd(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
}
@@ -2038,9 +2038,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
out_zero_page:
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- pmdp_set_wrprotect(src_mm, addr, src_pmd);
- if (!userfaultfd_wp(dst_vma))
+
+ /* See __copy_present_ptes(): restore accessible protection. */
+ if (!userfaultfd_protected(dst_vma)) {
+ if (userfaultfd_rwp(src_vma) && pmd_uffd(pmd))
+ pmd = pmd_modify(pmd, dst_vma->vm_page_prot);
pmd = pmd_clear_uffd(pmd);
+ }
+
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_wrprotect(pmd);
set_pmd:
pmd = pmd_mkold(pmd);
@@ -2626,8 +2632,16 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
pmd = move_soft_dirty_pmd(pmd);
- if (vma_has_uffd_without_event_remap(vma))
+ if (vma_has_uffd_without_event_remap(vma)) {
+ /*
+ * See __copy_present_ptes(): normalise RWP PMDs so
+ * the destination starts accessible instead of taking
+ * a numa-hinting fault on first access.
+ */
+ if (pmd_present(pmd) && userfaultfd_rwp(vma))
+ pmd = pmd_modify(pmd, vma->vm_page_prot);
pmd = clear_uffd_wp_pmd(pmd);
+ }
set_pmd_at(mm, new_addr, new_pmd, pmd);
if (force_flush)
flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
@@ -2766,6 +2780,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
entry = pmd_clear_uffd(entry);
+ /* See change_pte_range(): preserve RWP protection across mprotect() */
+ if (userfaultfd_rwp(vma) && pmd_uffd(entry))
+ entry = pmd_modify(entry, PAGE_NONE);
+
/* See change_pte_range(). */
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
can_change_pmd_writable(vma, addr, entry))
@@ -2933,6 +2951,13 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
_dst_pmd = move_soft_dirty_pmd(src_pmdval);
_dst_pmd = clear_uffd_wp_pmd(_dst_pmd);
}
+
+ /* Re-arm RWP on the moved PMD if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma)) {
+ _dst_pmd = pmd_modify(_dst_pmd, PAGE_NONE);
+ _dst_pmd = pmd_mkuffd(_dst_pmd);
+ }
+
set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
@@ -3109,6 +3134,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
entry = pte_mkspecial(entry);
if (pmd_uffd(old_pmd))
entry = pte_mkuffd(entry);
+
+ /* Restore PAGE_NONE so an RWP marker keeps trapping */
+ if (userfaultfd_rwp(vma) && pmd_uffd(old_pmd))
+ entry = pte_modify(entry, PAGE_NONE);
+
VM_BUG_ON(!pte_none(ptep_get(pte)));
set_pte_at(mm, addr, pte, entry);
pte++;
@@ -3383,6 +3413,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (uffd_wp)
entry = pte_mkuffd(entry);
+ /* Restore PAGE_NONE so an RWP marker keeps trapping */
+ if (userfaultfd_rwp(vma) && uffd_wp)
+ entry = pte_modify(entry, PAGE_NONE);
+
for (i = 0; i < HPAGE_PMD_NR; i++)
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
@@ -5055,6 +5089,11 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mkwrite(pmde, vma);
if (pmd_swp_uffd(*pvmw->pmd))
pmde = pmd_mkuffd(pmde);
+
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pmd_swp_uffd(*pvmw->pmd) && userfaultfd_rwp(vma))
+ pmde = pmd_modify(pmde, PAGE_NONE);
+
if (!softleaf_is_migration_young(entry))
pmde = pmd_mkold(pmde);
/* NOTE: this may contain setting soft-dirty on some archs */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4d75b69d4272..0d8d39cd8888 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4843,8 +4843,16 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long add
__folio_mark_uptodate(new_folio);
hugetlb_add_new_anon_rmap(new_folio, vma, addr);
- if (userfaultfd_wp(vma) && huge_pte_uffd(old))
+ if (userfaultfd_protected(vma) && huge_pte_uffd(old)) {
newpte = huge_pte_mkuffd(newpte);
+ /* Restore PAGE_NONE so the RWP marker keeps trapping. */
+ if (userfaultfd_rwp(vma)) {
+ unsigned int shift = huge_page_shift(hstate_vma(vma));
+
+ newpte = huge_pte_modify(newpte, PAGE_NONE);
+ newpte = arch_make_huge_pte(newpte, shift, vma->vm_flags);
+ }
+ }
set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz);
hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
folio_set_hugetlb_migratable(new_folio);
@@ -4917,7 +4925,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = softleaf_from_pte(entry);
if (unlikely(softleaf_is_hwpoison(softleaf))) {
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(softleaf_is_migration(softleaf))) {
@@ -4931,11 +4939,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = make_readable_migration_entry(
swp_offset(softleaf));
entry = swp_entry_to_pte(softleaf);
- if (userfaultfd_wp(src_vma) && uffd)
+ if (userfaultfd_protected(src_vma) && uffd)
entry = pte_swp_mkuffd(entry);
set_huge_pte_at(src, addr, src_pte, entry, sz);
}
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(pte_is_marker(entry))) {
@@ -5000,6 +5008,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
goto next;
}
+ /* See __copy_present_ptes(): restore accessible protection. */
+ if (!userfaultfd_protected(dst_vma)) {
+ if (userfaultfd_rwp(src_vma) && huge_pte_uffd(entry)) {
+ entry = huge_pte_modify(entry, dst_vma->vm_page_prot);
+ entry = arch_make_huge_pte(entry, huge_page_shift(h),
+ dst_vma->vm_flags);
+ }
+ entry = huge_pte_clear_uffd(entry);
+ }
+
if (cow) {
/*
* No need to notify as we are downgrading page
@@ -5012,9 +5030,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
entry = huge_pte_wrprotect(entry);
}
- if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd(entry);
-
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
hugetlb_count_add(npages, dst);
}
@@ -5060,10 +5075,22 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
huge_pte_clear(mm, new_addr, dst_pte, sz);
} else {
if (need_clear_uffd_wp) {
- if (pte_present(pte))
+ if (pte_present(pte)) {
+ /*
+ * See __copy_present_ptes(): normalise RWP
+ * PTEs so the destination starts accessible
+ * instead of taking a numa-hinting fault on
+ * first access.
+ */
+ if (userfaultfd_rwp(vma)) {
+ pte = huge_pte_modify(pte, vma->vm_page_prot);
+ pte = arch_make_huge_pte(pte, huge_page_shift(h),
+ vma->vm_flags);
+ }
pte = huge_pte_clear_uffd(pte);
- else
+ } else {
pte = pte_swp_clear_uffd(pte);
+ }
}
set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
}
@@ -6515,6 +6542,13 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_mkuffd(pte);
else if (uffd_wp_resolve || uffd_rwp_resolve)
pte = huge_pte_clear_uffd(pte);
+
+ /* Preserve RWP protection across mprotect() */
+ if (userfaultfd_rwp(vma) && huge_pte_uffd(pte)) {
+ pte = huge_pte_modify(pte, PAGE_NONE);
+ pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ }
+
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
diff --git a/mm/memory.c b/mm/memory.c
index c4fd5cb4a08f..06473285c0dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -896,6 +896,10 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
if (pte_swp_uffd(orig_pte))
pte = pte_mkuffd(pte);
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(orig_pte) && userfaultfd_rwp(vma))
+ pte = pte_modify(pte, PAGE_NONE);
+
if ((vma->vm_flags & VM_WRITE) &&
can_change_pte_writable(vma, address, pte)) {
if (folio_test_dirty(folio))
@@ -1041,7 +1045,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
make_pte_marker(marker));
return 0;
}
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
pte = pte_swp_clear_uffd(pte);
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
@@ -1088,9 +1092,13 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* All done, just insert the new page copy in the child */
pte = folio_mk_pte(new_folio, dst_vma->vm_page_prot);
pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
- if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte)))
- /* Uffd-wp needs to be delivered to dest pte as well */
+ if (userfaultfd_protected(dst_vma) && pte_uffd(ptep_get(src_pte))) {
+ /* The uffd bit needs to be delivered to the dest pte as well */
pte = pte_mkuffd(pte);
+ /* Restore PAGE_NONE so the RWP marker keeps trapping */
+ if (userfaultfd_rwp(dst_vma))
+ pte = pte_modify(pte, PAGE_NONE);
+ }
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
return 0;
}
@@ -1100,9 +1108,31 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
pte_t pte, unsigned long addr, int nr)
{
struct mm_struct *src_mm = src_vma->vm_mm;
+ bool writable;
+
+ /*
+ * Snapshot writability before the RWP-disarm rewrite below: when the
+ * child is not RWP-armed, pte_modify(pte, dst_vma->vm_page_prot) can
+ * silently drop _PAGE_RW from a resolved (no-marker) writable PTE,
+ * so a later pte_write(pte) check would skip the COW wrprotect and
+ * leave the parent writable over a folio shared with the child.
+ */
+ writable = pte_write(pte);
+
+ /*
+ * Child is not RWP-armed: restore accessible protection so the
+ * inherited PAGE_NONE does not cost a fault on first read. Gate on
+ * pte_uffd(pte) so unrelated PAGE_NONE markers (e.g. NUMA balancing)
+ * are not normalised away.
+ */
+ if (!userfaultfd_protected(dst_vma)) {
+ if (userfaultfd_rwp(src_vma) && pte_uffd(pte))
+ pte = pte_modify(pte, dst_vma->vm_page_prot);
+ pte = pte_clear_uffd(pte);
+ }
/* If it's a COW mapping, write protect it both processes. */
- if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
+ if (is_cow_mapping(src_vma->vm_flags) && writable) {
wrprotect_ptes(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}
@@ -1112,9 +1142,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
- if (!userfaultfd_wp(dst_vma))
- pte = pte_clear_uffd(pte);
-
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
}
@@ -5041,6 +5068,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (pte_swp_uffd(vmf->orig_pte))
pte = pte_mkuffd(pte);
+ /*
+ * A page reclaimed while RWP-protected carries the uffd bit on
+ * its swap entry. Re-apply PAGE_NONE on swap-in so the first access
+ * still traps as an RWP fault. pte_modify() preserves _PAGE_UFFD.
+ */
+ if (pte_swp_uffd(vmf->orig_pte) && userfaultfd_rwp(vma))
+ pte = pte_modify(pte, PAGE_NONE);
+
/*
* Same logic as in do_wp_page(); however, optimize for pages that are
* certainly not shared either because we just allocated them without
diff --git a/mm/migrate.c b/mm/migrate.c
index 4bdb5be7afbf..8d7fd0b056b6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -329,6 +329,10 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
if (pte_swp_uffd(old_pte))
newpte = pte_mkuffd(newpte);
+ /* See remove_migration_pte(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(old_pte) && userfaultfd_rwp(pvmw->vma))
+ newpte = pte_modify(newpte, PAGE_NONE);
+
set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
@@ -394,6 +398,10 @@ static bool remove_migration_pte(struct folio *folio,
else if (pte_swp_uffd(old_pte))
pte = pte_mkuffd(pte);
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(old_pte) && userfaultfd_rwp(vma))
+ pte = pte_modify(pte, PAGE_NONE);
+
if (folio_test_anon(folio) && !softleaf_is_migration_read(entry))
rmap_flags |= RMAP_EXCLUSIVE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7dcc94e7bfd6..cc85a8862c28 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -296,6 +296,16 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
else if (uffd_prot_resolve)
ptent = pte_clear_uffd(ptent);
+ /*
+ * The uffd bit on a VM_UFFD_RWP VMA carries PROT_NONE
+ * semantics. If mprotect() or NUMA hinting changed the
+ * base protection, restore PAGE_NONE so the PTE still
+ * traps on any access. pte_modify() preserves
+ * _PAGE_UFFD.
+ */
+ if (userfaultfd_rwp(vma) && pte_uffd(ptent))
+ ptent = pte_modify(ptent, PAGE_NONE);
+
/*
* In some writable, shared mappings, we might want
* to catch actual write access -- see
diff --git a/mm/mremap.c b/mm/mremap.c
index 12732a5c547e..8a46ec5831c8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -296,10 +296,19 @@ static int move_ptes(struct pagetable_move_control *pmc,
pte_clear(mm, new_addr, new_ptep);
else {
if (need_clear_uffd_wp) {
- if (pte_present(pte))
+ if (pte_present(pte)) {
+ /*
+ * See __copy_present_ptes(): normalise
+ * RWP PTEs so the destination starts
+ * accessible instead of taking a
+ * numa-hinting fault on first access.
+ */
+ if (userfaultfd_rwp(vma) && pte_uffd(pte))
+ pte = pte_modify(pte, vma->vm_page_prot);
pte = pte_clear_uffd(pte);
- else
+ } else {
pte = pte_swp_clear_uffd(pte);
+ }
}
set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 15fdca2da1f7..27cc299ead9b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2559,6 +2559,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mksoft_dirty(new_pte);
if (pte_swp_uffd(old_pte))
new_pte = pte_mkuffd(new_pte);
+
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(old_pte) && userfaultfd_rwp(vma))
+ new_pte = pte_modify(new_pte, PAGE_NONE);
+
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9d74be69873a..e30878e4e00b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1285,6 +1285,13 @@ static long move_present_ptes(struct mm_struct *mm,
if (pte_dirty(orig_src_pte))
orig_dst_pte = pte_mkdirty(orig_dst_pte);
orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+
+ /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma)) {
+ orig_dst_pte = pte_modify(orig_dst_pte, PAGE_NONE);
+ orig_dst_pte = pte_mkuffd(orig_dst_pte);
+ }
+
set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
src_addr += PAGE_SIZE;
@@ -1366,6 +1373,9 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
if (pgtable_supports_soft_dirty())
orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
+ /* Re-arm RWP on the moved swap entry if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma))
+ orig_src_pte = pte_swp_mkuffd(orig_src_pte);
set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
double_pt_unlock(dst_ptl, src_ptl);
@@ -1392,6 +1402,13 @@ static int move_zeropage_pte(struct mm_struct *mm,
zero_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr),
dst_vma->vm_page_prot));
+
+ /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma)) {
+ zero_pte = pte_modify(zero_pte, PAGE_NONE);
+ zero_pte = pte_mkuffd(zero_pte);
+ }
+
ptep_clear_flush(src_vma, src_addr, src_pte);
set_pte_at(mm, dst_addr, dst_pte, zero_pte);
double_pt_unlock(dst_ptl, src_ptl);
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 08/15] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (6 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 07/15] mm: preserve RWP marker across PTE rewrites Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-06-03 12:57 ` Lorenzo Stoakes
2026-05-29 17:26 ` [PATCH v6 09/15] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing Kiryl Shutsemau (Meta)
` (6 subsequent siblings)
14 siblings, 1 reply; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Three mm paths outside the fault handler gate on the uffd PTE bit
today: khugepaged (skip collapse on ranges carrying markers), rmap
(cap unmap batching), and GUP (force a fault through
gup_can_follow_protnone). Extend each to treat VM_UFFD_RWP the same
as VM_UFFD_WP; otherwise per-PTE RWP state is silently destroyed or
bypassed.
khugepaged: try_collapse_pte_mapped_thp() and
file_backed_vma_is_retractable() already refuse to collapse or
retract page tables on ranges carrying the uffd PTE bit. Broaden the
VMA predicate from userfaultfd_wp() to userfaultfd_protected() so
VM_UFFD_RWP ranges get the same protection. hpage_collapse_scan_pmd()
needs no change — its existing pte_uffd() check already catches an
RWP PTE because it carries the uffd bit.
rmap: folio_unmap_pte_batch() caps batching at 1 for VM_UFFD_RWP so
the restore path handles each PTE with its own marker.
GUP: gup_can_follow_protnone() forces a fault on VM_UFFD_RWP VMAs
regardless of FOLL_HONOR_NUMA_FAULT. RWP uses protnone as an
access-tracking marker, not for NUMA hinting, so any GUP — read or
write — must go through the userfaultfd fault path.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm.h | 16 +++++++++++++++-
mm/khugepaged.c | 18 +++++++++++-------
mm/rmap.c | 2 +-
3 files changed, 27 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3d4d5f9a6f1b..2b04f690b516 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4644,11 +4644,25 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
/*
* Indicates whether GUP can follow a PROT_NONE mapped page, or whether
- * a (NUMA hinting) fault is required.
+ * a (NUMA hinting or userfaultfd RWP) fault is required.
*/
static inline bool gup_can_follow_protnone(const struct vm_area_struct *vma,
unsigned int flags)
{
+ /*
+ * VM_UFFD_RWP uses protnone as an access-tracking marker, not for
+ * NUMA hinting. GUP must always take a fault so the access is
+ * delivered to userfaultfd, regardless of FOLL_HONOR_NUMA_FAULT.
+ *
+ * Only do so while the VMA is accessible. If it has been made
+ * inaccessible (e.g. mprotect(PROT_NONE)), fall through to the guard
+ * below: forcing a fault there would loop, as handle_mm_fault() makes
+ * no progress on protnone in an inaccessible VMA, and the access is
+ * denied regardless of RWP anyway.
+ */
+ if ((vma->vm_flags & VM_UFFD_RWP) && vma_is_accessible(vma))
+ return false;
+
/*
* If callers don't want to honor NUMA hinting faults, no need to
* determine if we would actually have to trigger a NUMA hinting fault.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index afa218be15de..4f3fedcd75cf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1895,8 +1895,11 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
return SCAN_VMA_CHECK;
- /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
- if (userfaultfd_wp(vma))
+ /*
+ * Keep pmd pgtable while the uffd bit is in use; see comment in
+ * retract_page_tables().
+ */
+ if (userfaultfd_protected(vma))
return SCAN_PTE_UFFD;
folio = filemap_lock_folio(vma->vm_file->f_mapping,
@@ -2109,13 +2112,14 @@ static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
return false;
/*
- * When a vma is registered with uffd-wp, we cannot recycle
+ * When a vma is registered with uffd-wp or RWP, we cannot recycle
* the page table because there may be pte markers installed.
- * Other vmas can still have the same file mapped hugely, but
- * skip this one: it will always be mapped in small page size
- * for uffd-wp registered ranges.
+ * VM_UFFD_RWP ranges similarly rely on per-PTE uffd state
+ * and cannot be recycled to a shared PMD. Other vmas can still
+ * have the same file mapped hugely, but skip this one: it will
+ * always be mapped in small page size for these registrations.
*/
- if (userfaultfd_wp(vma))
+ if (userfaultfd_protected(vma))
return false;
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 546bc1cf9391..9fb733489898 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1965,7 +1965,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
if (pte_unused(pte))
return 1;
- if (userfaultfd_wp(vma))
+ if (userfaultfd_protected(vma))
return 1;
/*
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 09/15] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (7 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 08/15] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 10/15] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP Kiryl Shutsemau (Meta)
` (5 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Add the userspace interface for read-write protection tracking:
- UFFDIO_REGISTER_MODE_RWP register a range for RWP tracking
- UFFD_FEATURE_RWP capability bit
- UFFDIO_RWPROTECT install / remove RWP on a range
Introduce CONFIG_USERFAULTFD_RWP, auto-selected on 64-bit kernels with
ARCH_HAS_PTE_PROTNONE and HAVE_ARCH_USERFAULTFD_WP. The symbol gates
VM_UFFD_RWP (previously aliased to VM_NONE) and the smaps/trace-flag
hooks added in the preparatory patches; without it the UAPI bits added
here have nothing to drive and would be unreachable.
Registration sets VM_UFFD_RWP on the VMA. Combining MODE_WP with
MODE_RWP is rejected because both modes claim the uffd PTE bit.
UFFDIO_RWPROTECT is the bidirectional counterpart of
UFFDIO_WRITEPROTECT:
- MODE_RWP change_protection() with MM_CP_UFFD_RWP
installs PAGE_NONE and sets the uffd bit on
present PTEs
- !MODE_RWP change_protection() with MM_CP_UFFD_RWP_RESOLVE
restores vma->vm_page_prot and clears the bit
userfaultfd_clear_vma() runs the same resolve pass on unregister so
RWP state cannot outlive the uffd.
Re-registering a range must not drop a mode that installs per-PTE
markers (WP or RWP); doing so returns -EBUSY. This also closes a
pre-existing window where re-registering without MODE_WP would strand
uffd-wp markers: before, those caused extra write-faults but were
otherwise benign; with RWP preservation in place, a subsequent
mprotect() on a VM_UFFD_RWP VMA would silently promote the stale
markers to RWP.
The feature is not yet advertised. UFFDIO_REGISTER_MODE_RWP,
UFFD_FEATURE_RWP, and _UFFDIO_RWPROTECT are intentionally absent from
UFFD_API_REGISTER_MODES, UFFD_API_FEATURES, and UFFD_API_RANGE_IOCTLS,
so UFFDIO_API masks them out and the register-mode validator rejects
the bit. The follow-up patch adds fault dispatch and exposes the UAPI.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/admin-guide/mm/userfaultfd.rst | 10 +
include/linux/userfaultfd_k.h | 2 +
include/uapi/linux/userfaultfd.h | 19 ++
mm/Kconfig | 9 +
mm/userfaultfd.c | 189 ++++++++++++++++++-
5 files changed, 226 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index e5cc8848dcb3..1e533639fd50 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -131,6 +131,16 @@ userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types (e.g. anonymous memory vs. shmem vs.
hugetlbfs), or all types of intercepted faults.
+.. note::
+
+ Re-registering an already-registered range must not drop any of the
+ modes that install per-PTE markers — currently
+ ``UFFDIO_REGISTER_MODE_WP`` and ``UFFDIO_REGISTER_MODE_RWP``. Doing
+ so would strand markers with no flag to describe them, so the call
+ is rejected with ``-EBUSY``; userspace must issue
+ ``UFFDIO_UNREGISTER`` first. This differs from older kernels, which
+ silently replaced the mode bits on re-registration.
+
Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
memory from the ``userfaultfd`` registered range). This means a userfault
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 5115827981a2..8e0833e6613f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -150,6 +150,8 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at
extern long uffd_wp_range(struct vm_area_struct *vma,
unsigned long start, unsigned long len, bool enable_wp);
+extern int mrwprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
+ unsigned long len, bool enable_rwp);
/* move_pages */
void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2841e4ea8f2c..7b78aa3b5318 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -79,6 +79,7 @@
#define _UFFDIO_WRITEPROTECT (0x06)
#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_POISON (0x08)
+#define _UFFDIO_RWPROTECT (0x09)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -103,6 +104,8 @@
struct uffdio_continue)
#define UFFDIO_POISON _IOWR(UFFDIO, _UFFDIO_POISON, \
struct uffdio_poison)
+#define UFFDIO_RWPROTECT _IOWR(UFFDIO, _UFFDIO_RWPROTECT, \
+ struct uffdio_rwprotect)
/* read() structure */
struct uffd_msg {
@@ -158,6 +161,7 @@ struct uffd_msg {
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
#define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */
#define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */
+#define UFFD_PAGEFAULT_FLAG_RWP (1<<3) /* If reason is VM_UFFD_RWP */
struct uffdio_api {
/* userland asks for an API number and the features to enable */
@@ -230,6 +234,11 @@ struct uffdio_api {
*
* UFFD_FEATURE_MOVE indicates that the kernel supports moving an
* existing page contents from userspace.
+ *
+ * UFFD_FEATURE_RWP indicates that the kernel supports
+ * UFFDIO_REGISTER_MODE_RWP for read-write protection tracking.
+ * Pages are made inaccessible via UFFDIO_RWPROTECT and faults
+ * are delivered when the pages are re-accessed.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -248,6 +257,7 @@ struct uffdio_api {
#define UFFD_FEATURE_POISON (1<<14)
#define UFFD_FEATURE_WP_ASYNC (1<<15)
#define UFFD_FEATURE_MOVE (1<<16)
+#define UFFD_FEATURE_RWP (1<<17)
__u64 features;
__u64 ioctls;
@@ -263,6 +273,7 @@ struct uffdio_register {
#define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
#define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
#define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2)
+#define UFFDIO_REGISTER_MODE_RWP ((__u64)1<<3)
__u64 mode;
/*
@@ -356,6 +367,14 @@ struct uffdio_poison {
__s64 updated;
};
+struct uffdio_rwprotect {
+ struct uffdio_range range;
+ /* !RWP means undo RWP-protection */
+#define UFFDIO_RWPROTECT_MODE_RWP ((__u64)1<<0)
+#define UFFDIO_RWPROTECT_MODE_DONTWAKE ((__u64)1<<1)
+ __u64 mode;
+};
+
struct uffdio_move {
__u64 dst;
__u64 src;
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..fac01bcfc0d1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1333,6 +1333,15 @@ config HAVE_ARCH_USERFAULTFD_MINOR
help
Arch has userfaultfd minor fault support
+config USERFAULTFD_RWP
+ def_bool y
+ depends on 64BIT && ARCH_HAS_PTE_PROTNONE && HAVE_ARCH_USERFAULTFD_WP
+ help
+ Userfaultfd read-write protection (UFFDIO_RWPROTECT) delivers a
+ userfaultfd notification on every access -- read or write -- to a
+ protected range, letting userspace observe the working set of a
+ process.
+
menuconfig USERFAULTFD
bool "Enable userfaultfd() system call"
depends on MMU
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e30878e4e00b..c07e3232a01a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1157,6 +1157,75 @@ static int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
return err;
}
+int mrwprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
+ unsigned long len, bool enable_rwp)
+{
+ struct mm_struct *dst_mm = ctx->mm;
+ unsigned long end = start + len;
+ struct vm_area_struct *dst_vma;
+ unsigned int mm_cp_flags;
+ struct mmu_gather tlb;
+ bool found = false;
+ VMA_ITERATOR(vmi, dst_mm, start);
+
+ VM_WARN_ON_ONCE(start & ~PAGE_MASK);
+ VM_WARN_ON_ONCE(len & ~PAGE_MASK);
+ VM_WARN_ON_ONCE(start + len <= start);
+
+ guard(mmap_read_lock)(dst_mm);
+ guard(rwsem_read)(&ctx->map_changing_lock);
+
+ if (atomic_read(&ctx->mmap_changing))
+ return -EAGAIN;
+
+ if (enable_rwp)
+ mm_cp_flags = MM_CP_UFFD_RWP;
+ else
+ mm_cp_flags = MM_CP_UFFD_RWP_RESOLVE;
+
+ /*
+ * Pre-scan the range: validate every spanned VMA before applying
+ * any change_protection() so a partial failure cannot leave the
+ * process with only a prefix of the range re-protected.
+ */
+ for_each_vma_range(vmi, dst_vma, end) {
+ if (!userfaultfd_rwp(dst_vma))
+ return -ENOENT;
+
+ if (is_vm_hugetlb_page(dst_vma)) {
+ unsigned long page_mask;
+
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ return -EINVAL;
+ }
+ found = true;
+ }
+ if (!found)
+ return -ENOENT;
+
+ vma_iter_set(&vmi, start);
+ tlb_gather_mmu(&tlb, dst_mm);
+ for_each_vma_range(vmi, dst_vma, end) {
+ unsigned long vma_start = max(dst_vma->vm_start, start);
+ unsigned long vma_end = min(dst_vma->vm_end, end);
+ unsigned int flags = mm_cp_flags;
+
+ /*
+ * On resolve, try to upgrade writability per-VMA --
+ * MM_CP_TRY_CHANGE_WRITABLE WARNs in
+ * maybe_change_pte_writable() if the VMA is not VM_WRITE,
+ * and RWP can be registered on PROT_READ-only mappings.
+ */
+ if (!enable_rwp && vma_wants_manual_pte_write_upgrade(dst_vma))
+ flags |= MM_CP_TRY_CHANGE_WRITABLE;
+
+ change_protection(&tlb, dst_vma, vma_start, vma_end, flags);
+ }
+ tlb_finish_mmu(&tlb);
+
+ return 0;
+}
void double_pt_lock(spinlock_t *ptl1,
spinlock_t *ptl2)
@@ -2145,6 +2214,15 @@ static bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
!vma_is_anonymous(vma))
return false;
+ /*
+ * RWP uses protnone as an access-tracking marker. PROT_NONE VMAs
+ * have vm_page_prot == PAGE_NONE, so RWP resolution can't make a
+ * page accessible -- the next access would fault again. Reject up
+ * front instead of letting FOLL_FORCE loop on protnone+uffd PTEs.
+ */
+ if ((vm_flags & VM_UFFD_RWP) && !vma_is_accessible(vma))
+ return false;
+
return ops->can_userfault(vma, vm_flags);
}
@@ -2197,9 +2275,22 @@ static struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
if (start == vma->vm_start && end == vma->vm_end)
give_up_on_oom = true;
- /* Reset ptes for the whole vma range if wr-protected */
- if (userfaultfd_wp(vma))
- uffd_wp_range(vma, start, end - start, false);
+ /* Clear the uffd bit and/or restore protnone PTEs */
+ if (userfaultfd_protected(vma)) {
+ unsigned int mm_cp_flags = 0;
+ struct mmu_gather tlb;
+
+ if (userfaultfd_wp(vma))
+ mm_cp_flags |= MM_CP_UFFD_WP_RESOLVE;
+ if (userfaultfd_rwp(vma))
+ mm_cp_flags |= MM_CP_UFFD_RWP_RESOLVE;
+ if (vma_wants_manual_pte_write_upgrade(vma))
+ mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm);
+ change_protection(&tlb, vma, start, end, mm_cp_flags);
+ tlb_finish_mmu(&tlb);
+ }
ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
&new_vma_flags, NULL_VM_UFFD_CTX,
@@ -2248,6 +2339,14 @@ static int userfaultfd_register_range(struct userfaultfd_ctx *ctx,
vma_test_all_mask(vma, vma_flags))
goto skip;
+ /*
+ * Pre-scan in userfaultfd_register() already rejected mode
+ * switches that would drop VM_UFFD_WP or VM_UFFD_RWP, so a
+ * stray bit here is a bug.
+ */
+ VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx == ctx &&
+ vma->vm_flags & (VM_UFFD_WP | VM_UFFD_RWP) & ~vm_flags);
+
if (vma->vm_start > start)
start = vma->vm_start;
vma_end = min(end, vma->vm_end);
@@ -2514,6 +2613,8 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
if (reason & VM_UFFD_WP)
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+ if (reason & VM_UFFD_RWP)
+ msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_RWP;
if (reason & VM_UFFD_MINOR)
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
if (features & UFFD_FEATURE_THREAD_ID)
@@ -3613,6 +3714,22 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vm_flags |= VM_UFFD_WP;
}
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_RWP) {
+ if (!pgtable_supports_uffd() || VM_UFFD_RWP == VM_NONE)
+ goto out;
+ if (!(ctx->features & UFFD_FEATURE_RWP))
+ goto out;
+ vm_flags |= VM_UFFD_RWP;
+ }
+
+ /*
+ * WP and RWP share the uffd PTE bit and
+ * cannot coexist in the same VMA — the bit would carry ambiguous
+ * semantics. Reject the combination up front.
+ */
+ if ((vm_flags & VM_UFFD_WP) && (vm_flags & VM_UFFD_RWP))
+ goto out;
+
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) {
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
goto out;
@@ -3706,6 +3823,16 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
cur->vm_userfaultfd_ctx.ctx != ctx)
goto out_unlock;
+ /*
+ * Mode switches that drop VM_UFFD_WP or VM_UFFD_RWP would
+ * leave PTE markers without the flag that describes them;
+ * subsequent mprotect() would then promote stale markers
+ * into the other mode. Require an unregister first.
+ */
+ if (cur->vm_userfaultfd_ctx.ctx == ctx &&
+ cur->vm_flags & (VM_UFFD_WP | VM_UFFD_RWP) & ~vm_flags)
+ goto out_unlock;
+
/*
* Note vmas containing huge pages
*/
@@ -3739,6 +3866,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
+ /* RWPROTECT is only supported for RWP ranges */
+ if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_RWP))
+ ioctls_out &= ~((__u64)1 << _UFFDIO_RWPROTECT);
+
/*
* Now that we scanned all vmas we can already tell
* userland which ioctls methods are guaranteed to
@@ -4086,6 +4217,55 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
return ret;
}
+static int userfaultfd_rwprotect(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ int ret;
+ struct uffdio_rwprotect uffdio_rwp;
+ struct userfaultfd_wake_range range;
+ bool mode_rwp, mode_dontwake;
+
+ if (atomic_read(&ctx->mmap_changing))
+ return -EAGAIN;
+
+ if (copy_from_user(&uffdio_rwp, (void __user *)arg,
+ sizeof(uffdio_rwp)))
+ return -EFAULT;
+
+ ret = validate_range(ctx->mm, uffdio_rwp.range.start,
+ uffdio_rwp.range.len);
+ if (ret)
+ return ret;
+
+ if (uffdio_rwp.mode & ~(UFFDIO_RWPROTECT_MODE_DONTWAKE |
+ UFFDIO_RWPROTECT_MODE_RWP))
+ return -EINVAL;
+
+ mode_rwp = uffdio_rwp.mode & UFFDIO_RWPROTECT_MODE_RWP;
+ mode_dontwake = uffdio_rwp.mode & UFFDIO_RWPROTECT_MODE_DONTWAKE;
+
+ if (mode_rwp && mode_dontwake)
+ return -EINVAL;
+
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mrwprotect_range(ctx, uffdio_rwp.range.start,
+ uffdio_rwp.range.len, mode_rwp);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
+ if (ret)
+ return ret;
+
+ if (!mode_rwp && !mode_dontwake) {
+ range.start = uffdio_rwp.range.start;
+ range.len = uffdio_rwp.range.len;
+ wake_userfault(ctx, &range);
+ }
+ return ret;
+}
+
static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
{
__s64 ret;
@@ -4392,6 +4572,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_POISON:
ret = userfaultfd_poison(ctx, arg);
break;
+ case UFFDIO_RWPROTECT:
+ ret = userfaultfd_rwprotect(ctx, arg);
+ break;
}
return ret;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 10/15] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (8 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 09/15] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 11/15] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking Kiryl Shutsemau (Meta)
` (4 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Wire the fault side of read-write protection tracking and turn the
userspace interface on.
An RWP-protected PTE is PAGE_NONE with the uffd bit set. The
PROT_NONE triggers a fault on any access; the uffd bit distinguishes
it from plain mprotect(PROT_NONE) or NUMA hinting.
Fault dispatch, per level:
PTE handle_pte_fault() -> do_uffd_rwp()
PMD __handle_mm_fault() -> do_huge_pmd_uffd_rwp()
hugetlb hugetlb_fault() -> hugetlb_handle_userfault()
The RWP branches gate on userfaultfd_pte_rwp() / userfaultfd_huge_pmd_rwp()
(VM_UFFD_RWP plus the uffd bit) and fall through to do_numa_page() /
do_huge_pmd_numa_page() otherwise. Each delivers a
UFFD_PAGEFAULT_FLAG_RWP message through handle_userfault(); the handler
resolves it with UFFDIO_RWPROTECT clearing MODE_RWP.
userfaultfd_must_wait() and userfaultfd_huge_must_wait() add matching
protnone+uffd waiters so sync-mode fault handlers block correctly.
Expose the UAPI:
UFFDIO_REGISTER_MODE_RWP -> UFFD_API_REGISTER_MODES
UFFD_FEATURE_RWP -> UFFD_API_FEATURES
_UFFDIO_RWPROTECT -> UFFD_API_RANGE_IOCTLS
UFFD_API_RANGE_IOCTLS_BASIC
UFFD_FEATURE_RWP is masked out at UFFDIO_API time when PROT_NONE is
not available or VM_UFFD_RWP aliases VM_NONE (32-bit), so userspace
never sees an advertised-but-broken feature.
Works on anonymous, shmem, and hugetlb memory.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/huge_mm.h | 7 +++++++
include/linux/userfaultfd_k.h | 24 ++++++++++++++++++++++++
include/uapi/linux/userfaultfd.h | 12 ++++++++----
mm/huge_memory.c | 5 +++++
mm/hugetlb.c | 11 +++++++++++
mm/memory.c | 31 +++++++++++++++++++++++++++++--
mm/userfaultfd.c | 32 ++++++++++++++++++++++++++++++--
7 files changed, 114 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index edece3e26985..fe48d76957fb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -529,6 +529,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf);
+
vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
extern struct folio *huge_zero_folio;
@@ -716,6 +718,11 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
return NULL;
}
+static inline vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf)
+{
+ return 0;
+}
+
static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
return 0;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 8e0833e6613f..6b633ec694e1 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -236,6 +236,18 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
return userfaultfd_wp(vma) && pmd_uffd(pmd);
}
+static inline bool userfaultfd_pte_rwp(struct vm_area_struct *vma,
+ pte_t pte)
+{
+ return userfaultfd_rwp(vma) && pte_uffd(pte);
+}
+
+static inline bool userfaultfd_huge_pmd_rwp(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+ return userfaultfd_rwp(vma) && pmd_uffd(pmd);
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
@@ -366,6 +378,18 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
return false;
}
+static inline bool userfaultfd_pte_rwp(struct vm_area_struct *vma,
+ pte_t pte)
+{
+ return false;
+}
+
+static inline bool userfaultfd_huge_pmd_rwp(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+ return false;
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return false;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7b78aa3b5318..d803e76d47ad 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -25,7 +25,8 @@
#define UFFD_API ((__u64)0xAA)
#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
UFFDIO_REGISTER_MODE_WP | \
- UFFDIO_REGISTER_MODE_MINOR)
+ UFFDIO_REGISTER_MODE_MINOR | \
+ UFFDIO_REGISTER_MODE_RWP)
#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
UFFD_FEATURE_EVENT_FORK | \
UFFD_FEATURE_EVENT_REMAP | \
@@ -42,7 +43,8 @@
UFFD_FEATURE_WP_UNPOPULATED | \
UFFD_FEATURE_POISON | \
UFFD_FEATURE_WP_ASYNC | \
- UFFD_FEATURE_MOVE)
+ UFFD_FEATURE_MOVE | \
+ UFFD_FEATURE_RWP)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -54,13 +56,15 @@
(__u64)1 << _UFFDIO_MOVE | \
(__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE | \
- (__u64)1 << _UFFDIO_POISON)
+ (__u64)1 << _UFFDIO_POISON | \
+ (__u64)1 << _UFFDIO_RWPROTECT)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE | \
- (__u64)1 << _UFFDIO_POISON)
+ (__u64)1 << _UFFDIO_POISON | \
+ (__u64)1 << _UFFDIO_RWPROTECT)
/*
* Valid ioctl command number range with this API is from 0x00 to
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6417d883d2e4..72cb44332004 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2289,6 +2289,11 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
return pmd_dirty(pmd);
}
+vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf)
+{
+ return handle_userfault(vmf, VM_UFFD_RWP);
+}
+
/* NUMA hinting page fault entry point for trans huge pmds */
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0d8d39cd8888..d4da39d698b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6062,6 +6062,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mutex;
}
+ /*
+ * Protnone hugetlb PTEs with the uffd bit are used by
+ * userfaultfd RWP for access tracking. Plain PROT_NONE (without the
+ * marker) is not an RWP fault and is not expected on hugetlb (no
+ * NUMA hinting), so let normal hugetlb fault handling proceed.
+ */
+ if (pte_protnone(vmf.orig_pte) && vma_is_accessible(vma) &&
+ userfaultfd_rwp(vma) && huge_pte_uffd(vmf.orig_pte)) {
+ return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP);
+ }
+
/*
* If we are going to COW/unshare the mapping later, we examine the
* pending reservations for this page now. This will ensure that any
diff --git a/mm/memory.c b/mm/memory.c
index 06473285c0dc..4f8b8dff0b7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6122,6 +6122,16 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
if (!pte_present(ptent) || !pte_protnone(ptent))
continue;
+ /*
+ * RWP-armed PTEs are also protnone but carry _PAGE_UFFD as a
+ * marker. Leave them alone -- rewriting to vm_page_prot would
+ * stop the RWP trap. Gate on userfaultfd_rwp(vma) too:
+ * NUMA balancing preserves _PAGE_UFFD on UFFD_WP-marked PTEs
+ * when applying PROT_NONE, and those still need rebuilding.
+ */
+ if (userfaultfd_rwp(vma) && pte_uffd(ptent))
+ continue;
+
if (pfn_folio(pte_pfn(ptent)) != folio)
continue;
@@ -6137,6 +6147,12 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
}
}
+static vm_fault_t do_uffd_rwp(struct vm_fault *vmf)
+{
+ pte_unmap(vmf->pte);
+ return handle_userfault(vmf, VM_UFFD_RWP);
+}
+
static vm_fault_t do_numa_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -6412,8 +6428,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
- if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+ /*
+ * RWP-protected PTEs are protnone plus the uffd bit. On a
+ * VM_UFFD_RWP VMA, a protnone PTE without the uffd bit is
+ * NUMA hinting and must still fall through to do_numa_page().
+ */
+ if (userfaultfd_pte_rwp(vmf->vma, vmf->orig_pte))
+ return do_uffd_rwp(vmf);
return do_numa_page(vmf);
+ }
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
@@ -6527,8 +6551,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
- if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+ if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+ if (userfaultfd_huge_pmd_rwp(vma, vmf.orig_pmd))
+ return do_huge_pmd_uffd_rwp(&vmf);
return do_huge_pmd_numa_page(&vmf);
+ }
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
!pmd_write(vmf.orig_pmd)) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c07e3232a01a..db3707b9d977 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2668,6 +2668,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
*/
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
return true;
+ /*
+ * PTE is still RW-protected (protnone with uffd bit), wait for
+ * resolution. Plain PROT_NONE without the marker is not an RWP fault.
+ */
+ if (pte_protnone(pte) && huge_pte_uffd(pte) && (reason & VM_UFFD_RWP))
+ return true;
return false;
}
@@ -2728,8 +2734,14 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pmd_present(_pmd))
return false;
- if (pmd_trans_huge(_pmd))
- return !pmd_write(_pmd) && (reason & VM_UFFD_WP);
+ if (pmd_trans_huge(_pmd)) {
+ if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+ return true;
+ if (pmd_protnone(_pmd) && pmd_uffd(_pmd) &&
+ (reason & VM_UFFD_RWP))
+ return true;
+ return false;
+ }
pte = pte_offset_map(pmd, address);
if (!pte)
@@ -2765,6 +2777,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (!pte_write(ptent) && (reason & VM_UFFD_WP))
goto out;
+ /*
+ * PTE is still RW-protected (protnone with uffd bit), wait for
+ * userspace to resolve. Plain PROT_NONE without the marker is not
+ * an RWP fault.
+ */
+ if (pte_protnone(ptent) && pte_uffd(ptent) && (reason & VM_UFFD_RWP))
+ goto out;
ret = false;
out:
@@ -4506,6 +4525,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
}
+ /*
+ * RWP needs both PROT_NONE support and the uffd-wp PTE bit. The
+ * VM_UFFD_RWP check covers compile-time unavailability; the
+ * pgtable_supports_uffd() check covers runtime (e.g. riscv
+ * without the SVRSW60T59B extension) where the PTE bit is declared
+ * but not actually usable.
+ */
+ if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
+ uffdio_api.features &= ~UFFD_FEATURE_RWP;
ret = -EINVAL;
if (features & ~uffdio_api.features)
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 11/15] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (9 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 10/15] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 12/15] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution Kiryl Shutsemau (Meta)
` (3 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
PAGEMAP_SCAN already reports PAGE_IS_WRITTEN from the inverted uffd
PTE bit, targeting the UFFDIO_WRITEPROTECT workflow. UFFDIO_RWPROTECT
reuses the same PTE bit as a marker for read-write protection, but
"has been written" and "has been accessed" are distinct semantic
signals — they happen to share one PTE bit today only because the two
implementations share infrastructure.
Give RWP its own pagemap category so the UAPI does not conflate them:
PAGE_IS_WRITTEN reported on VM_UFFD_WP VMAs, !pte_uffd(pte)
PAGE_IS_ACCESSED reported on VM_UFFD_RWP VMAs, !pte_uffd(pte)
Both still read the same PTE bit today, but each is scoped to the VMA
whose registered mode makes the bit meaningful. If a future
implementation moves RWP to a separate PTE bit, only PAGE_IS_ACCESSED
switches over.
This is a UAPI narrowing. Outside VM_UFFD_WP VMAs the uffd bit is
always clear, so PAGEMAP_SCAN used to flag PAGE_IS_WRITTEN on every
present PTE there — a meaningless duplicate of PAGE_IS_PRESENT. Now
PAGE_IS_WRITTEN fires only inside VM_UFFD_WP VMAs.
pagemap_hugetlb_category() now takes the vma like its PTE/PMD peers.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/admin-guide/mm/pagemap.rst | 13 +++--
fs/proc/task_mmu.c | 63 +++++++++++++++++-------
include/uapi/linux/fs.h | 1 +
tools/include/uapi/linux/fs.h | 1 +
4 files changed, 57 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index c57e61b5d8aa..ffa690a171c8 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -19,8 +19,11 @@ There are four components to pagemap:
* Bit 55 pte is soft-dirty (see
Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2)
- * Bit 57 pte is uffd-wp write-protected (since 5.13) (see
- Documentation/admin-guide/mm/userfaultfd.rst)
+ * Bit 57 pte is tracked by userfaultfd (since 5.13) — in a
+ ``VM_UFFD_WP`` VMA this indicates a write-protected PTE; in a
+ ``VM_UFFD_RWP`` VMA it indicates an RWP-protected PTE. WP and
+ RWP are mutually exclusive per VMA, so the meaning is
+ unambiguous. See Documentation/admin-guide/mm/userfaultfd.rst.
* Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page)
* Bits 59-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
@@ -244,7 +247,8 @@ in this IOCTL:
Following flags about pages are currently supported:
- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
-- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
+- ``PAGE_IS_WRITTEN`` - Page in a ``UFFDIO_REGISTER_MODE_WP`` VMA has been
+ written to since it was write-protected. Only reported inside such VMAs.
- ``PAGE_IS_FILE`` - Page is file backed
- ``PAGE_IS_PRESENT`` - Page is present in the memory
- ``PAGE_IS_SWAPPED`` - Page is in swapped
@@ -252,6 +256,9 @@ Following flags about pages are currently supported:
- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
- ``PAGE_IS_GUARD`` - Page is a part of a guard region
+- ``PAGE_IS_ACCESSED`` - Page in a ``UFFDIO_REGISTER_MODE_RWP`` VMA has been
+ accessed since RWP was applied. Only reported inside such VMAs. See
+ Documentation/admin-guide/mm/userfaultfd.rst for the RWP workflow.
The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca0f69b347e8..a1683d73b405 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2284,7 +2284,7 @@ static const struct mm_walk_ops pagemap_ops = {
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped
- * Bit 57 pte is uffd-wp write-protected
+ * Bit 57 pte is tracked by userfaultfd (uffd-wp or RWP)
* Bit 58 pte is a guard region
* Bits 59-60 zero
* Bit 61 page is file-page or shared-anon
@@ -2419,7 +2419,7 @@ static int pagemap_release(struct inode *inode, struct file *file)
PAGE_IS_FILE | PAGE_IS_PRESENT | \
PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY | \
- PAGE_IS_GUARD)
+ PAGE_IS_GUARD | PAGE_IS_ACCESSED)
#define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
struct pagemap_scan_private {
@@ -2444,8 +2444,12 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_PRESENT;
- if (!pte_uffd(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!pte_uffd(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (p->masks_of_interest & PAGE_IS_FILE) {
page = vm_normal_page(vma, addr, pte);
@@ -2462,8 +2466,12 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_any(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!pte_swp_uffd_any(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
entry = softleaf_from_pte(pte);
if (softleaf_is_guard_marker(entry))
@@ -2512,8 +2520,12 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
struct page *page;
categories |= PAGE_IS_PRESENT;
- if (!pmd_uffd(pmd))
- categories |= PAGE_IS_WRITTEN;
+ if (!pmd_uffd(pmd)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (p->masks_of_interest & PAGE_IS_FILE) {
page = vm_normal_page_pmd(vma, addr, pmd);
@@ -2527,8 +2539,12 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
categories |= PAGE_IS_SOFT_DIRTY;
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pmd_swp_uffd(pmd))
- categories |= PAGE_IS_WRITTEN;
+ if (!pmd_swp_uffd(pmd)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (pmd_swp_soft_dirty(pmd))
categories |= PAGE_IS_SOFT_DIRTY;
@@ -2561,7 +2577,8 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#ifdef CONFIG_HUGETLB_PAGE
-static unsigned long pagemap_hugetlb_category(pte_t pte)
+static unsigned long pagemap_hugetlb_category(struct vm_area_struct *vma,
+ pte_t pte)
{
unsigned long categories = PAGE_IS_HUGE;
@@ -2576,8 +2593,12 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
if (pte_present(pte)) {
categories |= PAGE_IS_PRESENT;
- if (!huge_pte_uffd(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!huge_pte_uffd(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (!PageAnon(pte_page(pte)))
categories |= PAGE_IS_FILE;
if (is_zero_pfn(pte_pfn(pte)))
@@ -2587,8 +2608,12 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_any(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!pte_swp_uffd_any(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (pte_swp_soft_dirty(pte))
categories |= PAGE_IS_SOFT_DIRTY;
}
@@ -2864,7 +2889,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
goto flush_and_return;
}
- if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
+ if (userfaultfd_wp(vma) && !p->arg.category_anyof_mask &&
+ !p->arg.category_inverted &&
p->arg.category_mask == PAGE_IS_WRITTEN &&
p->arg.return_mask == PAGE_IS_WRITTEN) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
@@ -2939,7 +2965,8 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
/* Go the short route when not write-protecting pages. */
pte = huge_ptep_get(walk->mm, start, ptep);
- categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+ categories = p->cur_vma_category |
+ pagemap_hugetlb_category(vma, pte);
if (!pagemap_scan_is_interesting_page(categories, p))
return 0;
@@ -2951,7 +2978,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
pte = huge_ptep_get(walk->mm, start, ptep);
- categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+ categories = p->cur_vma_category | pagemap_hugetlb_category(vma, pte);
if (!pagemap_scan_is_interesting_page(categories, p))
goto out_unlock;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 13f71202845e..c4aeaa0c31c7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -455,6 +455,7 @@ typedef int __bitwise __kernel_rwf_t;
#define PAGE_IS_HUGE (1 << 6)
#define PAGE_IS_SOFT_DIRTY (1 << 7)
#define PAGE_IS_GUARD (1 << 8)
+#define PAGE_IS_ACCESSED (1 << 9)
/*
* struct page_region - Page region with flags
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index 24ddf7bc4f25..f0a26309b6d5 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -364,6 +364,7 @@ typedef int __bitwise __kernel_rwf_t;
#define PAGE_IS_HUGE (1 << 6)
#define PAGE_IS_SOFT_DIRTY (1 << 7)
#define PAGE_IS_GUARD (1 << 8)
+#define PAGE_IS_ACCESSED (1 << 9)
/*
* struct page_region - Page region with flags
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 12/15] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (10 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 11/15] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 13/15] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle Kiryl Shutsemau (Meta)
` (2 subsequent siblings)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Sync RWP delivers a message and blocks the faulting thread until the
handler resolves the fault. For working-set tracking the VMM does not
need the message: it just needs to know, at scan time, which pages
were touched. Async RWP serves that use case — the kernel restores
access in-place and the faulting thread continues without blocking.
The VMM reconstructs the access pattern after the fact via
PAGEMAP_SCAN: pages whose uffd bit is still set (inverted
PAGE_IS_ACCESSED) were not re-accessed since the last RWP cycle.
Worth calling out: async resolution upgrades writable private anon
PTEs via pte_mkwrite() when can_change_pte_writable() allows, mirroring
do_numa_page(). Without it, every re-access of an RWP'd writable page
would COW-fault a second time.
UFFD_FEATURE_RWP_ASYNC requires UFFD_FEATURE_RWP.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/userfaultfd_k.h | 6 ++++++
include/uapi/linux/userfaultfd.h | 11 ++++++++++-
mm/huge_memory.c | 25 ++++++++++++++++++++++++-
mm/hugetlb.c | 32 +++++++++++++++++++++++++++++++-
mm/memory.c | 27 +++++++++++++++++++++++++--
mm/userfaultfd.c | 19 ++++++++++++++++++-
6 files changed, 114 insertions(+), 6 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 6b633ec694e1..dd3c8ba97296 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -281,6 +281,7 @@ extern void userfaultfd_unmap_complete(struct mm_struct *mm,
struct list_head *uf);
extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
+extern bool userfaultfd_rwp_async(struct vm_area_struct *vma);
static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
{
@@ -459,6 +460,11 @@ static inline bool userfaultfd_wp_async(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_rwp_async(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
{
return false;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index d803e76d47ad..c10f08f8a618 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -44,7 +44,8 @@
UFFD_FEATURE_POISON | \
UFFD_FEATURE_WP_ASYNC | \
UFFD_FEATURE_MOVE | \
- UFFD_FEATURE_RWP)
+ UFFD_FEATURE_RWP | \
+ UFFD_FEATURE_RWP_ASYNC)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -243,6 +244,13 @@ struct uffdio_api {
* UFFDIO_REGISTER_MODE_RWP for read-write protection tracking.
* Pages are made inaccessible via UFFDIO_RWPROTECT and faults
* are delivered when the pages are re-accessed.
+ *
+ * UFFD_FEATURE_RWP_ASYNC indicates asynchronous mode for
+ * UFFDIO_REGISTER_MODE_RWP. When set, faults on read-write
+ * protected pages are auto-resolved by the kernel (PTE
+ * permissions restored immediately) without delivering a message
+ * to the userfaultfd handler. Use PAGEMAP_SCAN with inverted
+ * PAGE_IS_ACCESSED to find pages that were not re-accessed.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -262,6 +270,7 @@ struct uffdio_api {
#define UFFD_FEATURE_WP_ASYNC (1<<15)
#define UFFD_FEATURE_MOVE (1<<16)
#define UFFD_FEATURE_RWP (1<<17)
+#define UFFD_FEATURE_RWP_ASYNC (1<<18)
__u64 features;
__u64 ioctls;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72cb44332004..8f120452d995 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2291,7 +2291,30 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf)
{
- return handle_userfault(vmf, VM_UFFD_RWP);
+ struct vm_area_struct *vma = vmf->vma;
+ pmd_t pmd;
+
+ if (!userfaultfd_rwp_async(vma))
+ return handle_userfault(vmf, VM_UFFD_RWP);
+
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+ pmd = pmd_modify(vmf->orig_pmd, vma->vm_page_prot);
+ /* pmd_modify() preserves _PAGE_UFFD; drop it on resolution */
+ pmd = pmd_clear_uffd(pmd);
+ pmd = pmd_mkyoung(pmd);
+ if (!pmd_write(pmd) &&
+ vma_wants_manual_pte_write_upgrade(vma) &&
+ can_change_pmd_writable(vma, vmf->address, pmd))
+ pmd = pmd_mkwrite(pmd, vma);
+ set_pmd_at(vma->vm_mm, vmf->address & HPAGE_PMD_MASK,
+ vmf->pmd, pmd);
+ update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+ spin_unlock(vmf->ptl);
+ return 0;
}
/* NUMA hinting page fault entry point for trans huge pmds */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d4da39d698b8..9da52d95b3fb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6070,7 +6070,37 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (pte_protnone(vmf.orig_pte) && vma_is_accessible(vma) &&
userfaultfd_rwp(vma) && huge_pte_uffd(vmf.orig_pte)) {
- return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP);
+ spinlock_t *ptl;
+ pte_t pte;
+
+ /* Sync: drop hugetlb locks before blocking in handle_userfault() */
+ if (!userfaultfd_rwp_async(vma))
+ return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP);
+
+ ptl = huge_pte_lock(h, mm, vmf.pte);
+ pte = huge_ptep_get(mm, vmf.address, vmf.pte);
+ if (pte_protnone(pte) && huge_pte_uffd(pte)) {
+ unsigned int shift = huge_page_shift(h);
+
+ pte = huge_pte_modify(pte, vma->vm_page_prot);
+ pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ /* huge_pte_modify() preserves _PAGE_UFFD; drop it on resolution */
+ pte = huge_pte_clear_uffd(pte);
+ pte = pte_mkyoung(pte);
+ /*
+ * Unlike do_uffd_rwp(), do not upgrade to writable
+ * here. Hugetlb lacks a can_change_huge_pte_writable()
+ * equivalent, so a write access will take a separate
+ * COW fault — acceptable for the rare private hugetlb
+ * case.
+ */
+ set_huge_pte_at(mm, vmf.address, vmf.pte, pte,
+ huge_page_size(h));
+ update_mmu_cache(vma, vmf.address, vmf.pte);
+ }
+ spin_unlock(ptl);
+ ret = 0;
+ goto out_mutex;
}
/*
diff --git a/mm/memory.c b/mm/memory.c
index 4f8b8dff0b7f..43b5e63c368b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6149,8 +6149,31 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
static vm_fault_t do_uffd_rwp(struct vm_fault *vmf)
{
- pte_unmap(vmf->pte);
- return handle_userfault(vmf, VM_UFFD_RWP);
+ pte_t pte;
+
+ if (!userfaultfd_rwp_async(vmf->vma)) {
+ /* Sync mode: unmap PTE and deliver to userfaultfd handler */
+ pte_unmap(vmf->pte);
+ return handle_userfault(vmf, VM_UFFD_RWP);
+ }
+
+ spin_lock(vmf->ptl);
+ if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+ pte = pte_modify(vmf->orig_pte, vmf->vma->vm_page_prot);
+ /* pte_modify() preserves _PAGE_UFFD; drop it on resolution */
+ pte = pte_clear_uffd(pte);
+ pte = pte_mkyoung(pte);
+ if (!pte_write(pte) &&
+ vma_wants_manual_pte_write_upgrade(vmf->vma) &&
+ can_change_pte_writable(vmf->vma, vmf->address, pte))
+ pte = pte_mkwrite(pte, vmf->vma);
+ set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);
+ update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}
static vm_fault_t do_numa_page(struct vm_fault *vmf)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index db3707b9d977..f40bf473a6f6 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2487,6 +2487,11 @@ static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
}
+static bool userfaultfd_rwp_async_ctx(struct userfaultfd_ctx *ctx)
+{
+ return ctx && (ctx->features & UFFD_FEATURE_RWP_ASYNC);
+}
+
/*
* Whether WP_UNPOPULATED is enabled on the uffd context. It is only
* meaningful when userfaultfd_wp()==true on the vma and when it's
@@ -4408,6 +4413,11 @@ bool userfaultfd_wp_async(struct vm_area_struct *vma)
return userfaultfd_wp_async_ctx(vma->vm_userfaultfd_ctx.ctx);
}
+bool userfaultfd_rwp_async(struct vm_area_struct *vma)
+{
+ return userfaultfd_rwp_async_ctx(vma->vm_userfaultfd_ctx.ctx);
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -4511,6 +4521,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
if (features & UFFD_FEATURE_WP_ASYNC)
features |= UFFD_FEATURE_WP_UNPOPULATED;
+ ret = -EINVAL;
+ /* RWP_ASYNC requires RWP */
+ if ((features & UFFD_FEATURE_RWP_ASYNC) &&
+ !(features & UFFD_FEATURE_RWP))
+ goto err_out;
+
/* report all available features and ioctls to userland */
uffdio_api.features = UFFD_API_FEATURES;
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
@@ -4533,7 +4549,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
* but not actually usable.
*/
if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
- uffdio_api.features &= ~UFFD_FEATURE_RWP;
+ uffdio_api.features &=
+ ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC);
ret = -EINVAL;
if (features & ~uffdio_api.features)
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 13/15] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (11 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 12/15] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 14/15] selftests/mm: add userfaultfd RWP tests Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 15/15] Documentation/userfaultfd: document RWP working set tracking Kiryl Shutsemau (Meta)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Add an ioctl to toggle async mode at runtime without re-registering
the userfaultfd. This allows a VMM to switch between sync and async
RWP modes on-the-fly -- for example, starting in async mode for
working set scanning, then switching to sync mode to intercept faults
during page eviction.
UFFDIO_SET_MODE takes an enable/disable bitmask of UFFD_FEATURE_*
flags. Only UFFD_FEATURE_RWP_ASYNC is toggleable today; the ioctl
rejects any other bit with -EINVAL. Enabling RWP_ASYNC also requires
RWP to have been negotiated at UFFDIO_API time, mirroring the
UFFDIO_API invariant.
Fault-path readers of ctx->features run under mmap_read_lock or a
per-VMA lock; the RMW takes mmap_write_lock and calls
vma_start_write() on every UFFD-armed VMA, so those readers are fully
excluded. userfaultfd_show_fdinfo(), however, reads ctx->features
without any lock, so the RMW is written as a single WRITE_ONCE and
fdinfo reads it with READ_ONCE. That keeps the lockless observer from
seeing a mid-RMW intermediate and removes the audit burden when new
toggleable bits are added later.
When switching to async, pending sync waiters are woken so they retry
and auto-resolve under the new mode.
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/uapi/linux/userfaultfd.h | 14 +++
mm/userfaultfd.c | 150 +++++++++++++++++++++++++------
2 files changed, 136 insertions(+), 28 deletions(-)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c10f08f8a618..cea11aad6b54 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -49,6 +49,7 @@
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
+ (__u64)1 << _UFFDIO_SET_MODE | \
(__u64)1 << _UFFDIO_API)
#define UFFD_API_RANGE_IOCTLS \
((__u64)1 << _UFFDIO_WAKE | \
@@ -85,6 +86,7 @@
#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_POISON (0x08)
#define _UFFDIO_RWPROTECT (0x09)
+#define _UFFDIO_SET_MODE (0x0A)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -111,6 +113,8 @@
struct uffdio_poison)
#define UFFDIO_RWPROTECT _IOWR(UFFDIO, _UFFDIO_RWPROTECT, \
struct uffdio_rwprotect)
+#define UFFDIO_SET_MODE _IOW(UFFDIO, _UFFDIO_SET_MODE, \
+ struct uffdio_set_mode)
/* read() structure */
struct uffd_msg {
@@ -406,6 +410,16 @@ struct uffdio_move {
__s64 move;
};
+struct uffdio_set_mode {
+ /*
+ * Toggle async mode for features at runtime.
+ * Supported: UFFD_FEATURE_RWP_ASYNC.
+ * Setting a bit in both enable and disable is invalid.
+ */
+ __u64 enable;
+ __u64 disable;
+};
+
/*
* Flags for the userfaultfd(2) system call itself.
*/
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f40bf473a6f6..f172ec14a6c8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2477,19 +2477,29 @@ struct userfaultfd_wake_range {
/* internal indication that UFFD_API ioctl was successfully executed */
#define UFFD_FEATURE_INITIALIZED (1u << 31)
+/*
+ * UFFDIO_SET_MODE updates ctx->features under mmap_write_lock with
+ * WRITE_ONCE; readers that run outside mmap_read_lock or the per-VMA
+ * lock (poll/read_iter/ioctl, fdinfo) must pair with READ_ONCE.
+ */
+static unsigned int userfaultfd_features(struct userfaultfd_ctx *ctx)
+{
+ return READ_ONCE(ctx->features);
+}
+
static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
{
- return ctx->features & UFFD_FEATURE_INITIALIZED;
+ return userfaultfd_features(ctx) & UFFD_FEATURE_INITIALIZED;
}
static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
{
- return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
+ return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_WP_ASYNC);
}
static bool userfaultfd_rwp_async_ctx(struct userfaultfd_ctx *ctx)
{
- return ctx && (ctx->features & UFFD_FEATURE_RWP_ASYNC);
+ return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_RWP_ASYNC);
}
/*
@@ -2504,7 +2514,7 @@ bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
if (!ctx)
return false;
- return ctx->features & UFFD_FEATURE_WP_UNPOPULATED;
+ return userfaultfd_features(ctx) & UFFD_FEATURE_WP_UNPOPULATED;
}
static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
@@ -4290,6 +4300,109 @@ static int userfaultfd_rwprotect(struct userfaultfd_ctx *ctx,
return ret;
}
+/* Subset of UFFD_API_FEATURES actually supported by this kernel/arch */
+static __u64 uffd_api_available_features(void)
+{
+ __u64 f = UFFD_API_FEATURES;
+
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_MINOR))
+ f &= ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
+ if (!pgtable_supports_uffd())
+ f &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ if (!uffd_supports_wp_marker())
+ f &= ~(UFFD_FEATURE_WP_HUGETLBFS_SHMEM |
+ UFFD_FEATURE_WP_UNPOPULATED |
+ UFFD_FEATURE_WP_ASYNC);
+ /*
+ * RWP needs both PROT_NONE support and the uffd PTE bit. The
+ * VM_UFFD_RWP check covers compile-time unavailability; the
+ * pgtable_supports_uffd() check covers runtime (e.g. riscv
+ * without the SVRSW60T59B extension) where the PTE bit is declared
+ * but not actually usable.
+ */
+ if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
+ f &= ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC);
+ return f;
+}
+
+/* Async features that can be toggled at runtime via UFFDIO_SET_MODE */
+#define UFFD_FEATURE_TOGGLEABLE UFFD_FEATURE_RWP_ASYNC
+
+static int userfaultfd_set_mode(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ struct uffdio_set_mode mode;
+ struct mm_struct *mm = ctx->mm;
+
+ if (copy_from_user(&mode, (void __user *)arg, sizeof(mode)))
+ return -EFAULT;
+
+ /* enable and disable must not overlap */
+ if (mode.enable & mode.disable)
+ return -EINVAL;
+
+ /* only toggleable features that this kernel/arch actually supports */
+ if ((mode.enable | mode.disable) &
+ ~(uffd_api_available_features() & UFFD_FEATURE_TOGGLEABLE))
+ return -EINVAL;
+
+ /* RWP_ASYNC can only be enabled on contexts that negotiated RWP */
+ if ((mode.enable & UFFD_FEATURE_RWP_ASYNC) &&
+ !(userfaultfd_features(ctx) & UFFD_FEATURE_RWP))
+ return -EINVAL;
+
+ if (!mmget_not_zero(mm))
+ return -ESRCH;
+
+ /*
+ * Drain in-flight faults before flipping features. mmap_write_lock()
+ * blocks new mmap_read_lock() callers, but per-VMA locked faults
+ * (lock_vma_under_rcu() + FAULT_FLAG_VMA_LOCK) that acquired before
+ * this point keep running. Calling vma_start_write() on each UFFD-
+ * armed VMA waits for those readers to drop, so no in-flight fault
+ * can observe the old features after mmap_write_unlock().
+ */
+ mmap_write_lock(mm);
+ {
+ struct vm_area_struct *vma;
+ VMA_ITERATOR(vmi, mm, 0);
+
+ for_each_vma(vmi, vma) {
+ if (vma->vm_userfaultfd_ctx.ctx == ctx)
+ vma_start_write(vma);
+ }
+ }
+ /*
+ * Single WRITE_ONCE so lockless readers (fdinfo, poll/read_iter
+ * via userfaultfd_is_initialized(), and the userfaultfd_features()
+ * helper used elsewhere) can't observe a mid-RMW intermediate
+ * value. Hot-path readers already serialise through the mmap lock
+ * + vma_start_write() drain above, so their load doesn't need an
+ * annotation.
+ */
+ WRITE_ONCE(ctx->features,
+ (ctx->features | mode.enable) & ~mode.disable);
+ mmap_write_unlock(mm);
+
+ /*
+ * If switching to async, wake threads blocked in handle_userfault().
+ * They will retry the fault and auto-resolve under the new mode.
+ * len=0 means wake all pending faults on this context.
+ */
+ if (mode.enable & UFFD_FEATURE_RWP_ASYNC) {
+ struct userfaultfd_wake_range range = { .len = 0 };
+
+ spin_lock_irq(&ctx->fault_pending_wqh.lock);
+ __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
+ &range);
+ __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
+ spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+ }
+
+ mmput(mm);
+ return 0;
+}
+
static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
{
__s64 ret;
@@ -4528,29 +4641,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
goto err_out;
/* report all available features and ioctls to userland */
- uffdio_api.features = UFFD_API_FEATURES;
-#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
- uffdio_api.features &=
- ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
-#endif
- if (!pgtable_supports_uffd())
- uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
-
- if (!uffd_supports_wp_marker()) {
- uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
- uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
- uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
- }
- /*
- * RWP needs both PROT_NONE support and the uffd-wp PTE bit. The
- * VM_UFFD_RWP check covers compile-time unavailability; the
- * pgtable_supports_uffd() check covers runtime (e.g. riscv
- * without the SVRSW60T59B extension) where the PTE bit is declared
- * but not actually usable.
- */
- if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
- uffdio_api.features &=
- ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC);
+ uffdio_api.features = uffd_api_available_features();
ret = -EINVAL;
if (features & ~uffdio_api.features)
@@ -4620,6 +4711,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_RWPROTECT:
ret = userfaultfd_rwprotect(ctx, arg);
break;
+ case UFFDIO_SET_MODE:
+ ret = userfaultfd_set_mode(ctx, arg);
+ break;
}
return ret;
}
@@ -4647,7 +4741,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
* protocols: aa:... bb:...
*/
seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
- pending, total, UFFD_API, ctx->features,
+ pending, total, UFFD_API, userfaultfd_features(ctx),
UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
}
#endif
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 14/15] selftests/mm: add userfaultfd RWP tests
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (12 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 13/15] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
2026-06-02 22:18 ` Askar Safin
2026-05-29 17:26 ` [PATCH v6 15/15] Documentation/userfaultfd: document RWP working set tracking Kiryl Shutsemau (Meta)
14 siblings, 1 reply; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Coverage for UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT:
rwp-async async mode — touch pages, verify permissions are
auto-restored without a message
rwp-sync sync mode — access blocks, handler resolves via
UFFDIO_RWPROTECT
rwp-pagemap PAGEMAP_SCAN reports still-cold pages via
inverted PAGE_IS_ACCESSED
rwp-mprotect RWP survives mprotect(PROT_NONE) ->
mprotect(PROT_READ|PROT_WRITE) round-trip
rwp-gup GUP walks through a protnone RWP PTE (pipe
write/read drives the GUP path)
rwp-async-toggle UFFDIO_SET_MODE flips between sync and async
without re-registering
rwp-close closing the uffd restores page permissions
rwp-fork RWP survives fork() with EVENT_FORK; child's
PTEs keep the uffd bit
rwp-fork-pin RWP survives fork() on an RO-longterm-pinned
anon page (forces copy_present_page()); child
read auto-resolves and clears the bit, proving
PAGE_NONE was in place
rwp-wp-exclusive register with MODE_WP|MODE_RWP returns -EINVAL
All tests run against anon, shmem, shmem-private, hugetlb, and
hugetlb-private memory, except rwp-fork-pin which is anon-only —
copy_present_page() is the private-anon pinned-exclusive fork path.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 765 +++++++++++++++++++
1 file changed, 765 insertions(+)
diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index a6c14109e818..9f5a5ccf6044 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -7,6 +7,8 @@
#include "uffd-common.h"
+#include <linux/fs.h>
+#include <sys/uio.h>
#include "../../../../mm/gup_test.h"
#ifdef __NR_userfaultfd
@@ -109,6 +111,10 @@ static void uffd_test_skip(const char *message)
static void test_uffd_api(bool use_dev)
{
+ const uint64_t expected_ioctls =
+ BIT_ULL(_UFFDIO_REGISTER) |
+ BIT_ULL(_UFFDIO_UNREGISTER) |
+ BIT_ULL(_UFFDIO_API);
struct uffdio_api uffdio_api;
int uffd;
@@ -148,6 +154,15 @@ static void test_uffd_api(bool use_dev)
goto out;
}
+ /* Verify returned fd-level ioctls bitmask */
+ if ((uffdio_api.ioctls & expected_ioctls) != expected_ioctls) {
+ uffd_test_fail("UFFDIO_API missing expected ioctls: "
+ "got=0x%"PRIx64", expected=0x%"PRIx64,
+ (uint64_t)uffdio_api.ioctls,
+ expected_ioctls);
+ goto out;
+ }
+
/* Test double requests of UFFDIO_API with a random feature set */
uffdio_api.features = BIT_ULL(0);
if (ioctl(uffd, UFFDIO_API, &uffdio_api) == 0) {
@@ -602,6 +617,685 @@ void uffd_minor_collapse_test(uffd_global_test_opts_t *gopts, uffd_test_args_t *
uffd_minor_test_common(gopts, true, false);
}
+static int uffd_register_rwp(int uffd, void *addr, uint64_t len)
+{
+ struct uffdio_register reg = {
+ .range = { .start = (unsigned long)addr, .len = len },
+ .mode = UFFDIO_REGISTER_MODE_RWP,
+ };
+
+ if (ioctl(uffd, UFFDIO_REGISTER, ®) == -1)
+ return -errno;
+ return 0;
+}
+
+static void rwprotect_range(int uffd, __u64 start, __u64 len, bool protect)
+{
+ struct uffdio_rwprotect rwp = {
+ .range = { .start = start, .len = len },
+ .mode = protect ? UFFDIO_RWPROTECT_MODE_RWP : 0,
+ };
+
+ if (ioctl(uffd, UFFDIO_RWPROTECT, &rwp))
+ err("UFFDIO_RWPROTECT failed");
+}
+
+static void set_async_mode(int uffd, bool enable)
+{
+ struct uffdio_set_mode mode = { };
+
+ if (enable)
+ mode.enable = UFFD_FEATURE_RWP_ASYNC;
+ else
+ mode.disable = UFFD_FEATURE_RWP_ASYNC;
+
+ if (ioctl(uffd, UFFDIO_SET_MODE, &mode))
+ err("UFFDIO_SET_MODE failed");
+}
+
+/*
+ * Test async RWP faults on anonymous memory.
+ * Populate pages, register MODE_RWP with RWP_ASYNC,
+ * RW-protect, re-access, verify content preserved and no faults delivered.
+ */
+static void uffd_rwp_async_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+
+ /* Populate all pages with known content */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ /* Register MODE_RWP */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ /* RW-protect all pages (sets protnone) */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Access all pages — should auto-resolve, no faults */
+ for (p = 0; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+ unsigned char expected = p % 255 + 1;
+
+ if (page[0] != expected) {
+ uffd_test_fail("page %lu content mismatch: %u != %u",
+ p, page[0], expected);
+ return;
+ }
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Fault handler for RWP — unprotect the page via UFFDIO_RWPROTECT.
+ */
+static void uffd_handle_rwp_fault(uffd_global_test_opts_t *gopts,
+ struct uffd_msg *msg,
+ struct uffd_args *uargs)
+{
+ if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_RWP))
+ err("expected RWP fault, got 0x%llx",
+ msg->arg.pagefault.flags);
+
+ rwprotect_range(gopts->uffd, msg->arg.pagefault.address,
+ gopts->page_size, false);
+ uargs->minor_faults++;
+}
+
+/*
+ * Test sync RWP faults on anonymous memory.
+ * Populate pages, register MODE_RWP (sync), RW-protect,
+ * access from worker thread, verify fault delivered, UFFDIO_RWPROTECT resolves.
+ */
+static void uffd_rwp_sync_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ pthread_t uffd_mon;
+ struct uffd_args uargs = { };
+ bool failed = false;
+ char c = '\0';
+ unsigned long p;
+
+ uargs.gopts = gopts;
+ uargs.handle_fault = uffd_handle_rwp_fault;
+
+ /* Populate all pages */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ /* Register MODE_RWP */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ /* RW-protect all pages */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Start fault handler thread */
+ if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+ err("uffd_poll_thread create");
+
+ /* Access all pages — triggers sync RWP faults, handler unprotects */
+ for (p = 0; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+
+ if (page[0] != (p % 255 + 1)) {
+ uffd_test_fail("page %lu content mismatch", p);
+ failed = true;
+ goto out;
+ }
+ }
+
+out:
+ /*
+ * Stop the handler before reading minor_faults: the last fault
+ * resolution rwprotect_range()s before incrementing the counter,
+ * so the main thread can race ahead of the increment.
+ */
+ if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+ err("pipe write");
+ if (pthread_join(uffd_mon, NULL))
+ err("join() failed");
+
+ if (failed)
+ return;
+ if (uargs.minor_faults == 0)
+ uffd_test_fail("expected RWP faults, got 0");
+ else
+ uffd_test_pass();
+}
+
+/*
+ * Test PAGEMAP_SCAN detection of RW-protected (cold) pages.
+ */
+static void uffd_rwp_pagemap_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+ struct page_region regions[16];
+ struct pm_scan_arg pm_arg;
+ int pagemap_fd;
+ long ret;
+
+ /* Need at least 4 pages */
+ if (nr_pages < 4) {
+ uffd_test_skip("need at least 4 pages");
+ return;
+ }
+
+ /* Populate all pages */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+ /* Register and RW-protect */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Touch first half of pages to re-activate them (async auto-resolve) */
+ for (p = 0; p < nr_pages / 2; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page;
+ }
+
+ /* Scan for cold (still RW-protected) pages */
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ err("open pagemap");
+
+ /*
+ * PAGE_IS_ACCESSED is set once the uffd-wp bit has been cleared
+ * (access happened, or the user resolved). Invert it to select
+ * still-protected (cold) pages.
+ */
+ memset(&pm_arg, 0, sizeof(pm_arg));
+ pm_arg.size = sizeof(pm_arg);
+ pm_arg.start = (uint64_t)gopts->area_dst;
+ pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+ pm_arg.vec = (uint64_t)regions;
+ pm_arg.vec_len = ARRAY_SIZE(regions);
+ pm_arg.category_mask = PAGE_IS_ACCESSED;
+ pm_arg.category_inverted = PAGE_IS_ACCESSED;
+ pm_arg.return_mask = PAGE_IS_ACCESSED;
+
+ ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+ close(pagemap_fd);
+
+ if (ret < 0) {
+ uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+ return;
+ }
+
+ /*
+ * The second half of pages should be reported as RW-protected.
+ * They may be coalesced into one region.
+ */
+ if (ret < 1) {
+ uffd_test_fail("expected cold pages, got %ld regions", ret);
+ return;
+ }
+
+ /* Verify the cold region covers the second half */
+ uint64_t cold_start = regions[0].start;
+ uint64_t expected_start = (uint64_t)gopts->area_dst +
+ (nr_pages / 2) * page_size;
+
+ if (cold_start != expected_start) {
+ uffd_test_fail("cold region starts at 0x%lx, expected 0x%lx",
+ (unsigned long)cold_start,
+ (unsigned long)expected_start);
+ return;
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Test that RWP protection survives a mprotect(PROT_NONE) ->
+ * mprotect(PROT_READ|PROT_WRITE) round-trip. The uffd-wp bit on a
+ * VM_UFFD_RWP VMA must continue to carry PROT_NONE semantics after
+ * mprotect() changes the base protection; otherwise accesses would
+ * silently succeed and the pagemap bit would stick without a fault
+ * ever clearing it.
+ */
+static void uffd_rwp_mprotect_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+ struct page_region regions[16];
+ struct pm_scan_arg pm_arg;
+ int pagemap_fd;
+ long ret;
+
+ /* Populate all pages */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+ /* Register and RW-protect the whole range */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Round-trip mprotect(): PROT_NONE -> PROT_READ|PROT_WRITE */
+ if (mprotect(gopts->area_dst, nr_pages * page_size, PROT_NONE))
+ err("mprotect() PROT_NONE");
+ if (mprotect(gopts->area_dst, nr_pages * page_size,
+ PROT_READ | PROT_WRITE))
+ err("mprotect() PROT_READ|PROT_WRITE");
+
+ /* Touch every page. Async RWP must auto-resolve each fault. */
+ for (p = 0; p < nr_pages; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page;
+ }
+
+ /*
+ * After touching, no page should remain RW-protected. A stuck
+ * uffd-wp bit would mean mprotect() silently dropped PROT_NONE and
+ * the access never faulted.
+ */
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ err("open pagemap");
+
+ memset(&pm_arg, 0, sizeof(pm_arg));
+ pm_arg.size = sizeof(pm_arg);
+ pm_arg.start = (uint64_t)gopts->area_dst;
+ pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+ pm_arg.vec = (uint64_t)regions;
+ pm_arg.vec_len = ARRAY_SIZE(regions);
+ pm_arg.category_mask = PAGE_IS_ACCESSED;
+ pm_arg.category_inverted = PAGE_IS_ACCESSED;
+ pm_arg.return_mask = PAGE_IS_ACCESSED;
+
+ ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+ close(pagemap_fd);
+
+ if (ret < 0) {
+ uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+ return;
+ }
+ if (ret != 0) {
+ uffd_test_fail("expected no cold pages after mprotect()+touch, got %ld regions",
+ ret);
+ return;
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Test that GUP resolves through protnone PTEs (async mode).
+ * vmsplice() into a pipe pins user pages via get_user_pages_fast() --
+ * unlike write(), which goes through copy_from_user() and ordinary
+ * hardware page faults -- so it exercises gup_can_follow_protnone() on
+ * the RW-protected PTE. In async mode the kernel auto-restores
+ * permissions and GUP returns the page.
+ */
+static void uffd_rwp_gup_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ struct iovec iov;
+ char buf;
+ int pipefd[2];
+
+ /* Populate first page with known content */
+ memset(gopts->area_dst, 0xCD, gopts->page_size);
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst, gopts->page_size))
+ err("register failure");
+
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ gopts->page_size, true);
+
+ if (pipe(pipefd))
+ err("pipe");
+
+ /*
+ * One byte's worth of iov is enough to GUP the containing page and
+ * keeps the pipe transfer well under any pipe-capacity limit even on
+ * hugetlb-backed runs.
+ */
+ iov.iov_base = gopts->area_dst;
+ iov.iov_len = 1;
+ if (vmsplice(pipefd[1], &iov, 1, 0) != 1) {
+ uffd_test_fail("vmsplice from RW-protected page failed: %s",
+ strerror(errno));
+ goto out;
+ }
+
+ if (read(pipefd[0], &buf, 1) != 1) {
+ uffd_test_fail("read from pipe failed");
+ goto out;
+ }
+
+ if (buf != (char)0xCD) {
+ uffd_test_fail("content mismatch: got 0x%02x, expected 0xCD",
+ (unsigned char)buf);
+ goto out;
+ }
+
+ uffd_test_pass();
+out:
+ close(pipefd[0]);
+ close(pipefd[1]);
+}
+
+/*
+ * Test runtime toggle between async and sync modes.
+ * Start in async mode (detection), flip to sync (eviction), verify faults
+ * block, resolve them, flip back to async.
+ */
+static void uffd_rwp_async_toggle_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ struct uffd_args uargs = { };
+ pthread_t uffd_mon;
+ char c = '\0';
+ unsigned long p;
+
+ uargs.gopts = gopts;
+ uargs.handle_fault = uffd_handle_rwp_fault;
+
+ /* Populate */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ /* Phase 1: async detection — RW-protect, access first half */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ for (p = 0; p < nr_pages / 2; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page; /* auto-resolves in async mode */
+ }
+
+ /* Phase 2: flip to sync for eviction */
+ set_async_mode(gopts->uffd, false);
+
+ /* Start handler — will receive faults for cold pages */
+ if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+ err("uffd_poll_thread create");
+
+ /* Access second half (cold pages) — should trigger sync faults */
+ for (p = nr_pages / 2; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+ if (page[0] != (p % 255 + 1)) {
+ uffd_test_fail("page %lu content mismatch", p);
+ goto out;
+ }
+ }
+
+ /*
+ * Stop the handler before reading minor_faults: the last fault
+ * resolution rwprotect_range()s before incrementing the counter,
+ * so the main thread can race ahead of the increment. Stopping
+ * here also makes Phase 3 a clean async-only test -- with the
+ * handler still running it would silently resolve any sync fault
+ * the kernel erroneously delivers, masking a regression.
+ */
+ if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+ err("pipe write");
+ if (pthread_join(uffd_mon, NULL))
+ err("join() failed");
+
+ if (uargs.minor_faults == 0) {
+ uffd_test_fail("expected sync faults, got 0");
+ return;
+ }
+
+ /* Phase 3: flip back to async */
+ set_async_mode(gopts->uffd, true);
+
+ /* RW-protect and access again — should auto-resolve */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ for (p = 0; p < nr_pages; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page;
+ }
+
+ uffd_test_pass();
+ return;
+out:
+ if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+ err("pipe write");
+ if (pthread_join(uffd_mon, NULL))
+ err("join() failed");
+}
+
+/*
+ * Test that RW-protected pages become accessible after closing uffd.
+ */
+static void uffd_rwp_close_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+
+ /* Populate */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Close uffd — should restore protnone PTEs */
+ close(gopts->uffd);
+ gopts->uffd = -1;
+
+ /* All pages should be accessible with original content */
+ for (p = 0; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+ unsigned char expected = p % 255 + 1;
+
+ if (page[0] != expected) {
+ uffd_test_fail("page %lu not accessible after close", p);
+ return;
+ }
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Test that RWP protection is preserved across fork() when
+ * UFFD_FEATURE_EVENT_FORK is enabled. Without preservation, the child's
+ * PTEs would lose the uffd-wp marker and RWP-protected accesses would
+ * silently fall through to do_numa_page().
+ */
+static void uffd_rwp_fork_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ int pagemap_fd;
+ uint64_t value;
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failed");
+
+ /* Populate + RWP-protect */
+ *gopts->area_dst = 1;
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ page_size, true);
+
+ /* Parent: verify uffd-wp bit is set before fork */
+ pagemap_fd = pagemap_open();
+ value = pagemap_get_entry(pagemap_fd, gopts->area_dst);
+ pagemap_check_wp(value, true);
+
+ /*
+ * Fork with EVENT_FORK: child inherits VM_UFFD_RWP. Child reads
+ * its own pagemap and must still see the uffd-wp bit set.
+ */
+ if (pagemap_test_fork(gopts, true, false)) {
+ uffd_test_fail("RWP marker lost in child after fork");
+ goto out;
+ }
+
+ uffd_test_pass();
+out:
+ close(pagemap_fd);
+}
+
+/*
+ * Test that RWP protection on a pinned anon page is preserved across fork().
+ * Pinning forces copy_present_page() in the child path, which must restore
+ * PAGE_NONE on top of the uffd bit. Using async mode, a read in the child
+ * auto-resolves if — and only if — the PTE was actually protnone+uffd; the
+ * cleared uffd bit afterward proves the fault path ran.
+ */
+static void uffd_rwp_fork_pin_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long page_size = gopts->page_size;
+ fork_event_args fevent_args = { .gopts = gopts, .child_uffd = -1 };
+ pin_args pin_args = {};
+ int pagemap_fd, status;
+ pthread_t fevent_thread;
+ uint64_t value;
+ pid_t child;
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst, page_size))
+ err("register failed");
+
+ /* Populate. */
+ *gopts->area_dst = 1;
+
+ /* RO-longterm pin so fork() takes copy_present_page() for this PTE. */
+ if (pin_pages(&pin_args, gopts->area_dst, page_size)) {
+ uffd_test_skip("Possibly CONFIG_GUP_TEST missing or unprivileged");
+ uffd_unregister(gopts->uffd, gopts->area_dst, page_size);
+ return;
+ }
+
+ /* RWP-protect: PTE is now PAGE_NONE + uffd bit. */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst, page_size, true);
+
+ pagemap_fd = pagemap_open();
+ value = pagemap_get_entry(pagemap_fd, gopts->area_dst);
+ pagemap_check_wp(value, true);
+
+ /*
+ * UFFD_FEATURE_EVENT_FORK is required so the child inherits
+ * VM_UFFD_RWP and the marker; without it dup_userfaultfd() resets
+ * the child VMA and the test would pass for the wrong reason.
+ * dup_userfaultfd() blocks until the EVENT_FORK message is consumed,
+ * so spawn a reader before the fork().
+ */
+ gopts->ready_for_fork = false;
+ if (pthread_create(&fevent_thread, NULL, fork_event_consumer,
+ &fevent_args))
+ err("pthread_create() for fork event consumer");
+ while (!gopts->ready_for_fork)
+ ; /* Wait for consumer to start polling. */
+
+ child = fork();
+ if (child < 0)
+ err("fork");
+ if (child == 0) {
+ volatile char c;
+ int cfd;
+
+ /*
+ * Read the pinned page. Only reaches the fault path if the
+ * child PTE is protnone + uffd; async mode auto-resolves and
+ * clears the uffd bit. If copy_present_page() dropped
+ * PAGE_NONE, the read would silently succeed and the bit
+ * would still be set.
+ */
+ c = *(volatile char *)gopts->area_dst;
+ (void)c;
+
+ cfd = pagemap_open();
+ value = pagemap_get_entry(cfd, gopts->area_dst);
+ close(cfd);
+ _exit((value & PM_UFFD_WP) ? 1 : 0);
+ }
+ if (waitpid(child, &status, 0) < 0)
+ err("waitpid");
+ if (pthread_join(fevent_thread, NULL))
+ err("pthread_join() for fork event consumer");
+ if (fevent_args.child_uffd >= 0)
+ close(fevent_args.child_uffd);
+
+ unpin_pages(&pin_args);
+ close(pagemap_fd);
+ if (uffd_unregister(gopts->uffd, gopts->area_dst, page_size))
+ err("unregister failed");
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+ uffd_test_fail("RWP not enforced in child after pinned fork");
+ return;
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * WP and RWP share the uffd-wp PTE bit and cannot coexist in the same VMA.
+ * Registration requesting both modes must be rejected.
+ */
+static void uffd_rwp_wp_exclusive_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ struct uffdio_register reg = { };
+
+ reg.range.start = (unsigned long)gopts->area_dst;
+ reg.range.len = nr_pages * page_size;
+ reg.mode = UFFDIO_REGISTER_MODE_WP | UFFDIO_REGISTER_MODE_RWP;
+
+ if (ioctl(gopts->uffd, UFFDIO_REGISTER, ®) == 0) {
+ uffd_test_fail("register with WP|RWP unexpectedly succeeded");
+ return;
+ }
+ if (errno != EINVAL) {
+ uffd_test_fail("register with WP|RWP: expected EINVAL, got %d",
+ errno);
+ return;
+ }
+ uffd_test_pass();
+}
+
static sigjmp_buf jbuf, *sigbuf;
static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -1604,6 +2298,77 @@ uffd_test_case_t uffd_tests[] = {
/* We can't test MADV_COLLAPSE, so try our luck */
.uffd_feature_required = UFFD_FEATURE_MINOR_SHMEM,
},
+ {
+ .name = "rwp-async",
+ .uffd_fn = uffd_rwp_async_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-sync",
+ .uffd_fn = uffd_rwp_sync_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required = UFFD_FEATURE_RWP,
+ },
+ {
+ .name = "rwp-pagemap",
+ .uffd_fn = uffd_rwp_pagemap_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-mprotect",
+ .uffd_fn = uffd_rwp_mprotect_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-gup",
+ .uffd_fn = uffd_rwp_gup_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-async-toggle",
+ .uffd_fn = uffd_rwp_async_toggle_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-close",
+ .uffd_fn = uffd_rwp_close_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required = UFFD_FEATURE_RWP,
+ },
+ {
+ .name = "rwp-fork",
+ .uffd_fn = uffd_rwp_fork_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_EVENT_FORK,
+ },
+ {
+ .name = "rwp-fork-pin",
+ .uffd_fn = uffd_rwp_fork_pin_test,
+ .mem_targets = MEM_ANON,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC |
+ UFFD_FEATURE_EVENT_FORK,
+ },
+ {
+ .name = "rwp-wp-exclusive",
+ .uffd_fn = uffd_rwp_wp_exclusive_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP |
+ UFFD_FEATURE_PAGEFAULT_FLAG_WP |
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
+ },
{
.name = "sigbus",
.uffd_fn = uffd_sigbus_test,
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v6 15/15] Documentation/userfaultfd: document RWP working set tracking
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
` (13 preceding siblings ...)
2026-05-29 17:26 ` [PATCH v6 14/15] selftests/mm: add userfaultfd RWP tests Kiryl Shutsemau (Meta)
@ 2026-05-29 17:26 ` Kiryl Shutsemau (Meta)
14 siblings, 0 replies; 21+ messages in thread
From: Kiryl Shutsemau (Meta) @ 2026-05-29 17:26 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas
Add an admin-guide section covering UFFDIO_REGISTER_MODE_RWP:
- sync and async fault models;
- UFFDIO_RWPROTECT semantics;
- UFFD_FEATURE_RWP_ASYNC;
- UFFDIO_SET_MODE runtime mode flips.
It also covers typical VMM working-set-tracking workflow from detection
loop through sync-mode eviction and back to async.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
Documentation/admin-guide/mm/userfaultfd.rst | 243 ++++++++++++++++++-
1 file changed, 237 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 1e533639fd50..2a72e54962c8 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -275,16 +275,16 @@ tracking and it can be different in a few ways:
- Dirty information will not get lost if the pte was zapped due to
various reasons (e.g. during split of a shmem transparent huge page).
- - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit
- set; dirty when uffd-wp bit cleared), it has different semantics on
- some of the memory operations. For example: ``MADV_DONTNEED`` on
+ - Due to a reverted meaning of soft-dirty (page clean when the uffd bit
+ is set; dirty when the uffd bit is cleared), it has different semantics
+ on some of the memory operations. For example: ``MADV_DONTNEED`` on
anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as
- dirtying of memory by dropping uffd-wp bit during the procedure.
+ dirtying of memory by dropping the uffd bit during the procedure.
The user app can collect the "written/dirty" status by looking up the
-uffd-wp bit for the pages being interested in /proc/pagemap.
+uffd bit for the pages being interested in /proc/pagemap.
-The page will not be under track of uffd-wp async mode until the page is
+The page will not be under track of userfaultfd-wp async mode until the page is
explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode
flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault
that was tracked by async mode userfaultfd-wp is invalid.
@@ -307,6 +307,237 @@ transparent to the guest, we want that same address range to act as if it was
still poisoned, even though it's on a new physical host which ostensibly
doesn't have a memory error in the exact same spot.
+Read-Write Protection
+---------------------
+
+``UFFDIO_REGISTER_MODE_RWP`` enables read-write protection tracking on a
+memory range. It is similar to (but faster than) ``mprotect(PROT_NONE)``
+combined with a signal handler; unlike ``mprotect(PROT_NONE)``, RWP only
+traps accesses to *present* PTEs, so accesses to unpopulated addresses in a
+protected range fall through to the normal missing-page path. It uses the
+PROT_NONE hinting mechanism (same as NUMA balancing) to make pages
+inaccessible while keeping them resident in memory. Works on anonymous,
+shmem, and hugetlbfs memory.
+
+RWP is designed for VM memory managers that need to track the working set
+of guest memory for cold page eviction to tiered or remote storage.
+
+**Setup:**
+
+1. Open a userfaultfd and enable ``UFFD_FEATURE_RWP`` via ``UFFDIO_API``.
+ Optionally request ``UFFD_FEATURE_RWP_ASYNC`` as well — it requires
+ ``UFFD_FEATURE_RWP`` to be set in the same ``UFFDIO_API`` call.
+
+2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_RWP``
+ (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be
+ fetched back from storage).
+
+**Feature availability:**
+
+RWP is built on top of two kernel primitives: a spare PTE bit owned by
+userfaultfd (``CONFIG_HAVE_ARCH_USERFAULTFD_WP``) and architecture support
+for present-but-inaccessible PTEs (``CONFIG_ARCH_HAS_PTE_PROTNONE``). When both
+are available on a 64-bit kernel, the build selects
+``CONFIG_USERFAULTFD_RWP=y`` and the ``VM_UFFD_RWP`` VMA flag becomes
+available.
+
+``UFFD_FEATURE_RWP`` and ``UFFD_FEATURE_RWP_ASYNC`` are unavailable when
+the running kernel or architecture does not support them — for example
+32-bit kernels (where ``VM_UFFD_RWP`` is unavailable), kernels built
+without ``CONFIG_USERFAULTFD_RWP``, and architectures whose ptes cannot
+carry the uffd bit at runtime (e.g. riscv without the ``SVRSW60T59B``
+extension). Requesting an unsupported feature in
+``uffdio_api.features`` makes ``UFFDIO_API`` fail with ``EINVAL`` and
+leaves the userfaultfd context uninitialized; the bitmask returned in
+``uffdio_api.features`` then advertises the features the kernel does
+support. The recommended probe sequence is therefore to open a
+throwaway userfaultfd, call ``UFFDIO_API`` once with ``features = 0``,
+inspect the returned bitmask, close that fd, then open the real one
+and call ``UFFDIO_API`` again with only the supported features set.
+
+**Protecting and Unprotecting:**
+
+Use ``UFFDIO_RWPROTECT`` to protect or unprotect a range, mirroring the
+``UFFDIO_WRITEPROTECT`` interface::
+
+ struct uffdio_rwprotect rwp = {
+ .range = { .start = addr, .len = len },
+ .mode = UFFDIO_RWPROTECT_MODE_RWP, /* protect */
+ };
+ ioctl(uffd, UFFDIO_RWPROTECT, &rwp);
+
+Setting ``UFFDIO_RWPROTECT_MODE_RWP`` sets PROT_NONE on present PTEs in the
+range. Pages stay resident and their physical frames are preserved — only
+access permissions are removed.
+
+Clearing ``UFFDIO_RWPROTECT_MODE_RWP`` restores normal VMA permissions and
+wakes any faulting threads (unless ``UFFDIO_RWPROTECT_MODE_DONTWAKE`` is set).
+
+**Scope of protection:**
+
+RWP protection is a property of *present* PTEs. ``UFFDIO_RWPROTECT`` only
+affects entries that are already populated. Unpopulated addresses within
+the range remain unpopulated; when first accessed they fault through the
+normal missing path (``do_anonymous_page()``, ``do_swap_page()``,
+``finish_fault()``) and the resulting PTE is not RWP-protected. To observe
+the population itself, co-register the range with
+``UFFDIO_REGISTER_MODE_MISSING``.
+
+Protection is preserved across page reclaim: a page swapped out while
+RWP-protected carries the marker on its swap entry, and swap-in restores
+the PROT_NONE state so the first access after swap-in still faults. The
+same applies to pages temporarily replaced by migration entries.
+
+Operations that drop the PTE entirely — ``MADV_DONTNEED`` on anonymous
+memory, hole-punch on shmem, truncation of a file mapping — also drop the
+RWP marker: the next access re-populates the range without protection.
+Unlike WP (which persists via ``PTE_MARKER_UFFD_WP``), there is no
+persistent RWP marker today. The user needs to re-arm the range with
+``UFFDIO_RWPROTECT`` after any operation that explicitly frees PTEs.
+
+**Fault Handling:**
+
+When a protected page is accessed:
+
+- **Sync mode** (default): The faulting thread blocks and a
+ ``UFFD_PAGEFAULT_FLAG_RWP`` message is delivered to the userfaultfd
+ handler. The handler resolves the fault with ``UFFDIO_RWPROTECT``
+ (clearing ``MODE_RWP``), which restores the PTE permissions and wakes
+ the faulting thread.
+
+- **Async mode** (``UFFD_FEATURE_RWP_ASYNC``): The kernel automatically
+ restores PTE permissions and the thread continues without blocking. No
+ message is delivered to the handler.
+
+**Runtime Mode Switching:**
+
+``UFFDIO_SET_MODE`` toggles ``UFFD_FEATURE_RWP_ASYNC`` at runtime, allowing
+the VMM to switch between lightweight async detection and safe sync
+eviction without re-registering. The toggle takes ``mmap_write_lock()``
+and calls ``vma_start_write()`` on each UFFD-armed VMA, draining
+in-flight per-VMA-locked faults before the new mode takes effect.
+
+**Cold Page Detection with PAGEMAP_SCAN:**
+
+RWP-protected PTEs carry the uffd PTE bit; the fault-resolution path
+clears it. ``PAGEMAP_SCAN`` reports ``PAGE_IS_ACCESSED`` once the bit is
+clear on a ``VM_UFFD_RWP`` VMA, so inverting it efficiently reports the
+still-protected (cold) pages. Require ``PAGE_IS_PRESENT`` too so memory
+holes (which carry neither category bit) are filtered out::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .start = guest_mem_start,
+ .end = guest_mem_end,
+ .vec = (uint64_t)regions,
+ .vec_len = regions_len,
+ .category_mask = PAGE_IS_PRESENT | PAGE_IS_ACCESSED,
+ .category_inverted = PAGE_IS_ACCESSED,
+ .return_mask = PAGE_IS_ACCESSED,
+ };
+ long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
+
+The returned ``page_region`` array contains contiguous cold ranges that can
+then be evicted.
+
+**Cleanup:**
+
+When the userfaultfd is closed or the range is unregistered, all PROT_NONE
+PTEs are automatically restored to their normal VMA permissions. This
+prevents pages from becoming permanently inaccessible.
+
+**VMM Working Set Tracking Workflow:**
+
+A typical VMM lifecycle for cold page eviction to tiered storage. Two
+mappings of the same shmem (or hugetlbfs) file are used: ``guest_mem`` is
+the RWP-registered mapping that vCPUs access through, and ``io_mem`` is a
+private mapping for VMM-side I/O. Reading ``io_mem`` does not go through
+the RWP-protected PTEs of ``guest_mem``, so the VMM's own ``pwrite()``
+never traps on its own ::
+
+ /* One-time setup */
+ fd = memfd_create("guest", MFD_CLOEXEC);
+ ftruncate(fd, guest_size);
+ guest_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0); /* vCPU view, RWP-registered */
+ io_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0); /* VMM I/O view, unprotected */
+
+ uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK);
+ struct uffdio_api api = {
+ .api = UFFD_API,
+ .features = UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ };
+ ioctl(uffd, UFFDIO_API, &api);
+ if (!(api.features & UFFD_FEATURE_RWP))
+ /* RWP unavailable on this kernel/arch -- fall back. */
+ ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){
+ .range = { guest_mem, guest_size },
+ .mode = UFFDIO_REGISTER_MODE_RWP |
+ UFFDIO_REGISTER_MODE_MISSING,
+ });
+
+ /* Tracking loop */
+ while (vm_running) {
+ /* 1. Detection phase (async -- no vCPU stalls) */
+ ioctl(uffd, UFFDIO_RWPROTECT, &(struct uffdio_rwprotect){
+ .range = full_range,
+ .mode = UFFDIO_RWPROTECT_MODE_RWP });
+ sleep(tracking_interval);
+
+ /*
+ * 2. Switch to sync BEFORE scanning. In async mode a vCPU
+ * access between the scan and any eviction step silently
+ * clears the uffd bit, so the scan would already disagree
+ * with the page state by the time eviction begins. Sync mode
+ * blocks vCPU accesses, freezing the cold snapshot for the
+ * rest of the iteration.
+ */
+ ioctl(uffd, UFFDIO_SET_MODE,
+ &(struct uffdio_set_mode){
+ .disable = UFFD_FEATURE_RWP_ASYNC });
+
+ /* 3. Find cold pages (uffd bit still set, page present) */
+ ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){
+ .category_mask = PAGE_IS_PRESENT | PAGE_IS_ACCESSED,
+ .category_inverted = PAGE_IS_ACCESSED,
+ .return_mask = PAGE_IS_ACCESSED,
+ ...
+ });
+
+ /* 4. Evict cold pages (vCPU faults block on guest_mem) */
+ for each cold range:
+ /* Read from io_mem -- bypasses RWP, no fault. */
+ pwrite(storage_fd, (char *)io_mem + cold_offset,
+ len, cold_offset);
+ /* Drop the page from the shared file. */
+ fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ cold_offset, len);
+ /*
+ * Wake any vCPU blocked on the RWP fault for this range:
+ * fallocate() does not iterate ctx->fault_pending_wqh.
+ */
+ ioctl(uffd, UFFDIO_WAKE, &(struct uffdio_range){
+ .start = (uintptr_t)guest_mem + cold_offset,
+ .len = len });
+
+ /* 5. Resume async tracking */
+ ioctl(uffd, UFFDIO_SET_MODE,
+ &(struct uffdio_set_mode){
+ .enable = UFFD_FEATURE_RWP_ASYNC });
+ }
+
+During step 4, a vCPU that accesses ``guest_mem + cold_offset`` blocks
+with a ``UFFD_PAGEFAULT_FLAG_RWP`` fault while the eviction is in
+progress. After ``fallocate()`` punches the page out and ``UFFDIO_WAKE``
+fires, the vCPU retries the access, faults as ``MISSING``, and the
+handler resolves it with ``UFFDIO_COPY`` from storage.
+
+This workflow targets shmem and hugetlbfs (both support a private
+``io_mem`` mapping over the same fd). Anonymous-memory backings need a
+different inner-loop strategy because the VMM has no way to read the
+page without going through the RWP-protected mapping.
+
QEMU/KVM
========
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API
2026-05-29 17:26 ` [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API Kiryl Shutsemau (Meta)
@ 2026-06-02 10:07 ` Mike Rapoport
2026-06-03 12:54 ` Lorenzo Stoakes
1 sibling, 0 replies; 21+ messages in thread
From: Mike Rapoport @ 2026-06-02 10:07 UTC (permalink / raw)
To: Kiryl Shutsemau (Meta)
Cc: akpm, peterx, david, ljs, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team
On Fri, May 29, 2026 at 06:26:33PM +0100, Kiryl Shutsemau (Meta) wrote:
> The uffd VMA-flag helpers read vma->vm_flags directly. Now that
> config-gated per-mode masks exist, switch them to the vma_flags_t
> accessor vma_test_any_mask(), which is the going-forward API and keeps a
> single place (the VMA_UFFD_* masks) that knows which modes are available
> on the current build.
>
> No functional change: vma_flags_t is in union with vm_flags, so the same
> bits are read, and the masks fold to the same code the open-coded
> vm_flags tests produced -- verified identical on gcc and clang, 32- and
> 64-bit.
>
> Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Assisted-by: Claude:claude-opus-4-8
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/userfaultfd_k.h | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 658740df2978..c4f2cc6dfcf0 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -178,7 +178,8 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> */
> static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> + return vma_test_any_mask(vma,
> + mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
> }
>
> /*
> @@ -190,22 +191,23 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> */
> static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> + return vma_test_any_mask(vma,
> + mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
> }
>
> static inline bool userfaultfd_missing(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & VM_UFFD_MISSING;
> + return vma_test_any_mask(vma, VMA_UFFD_MISSING);
> }
>
> static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & VM_UFFD_WP;
> + return vma_test_any_mask(vma, VMA_UFFD_WP);
> }
>
> static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & VM_UFFD_MINOR;
> + return vma_test_any_mask(vma, VMA_UFFD_MINOR);
> }
>
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> @@ -222,7 +224,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
>
> static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & __VM_UFFD_FLAGS;
> + return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
> }
>
> static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
> --
> 2.54.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v6 14/15] selftests/mm: add userfaultfd RWP tests
2026-05-29 17:26 ` [PATCH v6 14/15] selftests/mm: add userfaultfd RWP tests Kiryl Shutsemau (Meta)
@ 2026-06-02 22:18 ` Askar Safin
0 siblings, 0 replies; 21+ messages in thread
From: Askar Safin @ 2026-06-02 22:18 UTC (permalink / raw)
To: kas
Cc: Liam.Howlett, aarcange, akpm, corbet, david, jthoughton,
kernel-team, kvm, linux-doc, linux-kernel, linux-kselftest,
linux-mm, ljs, pbonzini, peterx, rppt, seanjc, sj, skhan, surenb,
usama.arif, vbabka, ziy
"Kiryl Shutsemau (Meta)" <kas@kernel.org>:
> +/*
> + * Test that GUP resolves through protnone PTEs (async mode).
> + * vmsplice() into a pipe pins user pages via get_user_pages_fast() --
> + * unlike write(), which goes through copy_from_user() and ordinary
> + * hardware page faults -- so it exercises gup_can_follow_protnone() on
> + * the RW-protected PTE. In async mode the kernel auto-restores
> + * permissions and GUP returns the page.
> + */
Note that I recently submitted a patch, which makes vmsplice equivalent to
preadv2/pwritev2, and it was accepted to next.
For now it is just an experiment, it is possible it will be reverted.
https://lore.kernel.org/all/20260601-aufweichen-dissens-ausrechnen-0d9b84728113@brauner/
--
Askar Safin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v6 05/15] mm: add VM_UFFD_RWP VMA flag
2026-05-29 17:26 ` [PATCH v6 05/15] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau (Meta)
@ 2026-06-03 12:52 ` Lorenzo Stoakes
0 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2026-06-03 12:52 UTC (permalink / raw)
To: Kiryl Shutsemau (Meta)
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team
On Fri, May 29, 2026 at 06:26:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> Preparatory patch for userfaultfd read-write protection (RWP). RWP
> extends userfaultfd protection from plain write-protection (WP) to
> full read-write protection: accesses to an RWP-protected range --
> reads as well as writes -- trap through userfaultfd.
>
> Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and
> userfaultfd_protected() helpers, and wire up the smaps "ur" entry and
> the trace-flag table the rest of the series will use. The flag is
> gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the
> UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and
> every downstream check folds to dead code.
>
> Nothing sets or queries the flag yet.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Reviewed-by: SeongJae Park <sj@kernel.org>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> ---
> Documentation/filesystems/proc.rst | 1 +
> fs/proc/task_mmu.c | 3 +++
> include/linux/mm.h | 41 ++++++++++++++++++++----------
> include/linux/userfaultfd_k.h | 32 +++++++++++++++++++----
> include/trace/events/mmflags.h | 7 +++++
> 5 files changed, 65 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167befb7b..db28207c5290 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -607,6 +607,7 @@ encoded manner. The codes are the following:
> um userfaultfd missing tracking
> uw userfaultfd wr-protect tracking
> ui userfaultfd minor fault
> + ur userfaultfd read-write-protect tracking
> ss shadow/guarded control stack page
> sl sealed
> lf lock on fault pages
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 939657aa334a..ca0f69b347e8 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1237,6 +1237,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> [ilog2(VM_UFFD_MINOR)] = "ui",
> #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> +#ifdef CONFIG_USERFAULTFD_RWP
> + [ilog2(VM_UFFD_RWP)] = "ur",
> +#endif
> #ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> [ilog2(VM_SHADOW_STACK)] = "ss",
> #endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 485df9c2dbdd..5ac31fbadeef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -353,6 +353,7 @@ enum {
> #endif
> DECLARE_VMA_BIT(UFFD_MINOR, 41),
> DECLARE_VMA_BIT(SEALED, 42),
> + DECLARE_VMA_BIT(UFFD_RWP, 43),
> /* Flags that reuse flags above. */
> DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0),
> DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1),
> @@ -496,12 +497,17 @@ enum {
> #else
> #define VM_UFFD_MINOR VM_NONE
> #endif
> +#ifdef CONFIG_USERFAULTFD_RWP
> +#define VM_UFFD_RWP INIT_VM_FLAG(UFFD_RWP)
> +#else
> +#define VM_UFFD_RWP VM_NONE
> +#endif
>
> /*
> - * vma_flags_t masks for the userfaultfd VMA flags. VMA_UFFD_MINOR is gated on
> - * the same config as VM_UFFD_MINOR -- which implies 64BIT, where the bit fits
> - * -- so an out-of-range bit is never fed to mk_vma_flags() on a build whose
> - * bitmap cannot hold it.
> + * vma_flags_t masks for the userfaultfd VMA flags. The two high-bit modes are
> + * gated on the same configs as their VM_* flags above -- both of which imply
> + * 64BIT -- so an out-of-range bit is never fed to mk_vma_flags() on a build
> + * whose bitmap cannot hold it.
> */
> #define VMA_UFFD_MISSING mk_vma_flags(VMA_UFFD_MISSING_BIT)
> #define VMA_UFFD_WP mk_vma_flags(VMA_UFFD_WP_BIT)
> @@ -510,6 +516,11 @@ enum {
> #else
> #define VMA_UFFD_MINOR EMPTY_VMA_FLAGS
> #endif
> +#ifdef CONFIG_USERFAULTFD_RWP
> +#define VMA_UFFD_RWP mk_vma_flags(VMA_UFFD_RWP_BIT)
> +#else
> +#define VMA_UFFD_RWP EMPTY_VMA_FLAGS
> +#endif
>
> #ifdef CONFIG_64BIT
> #define VM_ALLOW_ANY_UNCACHED INIT_VM_FLAG(ALLOW_ANY_UNCACHED)
> @@ -648,22 +659,24 @@ enum {
> * reconsistuted upon page fault, so necessitate page table copying upon fork.
> *
> * Note that these flags should be compared with the DESTINATION VMA not the
> - * source, as VM_UFFD_WP may not be propagated to destination, while all other
> - * flags will be.
> + * source: VM_UFFD_WP and VM_UFFD_RWP may be cleared on the destination
> + * (dup_userfaultfd() -> userfaultfd_reset_ctx() when the parent context did
> + * not negotiate UFFD_FEATURE_EVENT_FORK), while all other flags propagate.
> *
> * VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot be
> * reasonably reconstructed on page fault.
> *
> * VM_UFFD_WP - Encodes metadata about an installed uffd
> - * write protect handler, which cannot be
> - * reconstructed on page fault.
> + * VM_UFFD_RWP write- or read-write-protect handler, which
> + * cannot be reconstructed on page fault.
> *
> - * We always copy pgtables when dst_vma has uffd-wp
> - * enabled even if it's file-backed
> - * (e.g. shmem). Because when uffd-wp is enabled,
> - * pgtable contains uffd-wp protection information,
> - * that's something we can't retrieve from page cache,
> - * and skip copying will lose those info.
> + * We always copy pgtables when dst_vma has the
> + * uffd PTE bit in use even if it's file-backed
> + * (e.g. shmem). Because when the uffd bit is
> + * in use, the pgtable contains the protection
> + * information, that's something we can't
> + * retrieve from page cache, and skip copying
> + * will lose those info.
> *
> * VM_MAYBE_GUARD - Could contain page guard region markers which
> * by design are a property of the page tables
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index c4f2cc6dfcf0..f3b2db27989b 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -21,10 +21,11 @@
> #include <linux/hugetlb_inline.h>
>
> /* The set of all possible UFFD-related VM flags. */
> -#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
> +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \
> + VM_UFFD_WP | VM_UFFD_RWP)
>
> #define __VMA_UFFD_FLAGS mk_vma_flags_from_masks(VMA_UFFD_MISSING, VMA_UFFD_WP, \
> - VMA_UFFD_MINOR)
> + VMA_UFFD_MINOR, VMA_UFFD_RWP)
>
> /*
> * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> @@ -179,7 +180,8 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> {
> return vma_test_any_mask(vma,
> - mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
> + mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR,
> + VMA_UFFD_RWP));
Wonder if a vma_test_any_masks() is worth adding now :) [will do separately though]
> }
>
> /*
> @@ -210,6 +212,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> return vma_test_any_mask(vma, VMA_UFFD_MINOR);
> }
>
> +static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
> +{
> + return vma_test_any_mask(vma, VMA_UFFD_RWP);
NIT: You could use vma_flags_test_single_mask() just to make it clear it's a single
flag. It's not a big deal though really.
> +}
> +
> +static inline bool userfaultfd_protected(struct vm_area_struct *vma)
> +{
> + return userfaultfd_wp(vma) || userfaultfd_rwp(vma);
> +}
> +
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
> {
> @@ -330,6 +342,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> return false;
> }
>
> +static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
> +{
> + return false;
> +}
> +
> +static inline bool userfaultfd_protected(struct vm_area_struct *vma)
> +{
> + return false;
> +}
> +
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
> {
> @@ -423,8 +445,8 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
> }
>
> /*
> - * Returns true if this is a swap pte and was uffd-wp wr-protected in either
> - * forms (pte marker or a normal swap pte), false otherwise.
> + * Returns true if this swap pte carries uffd-tracked state in either
> + * form (pte marker or a normal swap pte), false otherwise.
> */
> static inline bool pte_swp_uffd_any(pte_t pte)
> {
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index a6e5a44c9b42..bfface3d0203 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -194,6 +194,12 @@ IF_HAVE_PG_ARCH_3(arch_3)
> # define IF_HAVE_UFFD_MINOR(flag, name)
> #endif
>
> +#ifdef CONFIG_USERFAULTFD_RWP
> +# define IF_HAVE_UFFD_RWP(flag, name) {flag, name},
> +#else
> +# define IF_HAVE_UFFD_RWP(flag, name)
> +#endif
> +
> #if defined(CONFIG_64BIT) || defined(CONFIG_PPC32)
> # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
> #else
> @@ -215,6 +221,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \
> {VM_PFNMAP, "pfnmap" }, \
> {VM_MAYBE_GUARD, "maybe_guard" }, \
> {VM_UFFD_WP, "uffd_wp" }, \
> +IF_HAVE_UFFD_RWP(VM_UFFD_RWP, "uffd_rwp" ) \
> {VM_LOCKED, "locked" }, \
> {VM_IO, "io" }, \
> {VM_SEQ_READ, "seqread" }, \
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API
2026-05-29 17:26 ` [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API Kiryl Shutsemau (Meta)
2026-06-02 10:07 ` Mike Rapoport
@ 2026-06-03 12:54 ` Lorenzo Stoakes
1 sibling, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2026-06-03 12:54 UTC (permalink / raw)
To: Kiryl Shutsemau (Meta)
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team
On Fri, May 29, 2026 at 06:26:33PM +0100, Kiryl Shutsemau (Meta) wrote:
> The uffd VMA-flag helpers read vma->vm_flags directly. Now that
> config-gated per-mode masks exist, switch them to the vma_flags_t
> accessor vma_test_any_mask(), which is the going-forward API and keeps a
> single place (the VMA_UFFD_* masks) that knows which modes are available
> on the current build.
>
> No functional change: vma_flags_t is in union with vm_flags, so the same
> bits are read, and the masks fold to the same code the open-coded
> vm_flags tests produced -- verified identical on gcc and clang, 32- and
> 64-bit.
>
> Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Thanks, LGTM so:
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> Assisted-by: Claude:claude-opus-4-8
> ---
> include/linux/userfaultfd_k.h | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 658740df2978..c4f2cc6dfcf0 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -178,7 +178,8 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> */
> static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> + return vma_test_any_mask(vma,
> + mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
> }
>
> /*
> @@ -190,22 +191,23 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> */
> static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> + return vma_test_any_mask(vma,
> + mk_vma_flags_from_masks(VMA_UFFD_WP, VMA_UFFD_MINOR));
> }
>
> static inline bool userfaultfd_missing(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & VM_UFFD_MISSING;
> + return vma_test_any_mask(vma, VMA_UFFD_MISSING);
> }
>
> static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & VM_UFFD_WP;
> + return vma_test_any_mask(vma, VMA_UFFD_WP);
> }
>
> static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & VM_UFFD_MINOR;
> + return vma_test_any_mask(vma, VMA_UFFD_MINOR);
> }
>
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> @@ -222,7 +224,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
>
> static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & __VM_UFFD_FLAGS;
> + return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
> }
>
> static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v6 08/15] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
2026-05-29 17:26 ` [PATCH v6 08/15] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Kiryl Shutsemau (Meta)
@ 2026-06-03 12:57 ` Lorenzo Stoakes
0 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2026-06-03 12:57 UTC (permalink / raw)
To: Kiryl Shutsemau (Meta)
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team
On Fri, May 29, 2026 at 06:26:37PM +0100, Kiryl Shutsemau (Meta) wrote:
> Three mm paths outside the fault handler gate on the uffd PTE bit
> today: khugepaged (skip collapse on ranges carrying markers), rmap
> (cap unmap batching), and GUP (force a fault through
> gup_can_follow_protnone). Extend each to treat VM_UFFD_RWP the same
> as VM_UFFD_WP; otherwise per-PTE RWP state is silently destroyed or
> bypassed.
>
> khugepaged: try_collapse_pte_mapped_thp() and
> file_backed_vma_is_retractable() already refuse to collapse or
> retract page tables on ranges carrying the uffd PTE bit. Broaden the
> VMA predicate from userfaultfd_wp() to userfaultfd_protected() so
> VM_UFFD_RWP ranges get the same protection. hpage_collapse_scan_pmd()
> needs no change — its existing pte_uffd() check already catches an
> RWP PTE because it carries the uffd bit.
>
> rmap: folio_unmap_pte_batch() caps batching at 1 for VM_UFFD_RWP so
> the restore path handles each PTE with its own marker.
>
> GUP: gup_can_follow_protnone() forces a fault on VM_UFFD_RWP VMAs
> regardless of FOLL_HONOR_NUMA_FAULT. RWP uses protnone as an
> access-tracking marker, not for NUMA hinting, so any GUP — read or
> write — must go through the userfaultfd fault path.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Assisted-by: Claude:claude-opus-4-6
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Nit below but LGTM, so:
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> ---
> include/linux/mm.h | 16 +++++++++++++++-
> mm/khugepaged.c | 18 +++++++++++-------
> mm/rmap.c | 2 +-
> 3 files changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3d4d5f9a6f1b..2b04f690b516 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4644,11 +4644,25 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
>
> /*
> * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
> - * a (NUMA hinting) fault is required.
> + * a (NUMA hinting or userfaultfd RWP) fault is required.
> */
> static inline bool gup_can_follow_protnone(const struct vm_area_struct *vma,
> unsigned int flags)
> {
> + /*
> + * VM_UFFD_RWP uses protnone as an access-tracking marker, not for
> + * NUMA hinting. GUP must always take a fault so the access is
> + * delivered to userfaultfd, regardless of FOLL_HONOR_NUMA_FAULT.
> + *
> + * Only do so while the VMA is accessible. If it has been made
> + * inaccessible (e.g. mprotect(PROT_NONE)), fall through to the guard
> + * below: forcing a fault there would loop, as handle_mm_fault() makes
> + * no progress on protnone in an inaccessible VMA, and the access is
> + * denied regardless of RWP anyway.
> + */
> + if ((vma->vm_flags & VM_UFFD_RWP) && vma_is_accessible(vma))
> + return false;
Can be:
if (vma_test_single_mask(vma, VMA_UFFD_RWP) && vma_is_accessible(vma))
return false;
> +
> /*
> * If callers don't want to honor NUMA hinting faults, no need to
> * determine if we would actually have to trigger a NUMA hinting fault.
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index afa218be15de..4f3fedcd75cf 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1895,8 +1895,11 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
> if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> return SCAN_VMA_CHECK;
>
> - /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> - if (userfaultfd_wp(vma))
> + /*
> + * Keep pmd pgtable while the uffd bit is in use; see comment in
> + * retract_page_tables().
> + */
> + if (userfaultfd_protected(vma))
> return SCAN_PTE_UFFD;
>
> folio = filemap_lock_folio(vma->vm_file->f_mapping,
> @@ -2109,13 +2112,14 @@ static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
> return false;
>
> /*
> - * When a vma is registered with uffd-wp, we cannot recycle
> + * When a vma is registered with uffd-wp or RWP, we cannot recycle
> * the page table because there may be pte markers installed.
> - * Other vmas can still have the same file mapped hugely, but
> - * skip this one: it will always be mapped in small page size
> - * for uffd-wp registered ranges.
> + * VM_UFFD_RWP ranges similarly rely on per-PTE uffd state
> + * and cannot be recycled to a shared PMD. Other vmas can still
> + * have the same file mapped hugely, but skip this one: it will
> + * always be mapped in small page size for these registrations.
> */
> - if (userfaultfd_wp(vma))
> + if (userfaultfd_protected(vma))
> return false;
>
> /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 546bc1cf9391..9fb733489898 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1965,7 +1965,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> if (pte_unused(pte))
> return 1;
>
> - if (userfaultfd_wp(vma))
> + if (userfaultfd_protected(vma))
> return 1;
>
> /*
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2026-06-03 12:57 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-29 17:26 [PATCH v6 00/15] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 01/15] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 02/15] mm: rename uffd-wp PTE bit macros to uffd Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 03/15] mm: rename uffd-wp PTE accessors " Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 04/15] userfaultfd: test uffd VMA flags through the vma_flags_t API Kiryl Shutsemau (Meta)
2026-06-02 10:07 ` Mike Rapoport
2026-06-03 12:54 ` Lorenzo Stoakes
2026-05-29 17:26 ` [PATCH v6 05/15] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau (Meta)
2026-06-03 12:52 ` Lorenzo Stoakes
2026-05-29 17:26 ` [PATCH v6 06/15] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 07/15] mm: preserve RWP marker across PTE rewrites Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 08/15] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Kiryl Shutsemau (Meta)
2026-06-03 12:57 ` Lorenzo Stoakes
2026-05-29 17:26 ` [PATCH v6 09/15] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 10/15] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 11/15] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 12/15] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 13/15] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle Kiryl Shutsemau (Meta)
2026-05-29 17:26 ` [PATCH v6 14/15] selftests/mm: add userfaultfd RWP tests Kiryl Shutsemau (Meta)
2026-06-02 22:18 ` Askar Safin
2026-05-29 17:26 ` [PATCH v6 15/15] Documentation/userfaultfd: document RWP working set tracking Kiryl Shutsemau (Meta)
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.