* [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory
@ 2026-05-26 13:04 Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 01/18] fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race Kiryl Shutsemau
` (17 more replies)
0 siblings, 18 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
This series adds userfaultfd support for tracking the working set of
VM guest memory, so a VMM can identify cold pages and evict them to
tiered or remote storage.
v1: https://lore.kernel.org/all/20260427114607.4068647-1-kas@kernel.org/
v2: https://lore.kernel.org/all/cover.1778254670.git.kas@kernel.org/
v3: https://lore.kernel.org/all/20260522133857.552279-1-kirill@shutemov.name/
v4: https://lore.kernel.org/all/20260525113737.1942478-1-kas@kernel.org/
== Changes since v4 ==
v5 mainly addresses Sashiko AI review feedback on v4.
- Patches 1-4 are new: fixes for pre-existing bugs surfaced by that
review (each carries Fixes:/Cc: stable@).
- 05/18: LoongArch select of ARCH_HAS_PTE_PROTNONE gated on 64BIT.
- 10/18, 13/18: gate RWP disarm/rebuild paths on pte_uffd() so
NUMA-balancing PROT_NONE markers survive.
- 12/18: reject UFFDIO_REGISTER_MODE_RWP on PROT_NONE VMAs.
- 14/18: PM_SCAN_WP_MATCHING on a VM_UFFD_RWP VMA silently skips
instead of -EINVAL, preserving the atomic read-and-reset.
- 16/18: UFFDIO_SET_MODE feature check goes through
userfaultfd_features() for a KCSAN-clean read.
- 17/18: drop _UFFDIO_SET_MODE from the baseline UFFDIO_API check
so the test still passes on older kernels.
- 18/18: VMM example switches to sync before PAGEMAP_SCAN; names
uffdio_api so callers can read back negotiated features; fixes
a couple of stray identifiers in the eviction loop.
113/113 of tools/testing/selftests/mm/uffd-unit-tests pass (46 new
RWP cases + existing UFFD groups, no regressions from patches 1-4).
== Problem ==
A VMM managing guest memory needs to:
1. detect which pages are still being touched (working-set
tracking);
2. safely evict cold pages to slower tiered or remote storage;
3. fetch them back on demand when accessed again.
== Approach ==
UFFDIO_REGISTER_MODE_RWP is a new userfaultfd registration mode, in
parallel with the existing MODE_MISSING / MODE_WP / MODE_MINOR. It
uses the same mechanism on every backing -- anon, shmem, hugetlbfs:
- PAGE_NONE on the PTE (the same primitive NUMA balancing uses)
makes the page inaccessible while keeping it resident;
- the uffd PTE bit (the one MODE_WP already owns) marks the entry
as "userfaultfd-tracked" so the protnone fault path can tell an
RWP fault apart from an mprotect(PROT_NONE) or NUMA hinting
fault.
VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA, so the
same PTE bit safely carries both meanings depending on the
registered VMA flag.
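The disambiguation described above can be sketched as a small userspace model. This is only an illustration, not code from the series: toy_vma, toy_pte, fault_kind and classify_protnone_fault() are hypothetical names, and the model assumes the PTE has already been found to be present and PAGE_NONE.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these names are in the series. */
enum fault_kind { FAULT_MPROTECT_NONE, FAULT_NUMA_HINT, FAULT_UFFD_RWP };

struct toy_vma {
	bool accessible;	/* false models an mprotect(PROT_NONE) VMA */
	bool uffd_rwp;		/* models VM_UFFD_RWP on the VMA */
};

struct toy_pte {
	bool uffd;		/* models the shared uffd PTE bit */
};

/*
 * Classify a fault on a protnone PTE. Precondition (assumed, not checked):
 * the PTE is present and carries PAGE_NONE protection.
 */
static enum fault_kind classify_protnone_fault(struct toy_vma vma,
					       struct toy_pte pte)
{
	if (!vma.accessible)
		return FAULT_MPROTECT_NONE;	/* plain PROT_NONE mapping */
	if (vma.uffd_rwp && pte.uffd)
		return FAULT_UFFD_RWP;		/* armed by RWP tracking */
	return FAULT_NUMA_HINT;			/* protnone without the marker */
}
```

In this model a protnone PTE without the uffd bit on an RWP VMA still classifies as a NUMA hinting fault, which matches the changelog note that NUMA-balancing PROT_NONE markers survive the RWP paths.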
In sync mode, the kernel delivers a UFFD_PAGEFAULT_FLAG_RWP message
to the registered handler, and the handler resolves the fault with
UFFDIO_RWPROTECT clearing MODE_RWP. In async mode
(UFFD_FEATURE_RWP_ASYNC), the fault is auto-resolved in-place: the
kernel restores the original PTE permissions and the faulting thread
continues without a userfaultfd message ever being delivered.
Userspace then learns which pages were touched by reading
PAGE_IS_ACCESSED out of PAGEMAP_SCAN -- pages whose uffd bit is
still set were not re-accessed since the last RWP cycle.
UFFDIO_RWPROTECT is the protect/unprotect ioctl, mirroring
UFFDIO_WRITEPROTECT.
UFFDIO_SET_MODE flips RWP_ASYNC <-> sync at runtime under
mmap_write_lock() + vma_start_write(), so a VMM can run in async
mode for detection and switch to sync for race-free eviction without
re-registering the userfaultfd.
== Typical VMM workflow ==
/* arm */
UFFDIO_API(features = RWP | RWP_ASYNC)
UFFDIO_REGISTER(MODE_RWP)
/* detection cycle (async) */
UFFDIO_RWPROTECT(range, RWP)
sleep(interval)
/* freeze the cold snapshot before scanning */
UFFDIO_SET_MODE(disable = RWP_ASYNC) /* sync */
PAGEMAP_SCAN(!PAGE_IS_ACCESSED) -> cold pages
/* eviction (sync mode traps races) */
pwrite(cold) + fallocate(FALLOC_FL_PUNCH_HOLE, cold)
UFFDIO_WAKE(cold)
UFFDIO_SET_MODE(enable = RWP_ASYNC) /* resume */
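The new ioctls are not in released UAPI headers, so the workflow above cannot be shown against real calls here. As a rough stand-in, the following userspace model tracks only the per-page "armed" state. All names (toy_range, toy_rwprotect(), toy_touch(), toy_scan_cold()) are hypothetical and the model assumes a fixed range of eight pages.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

struct toy_range {
	bool armed[TOY_NPAGES];	/* true models PAGE_NONE + uffd bit still set */
	bool async;		/* models RWP_ASYNC being enabled */
};

/* Models UFFDIO_RWPROTECT(range, RWP): arm every page in the range. */
static void toy_rwprotect(struct toy_range *r)
{
	for (size_t i = 0; i < TOY_NPAGES; i++)
		r->armed[i] = true;
}

/*
 * Models a guest access. Returns true if the access proceeds at once,
 * false if it would block until a handler resolves the fault.
 */
static bool toy_touch(struct toy_range *r, size_t page)
{
	if (!r->armed[page])
		return true;
	if (r->async) {
		r->armed[page] = false;	/* auto-resolved in place */
		return true;
	}
	return false;			/* sync mode: the access is trapped */
}

/* Models PAGEMAP_SCAN(!PAGE_IS_ACCESSED): collect the still-armed pages. */
static size_t toy_scan_cold(const struct toy_range *r, size_t *cold)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NPAGES; i++)
		if (r->armed[i])
			cold[n++] = i;
	return n;
}
```

A caller would arm the range, let accesses run with async set, then clear async before scanning so that a late access to a cold page blocks instead of silently invalidating the snapshot.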
== Series layout ==
Patches 1 to 4 are independent fixes for pre-existing bugs
(Fixes:/Cc: stable@) in paths the RWP code shares -- they can be
picked separately if needed:
1: fs/proc/task_mmu: make_uffd_wp_huge_pte() prot-update race
-- missing huge_ptep_modify_prot_start() can lose hardware
Dirty/Accessed updates.
2: mm/huge_memory: change_non_present_huge_pmd() drops
pmd_swp_uffd_wp on the writable -> readable device-private
PMD rewrite; plain mprotect() silently strips the marker.
3: userfaultfd: must_wait() applies pte_write() to a locklessly
read PTE without checking pte_present() -- swap/migration
entries decode random offset bits and the thread can stay
parked on a stale fault.
4: mm: mk_vma_flags() OOBs into the first word of vma_flags_t on
32-bit when called with a bit >= BITS_PER_LONG. Harmless by
coincidence today (the wraparound lands on a bit that's
already in the mask), but any future high-numbered bit would
silently corrupt the result. Add VMA_NO_BIT, resolve out-of-range
bits to it in DECLARE_VMA_BIT(), and skip negative bits in
vma_flags_set_flag().
Patches 5 to 7 are preparatory:
5: decouple protnone helpers from CONFIG_NUMA_BALANCING.
6-7: rename _PAGE_BIT_UFFD_WP, pte_uffd_wp() and friends to drop
the _WP suffix, since the bit now carries WP and RWP meaning
depending on the VMA flag. The SCAN_PTE_UFFD enum's ftrace
output string is intentionally kept as "pte_uffd_wp" so
trace-based tooling does not silently break.
Patches 8 to 11 add the in-kernel mechanism:
8: VM_UFFD_RWP VMA flag (aliased to VM_NONE until 12/18 introduces
CONFIG_USERFAULTFD_RWP together with the UAPI).
9: MM_CP_UFFD_RWP change_protection() primitive (PAGE_NONE +
uffd bit, plus a RESOLVE counterpart).
10: marker preservation across swap, device-exclusive, migration,
fork, mremap, UFFDIO_MOVE, hugetlb copy, and mprotect().
11: handle VM_UFFD_RWP in khugepaged, rmap, and GUP.
Patches 12 to 16 wire the userspace surface:
12: UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
(introduces CONFIG_USERFAULTFD_RWP).
13: RWP fault delivery and exposure of UFFDIO_REGISTER_MODE_RWP.
14: PAGE_IS_ACCESSED in PAGEMAP_SCAN.
15: UFFD_FEATURE_RWP_ASYNC for async fault resolution.
16: UFFDIO_SET_MODE for runtime sync/async toggle.
Patches 17 and 18 are kernel tests and Documentation/. Matching
userfaultfd(2) and ioctl_userfaultfd(2) man-page updates will be
sent as a separate patchset against the kernel.org linux-man tree.
Kiryl Shutsemau (Meta) (18):
fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race
mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD
downgrade
userfaultfd: gate must_wait writability check on pte_present()
mm: skip out-of-range bits in mk_vma_flags()
mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
mm: rename uffd-wp PTE bit macros to uffd
mm: rename uffd-wp PTE accessors to uffd
mm: add VM_UFFD_RWP VMA flag
mm: add MM_CP_UFFD_RWP change_protection() flag
mm: preserve RWP marker across PTE rewrites
mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT
plumbing
mm/userfaultfd: add RWP fault delivery and expose
UFFDIO_REGISTER_MODE_RWP
mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
selftests/mm: add userfaultfd RWP tests
Documentation/userfaultfd: document RWP working set tracking
Documentation/admin-guide/mm/pagemap.rst | 13 +-
Documentation/admin-guide/mm/userfaultfd.rst | 248 +++++-
Documentation/filesystems/proc.rst | 1 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable-prot.h | 8 +-
arch/arm64/include/asm/pgtable.h | 47 +-
arch/loongarch/Kconfig | 1 +
arch/loongarch/include/asm/pgtable.h | 4 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 8 +-
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/pgtable-bits.h | 12 +-
arch/riscv/include/asm/pgtable.h | 59 +-
arch/s390/Kconfig | 1 +
arch/s390/include/asm/hugetlb.h | 12 +-
arch/s390/include/asm/pgtable.h | 4 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 56 +-
arch/x86/include/asm/pgtable_types.h | 16 +-
fs/proc/task_mmu.c | 120 ++-
include/asm-generic/hugetlb.h | 18 +-
include/asm-generic/pgtable_uffd.h | 32 +-
include/linux/huge_mm.h | 7 +
include/linux/leafops.h | 4 +-
include/linux/mm.h | 61 +-
include/linux/mm_inline.h | 4 +-
include/linux/pgtable.h | 32 +-
include/linux/swapops.h | 4 +-
include/linux/userfaultfd_k.h | 76 +-
include/trace/events/huge_memory.h | 2 +-
include/trace/events/mmflags.h | 7 +
include/uapi/linux/fs.h | 1 +
include/uapi/linux/userfaultfd.h | 54 +-
init/Kconfig | 8 +
mm/Kconfig | 9 +
mm/debug_vm_pgtable.c | 4 +-
mm/huge_memory.c | 157 +++-
mm/hugetlb.c | 158 +++-
mm/internal.h | 4 +-
mm/khugepaged.c | 40 +-
mm/memory.c | 133 +++-
mm/migrate.c | 20 +-
mm/migrate_device.c | 8 +-
mm/mprotect.c | 68 +-
mm/mremap.c | 17 +-
mm/page_table_check.c | 8 +-
mm/rmap.c | 18 +-
mm/swapfile.c | 9 +-
mm/userfaultfd.c | 407 +++++++++-
tools/include/uapi/linux/fs.h | 1 +
tools/testing/selftests/mm/uffd-unit-tests.c | 765 +++++++++++++++++++
51 files changed, 2321 insertions(+), 429 deletions(-)
base-commit: 449a5df98f8dffa9b037e3b6838fc5af327df072
--
2.54.0
* [PATCH v5 01/18] fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 02/18] mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade Kiryl Shutsemau
` (16 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta), stable, Sashiko AI review
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
make_uffd_wp_huge_pte() arms the UFFD_WP bit on a present HugeTLB PTE by
calling huge_ptep_modify_prot_commit() with a ptent snapshot that was
fetched without the corresponding huge_ptep_modify_prot_start(). The
start helper is what atomically clears the entry so the kernel-owned
snapshot stays consistent until the commit; without it, the hardware
may set Dirty or Accessed in the live PTE between the original read
and the commit, and huge_ptep_modify_prot_commit() (whose generic
implementation just calls set_huge_pte_at()) then writes the stale
snapshot back over the live hardware bits, losing the update.
The non-hugetlb sibling make_uffd_wp_pte() does this correctly via
ptep_modify_prot_start() / ptep_modify_prot_commit(). Mirror that
pattern for the present-PTE branch. The migration case stays as-is --
migration entries are non-present, so there's no hardware update to
race against.
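The lost update can be illustrated with a single-threaded userspace model. This is a sketch under simplifying assumptions, not kernel code: the toy_* names and bit values are hypothetical, the "hardware" is a function called at the racy point, and it is assumed to set Dirty only on a present entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PRESENT	0x1u
#define TOY_DIRTY	0x2u
#define TOY_UFFD	0x4u

/* Models the hardware walker: Dirty can only land on a present entry. */
static bool toy_hw_write(uint32_t *pte)
{
	if (!(*pte & TOY_PRESENT))
		return false;	/* access faults and is retried later */
	*pte |= TOY_DIRTY;
	return true;
}

/* Old pattern: commit a value built from an unguarded earlier read. */
static bool toy_arm_stale(uint32_t *pte)
{
	uint32_t snap = *pte;			/* snapshot, entry stays live */
	bool landed = toy_hw_write(pte);	/* hardware sets Dirty */

	*pte = snap | TOY_UFFD;			/* stale snapshot written back */
	return landed && !(*pte & TOY_DIRTY);	/* true: update was lost */
}

/* Fixed pattern: "start" clears the entry before the window opens. */
static bool toy_arm_start_commit(uint32_t *pte)
{
	uint32_t old = *pte;			/* start: read ... */
	bool landed;

	*pte = 0;				/* ... and clear (atomic in the kernel) */
	landed = toy_hw_write(pte);		/* cannot land on a cleared entry */
	*pte = old | TOY_UFFD;			/* commit */
	return landed && !(*pte & TOY_DIRTY);	/* false: nothing was lost */
}
```

Both functions return whether a hardware Dirty update landed and then vanished. In this model only the stale-snapshot variant reports a loss.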
Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Cc: stable@vger.kernel.org
Reported-by: Sashiko AI review <sashiko-bot@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
fs/proc/task_mmu.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1e3a15bf46f4..e21a38ac745b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2610,12 +2610,16 @@ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
if (softleaf_is_hwpoison(entry) || softleaf_is_marker(entry))
return;
- if (softleaf_is_migration(entry))
+ if (softleaf_is_migration(entry)) {
set_huge_pte_at(vma->vm_mm, addr, ptep,
pte_swp_mkuffd_wp(ptent), psize);
- else
- huge_ptep_modify_prot_commit(vma, addr, ptep, ptent,
- huge_pte_mkuffd_wp(ptent));
+ } else {
+ pte_t old_pte, new_pte;
+
+ old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
+ new_pte = huge_pte_mkuffd_wp(old_pte);
+ huge_ptep_modify_prot_commit(vma, addr, ptep, old_pte, new_pte);
+ }
}
#endif /* CONFIG_HUGETLB_PAGE */
--
2.54.0
* [PATCH v5 02/18] mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 01/18] fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 03/18] userfaultfd: gate must_wait writability check on pte_present() Kiryl Shutsemau
` (15 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta), stable, Sashiko AI review
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
change_non_present_huge_pmd() rewrites a writable device-private PMD
swap entry into a readable one without carrying pmd_swp_uffd_wp()
across. The PTE-level change_softleaf_pte() does this correctly;
mirror that here, matching what copy_huge_pmd() does for the fork
path. Without the carry, a plain mprotect() over a UFFD_WP-marked
device-private THP strips the bit and the trap is bypassed on
swap-in.
Fixes: 368076f52ebe ("mm/huge_memory: add device-private THP support to PMD operations")
Cc: stable@vger.kernel.org
Reported-by: Sashiko AI review <sashiko-bot@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/huge_memory.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42b86e8ab7c0..b7c895b1d366 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2663,6 +2663,8 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
} else if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
newpmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_uffd_wp(*pmd))
+ newpmd = pmd_swp_mkuffd_wp(newpmd);
} else {
newpmd = *pmd;
}
--
2.54.0
* [PATCH v5 03/18] userfaultfd: gate must_wait writability check on pte_present()
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 01/18] fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 02/18] mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 04/18] mm: skip out-of-range bits in mk_vma_flags() Kiryl Shutsemau
` (14 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta), stable, Sashiko AI review
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
userfaultfd_must_wait() and userfaultfd_huge_must_wait() read the PTE
without taking the page table lock and then apply pte_write() /
huge_pte_write() to it. Those accessors decode bits from the present
encoding only; on a swap or migration entry they read the offset bits
that happen to share the same position and return an undefined result.
The intent of the check is "is this fault still WP-blocked?". A
non-marker swap entry means the page is in transit -- the userfault
context the original fault delivered against is no longer the same,
and the swap-in or migration completion path will re-deliver a fresh
fault if userspace still needs to handle it. In the worst case under
the current code, the garbage write bit says "wait", and the thread
stays asleep until a UFFDIO_WAKE that may never arrive.
Gate the writability check on pte_present() so the lockless re-check
only inspects present-PTE bits when the entry is actually present.
The non-present, non-marker case returns "don't wait" and lets the
fault path retry.
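A toy encoding shows why the ungated check is unreliable. This is an illustrative sketch only: the bit layout and toy_* helpers are hypothetical, the swap offset is assumed to overlap the write bit, and the none/marker checks that precede this test in the real functions are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PRESENT	0x1u
#define TOY_WRITE	0x2u
#define TOY_SWP_SHIFT	1	/* swap offset starts at bit 1, over TOY_WRITE */

/* Build a non-present swap-style entry; bit 0 stays clear. */
static uint32_t toy_mk_swap(uint32_t offset)
{
	return offset << TOY_SWP_SHIFT;
}

/* Old check: decode the write bit without asking whether it is one. */
static bool toy_must_wait_ungated(uint32_t pte)
{
	return !(pte & TOY_WRITE);
}

/* New check: a non-present entry means "don't wait, let the fault retry". */
static bool toy_must_wait_gated(uint32_t pte)
{
	if (!(pte & TOY_PRESENT))
		return false;
	return !(pte & TOY_WRITE);
}
```

With this layout, swap offsets 2 and 3 give opposite answers from the ungated check, while the gated check returns false for both and still waits on a present, write-protected entry.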
Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Cc: stable@vger.kernel.org
Reported-by: Sashiko AI review <sashiko-bot@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/userfaultfd.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 35b206cc9aa6..f6d2a1c67019 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2535,6 +2535,15 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
/* UFFD PTE markers require userspace to resolve the fault. */
if (pte_is_uffd_marker(pte))
return true;
+ /*
+ * Concurrent migration may have replaced the present PTE with a
+ * non-marker swap entry between fault delivery and this lockless
+ * re-check. huge_pte_write() on a swap entry decodes random offset
+ * bits, so gate it on pte_present(). The migration completion path
+ * will re-deliver the fault if it still needs userspace.
+ */
+ if (!pte_present(pte))
+ return false;
/*
* If VMA has UFFD WP faults enabled and WP fault, wait for userspace to
* resolve the fault.
@@ -2621,6 +2630,17 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
/* UFFD PTE markers require userspace to resolve the fault. */
if (pte_is_uffd_marker(ptent))
goto out;
+ /*
+ * Concurrent swap-out / migration may have replaced the present PTE
+ * with a non-marker swap entry between fault delivery and this
+ * lockless re-check. pte_write() on a swap entry decodes random
+ * offset bits, so gate it on pte_present(). The page-in path will
+ * re-deliver the fault if it still needs userspace.
+ */
+ if (!pte_present(ptent)) {
+ ret = false;
+ goto out;
+ }
/*
* If VMA has UFFD WP faults enabled and WP fault, wait for userspace to
* resolve the fault.
--
2.54.0
* [PATCH v5 04/18] mm: skip out-of-range bits in mk_vma_flags()
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (2 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 03/18] userfaultfd: gate must_wait writability check on pte_present() Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-29 14:00 ` Lorenzo Stoakes
2026-05-26 13:04 ` [PATCH v5 05/18] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau
` (13 subsequent siblings)
17 siblings, 1 reply; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta), stable
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
vma_flags_t is one unsigned long on 32-bit -- NUM_VMA_FLAG_BITS ==
BITS_PER_LONG by design, so VM_xxx-declared bits sit in the first
word and hit the single-long fast path. But the bit enum declares
some bits unconditionally above BITS_PER_LONG (VMA_UFFD_MINOR_BIT
== 41 today, with VM_UFFD_MINOR == VM_NONE on 32-bit so no VMA
actually carries the bit).
Passing such a bit to mk_vma_flags() goes through __set_bit(41,
&one_long) and writes one word past the end. The compiler folds
the OOB store with wraparound (1UL << (41 % 32) == bit 9) into
the first word. Bit 9 is already in __VMA_UFFD_FLAGS so the mask
happens to come out right today, but any high-numbered bit whose
mod-BITS_PER_LONG position is otherwise unused would silently OR
an extra bit into the mask.
Add VMA_NO_BIT and have DECLARE_VMA_BIT() resolve any bitnum out
of range to it. vma_flags_set_flag() drops negative bit values.
The ternary collapses at compile time, the runtime check folds
away when the bit is in range, and the common path is unchanged.
Bits declared in the enum are now safe to pass to mk_vma_flags()
regardless of arch.
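The wraparound arithmetic and the VMA_NO_BIT idea can be modelled in plain C. This is a sketch, not the kernel macros: the TOY_* names are hypothetical, a uint32_t stands in for the single 32-bit word, and toy_set_wrapping() reproduces only the modulo effect the message describes.

```c
#include <stdint.h>

#define TOY_NUM_BITS	32	/* models NUM_VMA_FLAG_BITS on 32-bit */
#define TOY_NO_BIT	(-1)

/* Same shape as the new DECLARE_VMA_BIT(): out of range means "no bit". */
#define TOY_DECLARE_BIT(bitnum) \
	(((bitnum) < TOY_NUM_BITS) ? (bitnum) : TOY_NO_BIT)

enum {
	TOY_LOW_BIT  = TOY_DECLARE_BIT(9),
	TOY_HIGH_BIT = TOY_DECLARE_BIT(41),	/* folds to TOY_NO_BIT */
};

/* Old effect as described: the bit number wraps modulo the word size. */
static uint32_t toy_set_wrapping(uint32_t mask, int bit)
{
	return mask | (UINT32_C(1) << (bit % TOY_NUM_BITS));
}

/* New behaviour: negative ("no bit") values are dropped. */
static uint32_t toy_set_checked(uint32_t mask, int bit)
{
	if (bit < 0)
		return mask;
	return mask | (UINT32_C(1) << bit);
}
```

Here bit 41 wraps to bit 9 in the old model, whereas the declared constant resolves to TOY_NO_BIT at compile time and the checked setter leaves the mask untouched.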
Fixes: 9ea35a25d51b ("mm: introduce VMA flags bitmap type")
Cc: stable@vger.kernel.org
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/mm.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f2612a70fb1..71b11945e4fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -286,8 +286,17 @@ extern unsigned int kobjsize(const void *objp);
*/
typedef int __bitwise vma_flag_t;
-#define DECLARE_VMA_BIT(name, bitnum) \
- VMA_ ## name ## _BIT = ((__force vma_flag_t)bitnum)
+/*
+ * VMA_NO_BIT means "no bit"; mk_vma_flags() skips it. DECLARE_VMA_BIT()
+ * below uses it for any bit number that doesn't fit in the bitmap, so
+ * callers don't need to track which bits are valid on the current build.
+ */
+#define VMA_NO_BIT ((__force vma_flag_t)-1)
+
+#define DECLARE_VMA_BIT(name, bitnum) \
+ VMA_ ## name ## _BIT = (((bitnum) < NUM_VMA_FLAG_BITS) ? \
+ ((__force vma_flag_t)(bitnum)) : \
+ VMA_NO_BIT)
#define DECLARE_VMA_BIT_ALIAS(name, aliased) \
VMA_ ## name ## _BIT = (VMA_ ## aliased ## _BIT)
enum {
@@ -1081,6 +1090,8 @@ static __always_inline void vma_flags_set_flag(vma_flags_t *flags,
{
unsigned long *bitmap = flags->__vma_flags;
+ if ((__force int)bit < 0)
+ return;
__set_bit((__force int)bit, bitmap);
}
--
2.54.0
* [PATCH v5 05/18] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (3 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 04/18] mm: skip out-of-range bits in mk_vma_flags() Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 06/18] mm: rename uffd-wp PTE bit macros to uffd Kiryl Shutsemau
` (12 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
pte_protnone() and pmd_protnone() detect present-but-inaccessible page
table entries. This capability is useful beyond NUMA balancing -- for
example, userfaultfd working set tracking uses protnone PTEs to track
page access without unmapping pages.
Introduce CONFIG_ARCH_HAS_PTE_PROTNONE to decouple the protnone PTE
infrastructure from CONFIG_NUMA_BALANCING. The six architectures that
support protnone PTEs (x86_64, arm64, powerpc, s390, riscv, loongarch)
now select this option, and CONFIG_NUMA_BALANCING depends on it.
No functional change -- the same set of architectures continues to have
working protnone support, but the infrastructure is now available
independently of NUMA balancing.
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 7 ++---
arch/loongarch/Kconfig | 1 +
arch/loongarch/include/asm/pgtable.h | 4 +--
arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++---
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/pgtable.h | 7 ++---
arch/s390/Kconfig | 1 +
arch/s390/include/asm/pgtable.h | 4 +--
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 8 ++---
include/linux/pgtable.h | 32 ++++++++++++++------
init/Kconfig | 8 +++++
mm/debug_vm_pgtable.c | 4 +--
15 files changed, 52 insertions(+), 36 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..319470b3b1bb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -78,6 +78,7 @@ config ARM64
select ARCH_SUPPORTS_CFI
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+ select ARCH_HAS_PTE_PROTNONE
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_SUPPORTS_PER_VMA_LOCK
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4dfa42b7d053..873f4ea2e288 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -553,10 +553,7 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * See the comment in include/linux/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
/*
@@ -575,7 +572,7 @@ static inline int pmd_protnone(pmd_t pmd)
{
return pte_protnone(pmd_pte(pmd));
}
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#define pmd_present(pmd) pte_present(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 606597da46b8..c085f5067b3b 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -67,6 +67,7 @@ config LOONGARCH
select ARCH_SUPPORTS_LTO_CLANG
select ARCH_SUPPORTS_LTO_CLANG_THIN
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+ select ARCH_HAS_PTE_PROTNONE if 64BIT
select ARCH_SUPPORTS_NUMA_BALANCING if NUMA
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_RT
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 2a0b63ae421f..d295447a2763 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -619,7 +619,7 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline long pte_protnone(pte_t pte)
{
return (pte_val(pte) & _PAGE_PROTNONE);
@@ -629,7 +629,7 @@ static inline long pmd_protnone(pmd_t pmd)
{
return (pmd_val(pmd) & _PAGE_PROTNONE);
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
#define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..53a0c5892548 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -490,13 +490,13 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte)
}
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) ==
cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
static inline bool pte_hw_valid(pte_t pte)
{
@@ -1067,12 +1067,12 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#endif
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pmd_protnone(pmd_t pmd)
{
return pte_protnone(pmd_pte(pmd));
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#define pmd_write(pmd) pte_write(pmd_pte(pmd))
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..36b64a24cf30 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -87,6 +87,7 @@ config PPC_BOOK3S_64
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_SPLIT_PMD_PTLOCK
select ARCH_SUPPORTS_HUGETLBFS
+ select ARCH_HAS_PTE_PROTNONE
select ARCH_SUPPORTS_NUMA_BALANCING
select HAVE_MOVE_PMD
select HAVE_MOVE_PUD
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c5754942cf85..e2c5776d18cf 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -71,6 +71,7 @@ config RISCV
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS if 64BIT && MMU
select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU
select ARCH_SUPPORTS_PER_VMA_LOCK if MMU
+ select ARCH_HAS_PTE_PROTNONE if MMU
select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_SCHED_MC if SMP
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index a1a7c6520a09..48a127323b21 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -524,10 +524,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
PAGE_SIZE)
#endif
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * See the comment in include/asm-generic/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE)) == _PAGE_PROT_NONE;
@@ -537,7 +534,7 @@ static inline int pmd_protnone(pmd_t pmd)
{
return pte_protnone(pmd_pte(pmd));
}
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
/* Modify page protection bits */
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ecbcbb781e40..bc5bef08454b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -151,6 +151,7 @@ config S390
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && CC_IS_CLANG
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+ select ARCH_HAS_PTE_PROTNONE
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_SUPPORTS_PER_VMA_LOCK
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c6cee8241e0..97241dea5573 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -842,7 +842,7 @@ static inline int pte_same(pte_t a, pte_t b)
return pte_val(a) == pte_val(b);
}
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return pte_present(pte) && !(pte_val(pte) & _PAGE_READ);
@@ -853,7 +853,7 @@ static inline int pmd_protnone(pmd_t pmd)
/* pmd_leaf(pmd) implies pmd_present(pmd) */
return pmd_leaf(pmd) && !(pmd_val(pmd) & _SEGMENT_ENTRY_READ);
}
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
static inline bool pte_swp_exclusive(pte_t pte)
{
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3f7cb01d69d..9da1119e8ff6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_PAGE_TABLE_CHECK if X86_64
+ select ARCH_HAS_PTE_PROTNONE if X86_64
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096
select ARCH_SUPPORTS_CFI if X86_64
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2187e9cfcefa..c7f014cbf0a9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -985,11 +985,7 @@ static inline int pmd_present(pmd_t pmd)
return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
}
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * These work without NUMA balancing but the kernel does not care. See the
- * comment in include/linux/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
static inline int pte_protnone(pte_t pte)
{
return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
@@ -1001,7 +997,7 @@ static inline int pmd_protnone(pmd_t pmd)
return (pmd_flags(pmd) & (_PAGE_PROTNONE | _PAGE_PRESENT))
== _PAGE_PROTNONE;
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
static inline int pmd_none(pmd_t pmd)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a..b6516a11adfa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -2052,18 +2052,26 @@ static inline int pud_trans_unstable(pud_t *pud)
return 0;
}
-#ifndef CONFIG_NUMA_BALANCING
+#ifndef CONFIG_ARCH_HAS_PTE_PROTNONE
/*
- * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
- * perfectly valid to indicate "no" in that case, which is why our default
- * implementation defaults to "always no".
+ * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It
+ * is perfectly valid to indicate "no" in that case, which is why our
+ * default implementation defaults to "always no".
*
- * In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE
- * page protection due to NUMA hinting. NUMA hinting faults only apply in
- * accessible VMAs.
+ * In an accessible VMA, pte_protnone() reliably indicates a present
+ * PROT_NONE page protection. Today the kernel uses such PTEs for two
+ * purposes: NUMA hinting faults, and userfaultfd RWP tracking on
+ * VM_UFFD_RWP VMAs. The two are distinguished by the uffd PTE bit and
+ * the VMA flag; see include/linux/userfaultfd_k.h.
*
- * So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault,
- * looking at the VMA accessibility is sufficient.
+ * So, to reliably identify PROT_NONE PTEs that require kernel handling,
+ * looking at the VMA accessibility (and the uffd bit on RWP VMAs) is
+ * sufficient.
+ *
+ * Architectures without CONFIG_ARCH_HAS_PTE_PROTNONE get the always-zero
+ * stubs below; PAGE_NONE references that survive dead-code elimination
+ * trip the BUILD_BUG() fallback at build time, since callers should have
+ * folded such paths away via IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE).
*/
static inline int pte_protnone(pte_t pte)
{
@@ -2074,7 +2082,11 @@ static inline int pmd_protnone(pmd_t pmd)
{
return 0;
}
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifndef PAGE_NONE
+#define PAGE_NONE ({ BUILD_BUG(); (pgprot_t){0}; })
+#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
#endif /* CONFIG_MMU */
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..58abb7f19206 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -944,6 +944,13 @@ config SCHED_PROXY_EXEC
endmenu
+#
+# For architectures that support present-but-inaccessible (PROT_NONE) page
+# table entries detectable via pte_protnone() / pmd_protnone():
+#
+config ARCH_HAS_PTE_PROTNONE
+ bool
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
@@ -1010,6 +1017,7 @@ config ARCH_WANT_NUMA_VARIABLE_LOCALITY
config NUMA_BALANCING
bool "Memory placement aware NUMA scheduler"
depends on ARCH_SUPPORTS_NUMA_BALANCING
+ depends on ARCH_HAS_PTE_PROTNONE
depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
depends on SMP && NUMA_MIGRATION && !PREEMPT_RT
help
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 23dc3ee09561..5e9f3a35f924 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -672,7 +672,7 @@ static void __init pte_protnone_tests(struct pgtable_debug_args *args)
{
pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot_none);
- if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE))
return;
pr_debug("Validating PTE protnone\n");
@@ -685,7 +685,7 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
{
pmd_t pmd;
- if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE))
return;
if (!has_transparent_hugepage())
--
2.54.0
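
For readers following the include/linux/pgtable.h hunk above, here is a minimal userspace sketch of the pattern it relies on: an always-zero pte_protnone() stub paired with a compile-time guard at the call site, so the path folds to dead code when the option is off. Everything prefixed demo_/DEMO_ is hypothetical and invented for illustration; the toy PTE layout is not any architecture's, and DEMO_ARCH_HAS_PTE_PROTNONE only stands in for IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE).
 * Build with -DDEMO_ARCH_HAS_PTE_PROTNONE=1 to model an architecture
 * that selects the option.
 */
#ifndef DEMO_ARCH_HAS_PTE_PROTNONE
#define DEMO_ARCH_HAS_PTE_PROTNONE 0
#endif

/* Toy PTE layout (assumption), not taken from any real architecture. */
#define DEMO_PTE_PRESENT	(UINT64_C(1) << 0)
#define DEMO_PTE_READ		(UINT64_C(1) << 1)

#if DEMO_ARCH_HAS_PTE_PROTNONE
/* Arch-style helper: present but not readable. */
static inline int demo_pte_protnone(uint64_t pte)
{
	return (pte & DEMO_PTE_PRESENT) && !(pte & DEMO_PTE_READ);
}
#else
/* Generic fallback: always answers "no". */
static inline int demo_pte_protnone(uint64_t pte)
{
	(void)pte;
	return 0;
}
#endif

/* Call site: the constant guard lets the compiler drop the whole path. */
static bool demo_needs_protnone_handling(uint64_t pte)
{
	if (!DEMO_ARCH_HAS_PTE_PROTNONE)
		return false;
	return demo_pte_protnone(pte) != 0;
}
```

With the default of 0, demo_needs_protnone_handling() returns false for every input, which mirrors the early return in pte_protnone_tests(). The sketch deliberately does not model the BUILD_BUG() fallback for PAGE_NONE, since that depends on the kernel's build-time machinery.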
* [PATCH v5 06/18] mm: rename uffd-wp PTE bit macros to uffd
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (4 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 05/18] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 07/18] mm: rename uffd-wp PTE accessors " Kiryl Shutsemau
` (11 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
The uffd-wp PTE bit is about to gain a second consumer: userfaultfd
RWP will use the same bit to mark access-tracking PTEs, distinct
from mprotect(PROT_NONE) or NUMA-hinting PTEs. WP vs RWP semantics
come from the VMA flag; the bit is just "uffd has claimed this
entry." Drop the "_wp" suffix from the arch-private bit macros so
they reflect that.
x86: _PAGE_BIT_UFFD_WP -> _PAGE_BIT_UFFD
_PAGE_UFFD_WP -> _PAGE_UFFD
_PAGE_SWP_UFFD_WP -> _PAGE_SWP_UFFD
arm64: PTE_UFFD_WP -> PTE_UFFD
PTE_SWP_UFFD_WP -> PTE_SWP_UFFD
riscv: _PAGE_UFFD_WP -> _PAGE_UFFD
_PAGE_SWP_UFFD_WP -> _PAGE_SWP_UFFD
Pure mechanical rename -- no behavior change.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
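
Illustration only, not part of the patch: a rough userspace sketch of the idea in the changelog, that a single "uffd has claimed this entry" bit is interpreted through the VMA flag. All demo_/DEMO_ names and bit positions are hypothetical; the real VM_UFFD_RWP flag only arrives later in the series, and the kernel's actual checks may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen arbitrarily for this sketch. */
#define DEMO_PAGE_UFFD		(UINT64_C(1) << 10)	/* shared "claimed by uffd" PTE bit */
#define DEMO_VM_UFFD_WP		(1u << 0)		/* VMA registered for write-protect */
#define DEMO_VM_UFFD_RWP	(1u << 1)		/* VMA registered for access tracking */

enum demo_meaning {
	DEMO_NOT_CLAIMED,
	DEMO_WRITE_PROTECTED,
	DEMO_ACCESS_TRACKED,
};

/* The PTE bit says "uffd owns this entry"; the VMA flag says why. */
static enum demo_meaning demo_classify(uint64_t pte, unsigned int vm_flags)
{
	if (!(pte & DEMO_PAGE_UFFD))
		return DEMO_NOT_CLAIMED;
	if (vm_flags & DEMO_VM_UFFD_RWP)
		return DEMO_ACCESS_TRACKED;
	if (vm_flags & DEMO_VM_UFFD_WP)
		return DEMO_WRITE_PROTECTED;
	return DEMO_NOT_CLAIMED;
}
```

The same PTE value classifies differently depending on vm_flags, which is why a name ending in "_wp" no longer describes the bit.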
arch/arm64/include/asm/pgtable-prot.h | 8 ++++----
arch/arm64/include/asm/pgtable.h | 12 ++++++------
arch/riscv/include/asm/pgtable-bits.h | 12 ++++++------
arch/riscv/include/asm/pgtable.h | 14 +++++++-------
arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
arch/x86/include/asm/pgtable_types.h | 16 ++++++++--------
6 files changed, 43 insertions(+), 43 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 212ce1b02e15..09d7c00cf405 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -28,11 +28,11 @@
#define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
-#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
+#define PTE_UFFD (_AT(pteval_t, 1) << 58) /* userfaultfd tracking */
+#define PTE_SWP_UFFD (_AT(pteval_t, 1) << 3) /* only for swp ptes */
#else
-#define PTE_UFFD_WP (_AT(pteval_t, 0))
-#define PTE_SWP_UFFD_WP (_AT(pteval_t, 0))
+#define PTE_UFFD (_AT(pteval_t, 0))
+#define PTE_SWP_UFFD (_AT(pteval_t, 0))
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#define _PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 873f4ea2e288..3eecb2c17711 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -343,17 +343,17 @@ static inline pmd_t pmd_mknoncont(pmd_t pmd)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & PTE_UFFD_WP);
+ return !!(pte_val(pte) & PTE_UFFD);
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
- return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD_WP)));
+ return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD)));
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
- return clear_pte_bit(pte, __pgprot(PTE_UFFD_WP));
+ return clear_pte_bit(pte, __pgprot(PTE_UFFD));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -539,17 +539,17 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
- return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+ return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
static inline int pte_swp_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & PTE_SWP_UFFD_WP);
+ return !!(pte_val(pte) & PTE_SWP_UFFD);
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
- return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+ return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index b422d9691e60..d5a86b4df3ce 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -40,20 +40,20 @@
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-/* ext_svrsw60t59b: Bit(60) for uffd-wp tracking */
-#define _PAGE_UFFD_WP \
+/* ext_svrsw60t59b: Bit(60) for userfaultfd tracking */
+#define _PAGE_UFFD \
((riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)) ? \
(1UL << 60) : 0)
/*
* Bit 4 is not involved into swap entry computation, so we
- * can borrow it for swap page uffd-wp tracking.
+ * can borrow it for swap page userfaultfd tracking.
*/
-#define _PAGE_SWP_UFFD_WP \
+#define _PAGE_SWP_UFFD \
((riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)) ? \
_PAGE_USER : 0)
#else
-#define _PAGE_UFFD_WP 0
-#define _PAGE_SWP_UFFD_WP 0
+#define _PAGE_UFFD 0
+#define _PAGE_SWP_UFFD 0
#endif
#define _PAGE_TABLE _PAGE_PRESENT
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 48a127323b21..ca69948b3ed8 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -405,32 +405,32 @@ static inline pte_t pte_wrprotect(pte_t pte)
static inline bool pte_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & _PAGE_UFFD_WP);
+ return !!(pte_val(pte) & _PAGE_UFFD);
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
- return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD_WP));
+ return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD));
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_UFFD_WP));
+ return __pte(pte_val(pte) & ~(_PAGE_UFFD));
}
static inline bool pte_swp_uffd_wp(pte_t pte)
{
- return !!(pte_val(pte) & _PAGE_SWP_UFFD_WP);
+ return !!(pte_val(pte) & _PAGE_SWP_UFFD);
}
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_SWP_UFFD_WP);
+ return __pte(pte_val(pte) | _PAGE_SWP_UFFD);
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD_WP));
+ return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -1157,7 +1157,7 @@ static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
* bit 0: _PAGE_PRESENT (zero)
* bit 1 to 2: (zero)
* bit 3: _PAGE_SWP_SOFT_DIRTY
- * bit 4: _PAGE_SWP_UFFD_WP
+ * bit 4: _PAGE_SWP_UFFD
* bit 5: _PAGE_PROT_NONE (zero)
* bit 6: exclusive marker
* bits 7 to 11: swap type
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c7f014cbf0a9..038c806b50a2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -413,17 +413,17 @@ static inline pte_t pte_wrprotect(pte_t pte)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
- return pte_flags(pte) & _PAGE_UFFD_WP;
+ return pte_flags(pte) & _PAGE_UFFD;
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
- return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
+ return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD));
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_UFFD_WP);
+ return pte_clear_flags(pte, _PAGE_UFFD);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -528,17 +528,17 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pmd_uffd_wp(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_UFFD_WP;
+ return pmd_flags(pmd) & _PAGE_UFFD;
}
static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
{
- return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD_WP));
+ return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD));
}
static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+ return pmd_clear_flags(pmd, _PAGE_UFFD);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
@@ -1550,32 +1550,32 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+ return pte_set_flags(pte, _PAGE_SWP_UFFD);
}
static inline int pte_swp_uffd_wp(pte_t pte)
{
- return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+ return pte_flags(pte) & _PAGE_SWP_UFFD;
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+ return pte_clear_flags(pte, _PAGE_SWP_UFFD);
}
static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+ return pmd_set_flags(pmd, _PAGE_SWP_UFFD);
}
static inline int pmd_swp_uffd_wp(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+ return pmd_flags(pmd) & _PAGE_SWP_UFFD;
}
static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+ return pmd_clear_flags(pmd, _PAGE_SWP_UFFD);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 2ec250ba467e..af08d98be930 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -31,7 +31,7 @@
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
-#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
+#define _PAGE_BIT_UFFD _PAGE_BIT_SOFTW2 /* userfaultfd tracking */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_KERNEL_4K _PAGE_BIT_SOFTW3 /* page must not be converted to large */
@@ -39,7 +39,7 @@
#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
#define _PAGE_BIT_NOPTISHADOW _PAGE_BIT_SOFTW5 /* No PTI shadow (root PGD) */
#else
-/* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */
+/* Shared with _PAGE_BIT_UFFD which is not supported on 32 bit */
#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW2 /* Saved Dirty bit (leaf) */
#define _PAGE_BIT_NOPTISHADOW _PAGE_BIT_SOFTW2 /* No PTI shadow (root PGD) */
#endif
@@ -111,11 +111,11 @@
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
-#define _PAGE_SWP_UFFD_WP _PAGE_USER
+#define _PAGE_UFFD (_AT(pteval_t, 1) << _PAGE_BIT_UFFD)
+#define _PAGE_SWP_UFFD _PAGE_USER
#else
-#define _PAGE_UFFD_WP (_AT(pteval_t, 0))
-#define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0))
+#define _PAGE_UFFD (_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD (_AT(pteval_t, 0))
#endif
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
@@ -129,7 +129,7 @@
/*
* The hardware requires shadow stack to be Write=0,Dirty=1. However,
* there are valid cases where the kernel might create read-only PTEs that
- * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In
+ * are dirty (e.g., fork(), mprotect(), userfaultfd, soft-dirty tracking). In
* this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit,
* to avoid creating a wrong "shadow stack" PTEs. Such PTEs have
* (Write=0,SavedDirty=1,Dirty=0) set.
@@ -151,7 +151,7 @@
#define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | \
_PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \
- _PAGE_CC | _PAGE_UFFD_WP)
+ _PAGE_CC | _PAGE_UFFD)
#define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
#define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
--
2.54.0
* [PATCH v5 07/18] mm: rename uffd-wp PTE accessors to uffd
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (5 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 06/18] mm: rename uffd-wp PTE bit macros to uffd Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau
` (10 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Userfaultfd RWP will reuse the uffd-wp PTE bit to mark access-tracking
PTEs, alongside the write-protected ones it already marks. The bit's
meaning now depends on the VMA flag (WP or RWP), not on its name.
Rename the kernel-internal names that describe the bit:
- pte/pmd/huge_pte accessors (and swap variants)
- pgtable_supports_uffd() capability query
- SCAN_PTE_UFFD khugepaged enum
The ftrace string emitted by mm_khugepaged_scan_pmd for this enum is
kept as "pte_uffd_wp" so existing trace-based tooling keeps matching.
Pure mechanical rename -- no behavior change.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
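
Illustration only, not part of the patch: a small sketch of how an enumerator can be renamed while the string exposed to tracing stays put, as the changelog describes for SCAN_PTE_UFFD. The enum, table, and helper below are hypothetical miniatures; the kernel builds its trace strings through its own macros in include/trace/events/huge_memory.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature of a scan-result enum; only the naming matters. */
enum demo_scan_result {
	DEMO_SCAN_SUCCEED,
	DEMO_SCAN_PTE_UFFD,	/* identifier renamed, "_WP" suffix dropped */
	DEMO_SCAN_RESULT_COUNT,
};

/* Strings seen by tooling; deliberately independent of the identifiers. */
static const char *const demo_scan_names[DEMO_SCAN_RESULT_COUNT] = {
	[DEMO_SCAN_SUCCEED]	= "succeed",
	[DEMO_SCAN_PTE_UFFD]	= "pte_uffd_wp",	/* kept as-is for existing parsers */
};

static const char *demo_scan_name(enum demo_scan_result r)
{
	if ((unsigned int)r >= DEMO_SCAN_RESULT_COUNT)
		return "unknown";
	return demo_scan_names[r];
}
```

Because the table is keyed by the enumerator, the C identifier can change without touching the emitted string, so scripts matching "pte_uffd_wp" keep working.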
arch/arm64/include/asm/pgtable.h | 28 +++++++--------
arch/riscv/include/asm/pgtable.h | 38 ++++++++++----------
arch/s390/include/asm/hugetlb.h | 12 +++----
arch/x86/include/asm/pgtable.h | 24 ++++++-------
fs/proc/task_mmu.c | 44 +++++++++++------------
include/asm-generic/hugetlb.h | 18 +++++-----
include/asm-generic/pgtable_uffd.h | 32 ++++++++---------
include/linux/leafops.h | 4 +--
include/linux/mm_inline.h | 4 +--
include/linux/swapops.h | 4 +--
include/linux/userfaultfd_k.h | 14 ++++----
include/trace/events/huge_memory.h | 2 +-
mm/huge_memory.c | 56 +++++++++++++++---------------
mm/hugetlb.c | 46 ++++++++++++------------
mm/internal.h | 4 +--
mm/khugepaged.c | 22 ++++++------
mm/memory.c | 34 +++++++++---------
mm/migrate.c | 12 +++----
mm/migrate_device.c | 8 ++---
mm/mprotect.c | 12 +++----
mm/mremap.c | 4 +--
mm/page_table_check.c | 8 ++---
mm/rmap.c | 16 ++++-----
mm/swapfile.c | 4 +--
mm/userfaultfd.c | 6 ++--
25 files changed, 228 insertions(+), 228 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3eecb2c17711..c41e4d59dc9f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,17 +341,17 @@ static inline pmd_t pmd_mknoncont(pmd_t pmd)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pte_uffd_wp(pte_t pte)
+static inline int pte_uffd(pte_t pte)
{
return !!(pte_val(pte) & PTE_UFFD);
}
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
{
return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD)));
}
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
{
return clear_pte_bit(pte, __pgprot(PTE_UFFD));
}
@@ -537,17 +537,17 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
{
return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
-static inline int pte_swp_uffd_wp(pte_t pte)
+static inline int pte_swp_uffd(pte_t pte)
{
return !!(pte_val(pte) & PTE_SWP_UFFD);
}
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
}
@@ -590,13 +590,13 @@ static inline int pmd_protnone(pmd_t pmd)
#define pmd_mkvalid_k(pmd) pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
#define pmd_mkinvalid(pmd) pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd))
-#define pmd_mkuffd_wp(pmd) pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)))
-#define pmd_clear_uffd_wp(pmd) pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)))
-#define pmd_swp_uffd_wp(pmd) pte_swp_uffd_wp(pmd_pte(pmd))
-#define pmd_swp_mkuffd_wp(pmd) pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)))
-#define pmd_swp_clear_uffd_wp(pmd) \
- pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
+#define pmd_uffd(pmd) pte_uffd(pmd_pte(pmd))
+#define pmd_mkuffd(pmd) pte_pmd(pte_mkuffd(pmd_pte(pmd)))
+#define pmd_clear_uffd(pmd) pte_pmd(pte_clear_uffd(pmd_pte(pmd)))
+#define pmd_swp_uffd(pmd) pte_swp_uffd(pmd_pte(pmd))
+#define pmd_swp_mkuffd(pmd) pte_pmd(pte_swp_mkuffd(pmd_pte(pmd)))
+#define pmd_swp_clear_uffd(pmd) \
+ pte_pmd(pte_swp_clear_uffd(pmd_pte(pmd)))
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#define pmd_write(pmd) pte_write(pmd_pte(pmd))
@@ -1512,7 +1512,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
* Encode and decode a swap entry:
* bits 0-1: present (must be zero)
* bits 2: remember PG_anon_exclusive
- * bit 3: remember uffd-wp state
+ * bit 3: remember uffd state
* bits 6-10: swap type
* bit 11: PTE_PRESENT_INVALID (must be zero)
* bits 12-61: swap offset
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ca69948b3ed8..b111e134795e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -400,35 +400,35 @@ static inline pte_t pte_wrprotect(pte_t pte)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define pgtable_supports_uffd_wp() \
+#define pgtable_supports_uffd() \
riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)
-static inline bool pte_uffd_wp(pte_t pte)
+static inline bool pte_uffd(pte_t pte)
{
return !!(pte_val(pte) & _PAGE_UFFD);
}
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
{
return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD));
}
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
{
return __pte(pte_val(pte) & ~(_PAGE_UFFD));
}
-static inline bool pte_swp_uffd_wp(pte_t pte)
+static inline bool pte_swp_uffd(pte_t pte)
{
return !!(pte_val(pte) & _PAGE_SWP_UFFD);
}
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_SWP_UFFD);
}
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD));
}
@@ -886,34 +886,34 @@ static inline pud_t pud_mkspecial(pud_t pud)
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline bool pmd_uffd_wp(pmd_t pmd)
+static inline bool pmd_uffd(pmd_t pmd)
{
- return pte_uffd_wp(pmd_pte(pmd));
+ return pte_uffd(pmd_pte(pmd));
}
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_mkuffd(pmd_t pmd)
{
- return pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_mkuffd(pmd_pte(pmd)));
}
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_clear_uffd(pmd_t pmd)
{
- return pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_clear_uffd(pmd_pte(pmd)));
}
-static inline bool pmd_swp_uffd_wp(pmd_t pmd)
+static inline bool pmd_swp_uffd(pmd_t pmd)
{
- return pte_swp_uffd_wp(pmd_pte(pmd));
+ return pte_swp_uffd(pmd_pte(pmd));
}
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
{
- return pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_swp_mkuffd(pmd_pte(pmd)));
}
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
{
- return pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)));
+ return pte_pmd(pte_swp_clear_uffd(pmd_pte(pmd)));
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 6983e52eaf81..cf8a176ff3d8 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -77,20 +77,20 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
__set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
}
-#define __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
-static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_MKUFFD
+static inline pte_t huge_pte_mkuffd(pte_t pte)
{
return pte;
}
-#define __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
-static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD
+static inline pte_t huge_pte_clear_uffd(pte_t pte)
{
return pte;
}
-#define __HAVE_ARCH_HUGE_PTE_UFFD_WP
-static inline int huge_pte_uffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_UFFD
+static inline int huge_pte_uffd(pte_t pte)
{
return 0;
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 038c806b50a2..d14c84b2a332 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -411,17 +411,17 @@ static inline pte_t pte_wrprotect(pte_t pte)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pte_uffd_wp(pte_t pte)
+static inline int pte_uffd(pte_t pte)
{
return pte_flags(pte) & _PAGE_UFFD;
}
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
{
return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD));
}
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_UFFD);
}
@@ -526,17 +526,17 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pmd_uffd_wp(pmd_t pmd)
+static inline int pmd_uffd(pmd_t pmd)
{
return pmd_flags(pmd) & _PAGE_UFFD;
}
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_mkuffd(pmd_t pmd)
{
return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD));
}
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_clear_uffd(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_UFFD);
}
@@ -1548,32 +1548,32 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_UFFD);
}
-static inline int pte_swp_uffd_wp(pte_t pte)
+static inline int pte_swp_uffd(pte_t pte)
{
return pte_flags(pte) & _PAGE_SWP_UFFD;
}
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_UFFD);
}
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_SWP_UFFD);
}
-static inline int pmd_swp_uffd_wp(pmd_t pmd)
+static inline int pmd_swp_uffd(pmd_t pmd)
{
return pmd_flags(pmd) & _PAGE_SWP_UFFD;
}
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_SWP_UFFD);
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e21a38ac745b..1e5f6ee8a3b6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2035,14 +2035,14 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
- if (pte_uffd_wp(pte))
+ if (pte_uffd(pte))
flags |= PM_UFFD_WP;
} else {
softleaf_t entry;
if (pte_swp_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
- if (pte_swp_uffd_wp(pte))
+ if (pte_swp_uffd(pte))
flags |= PM_UFFD_WP;
entry = softleaf_from_pte(pte);
if (pm->show_pfn) {
@@ -2108,7 +2108,7 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_PRESENT;
if (pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- if (pmd_uffd_wp(pmd))
+ if (pmd_uffd(pmd))
flags |= PM_UFFD_WP;
if (pm->show_pfn)
frame = pmd_pfn(pmd) + idx;
@@ -2127,7 +2127,7 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_SWAP;
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- if (pmd_swp_uffd_wp(pmd))
+ if (pmd_swp_uffd(pmd))
flags |= PM_UFFD_WP;
VM_WARN_ON_ONCE(!pmd_is_migration_entry(pmd));
page = softleaf_to_page(entry);
@@ -2233,14 +2233,14 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
!hugetlb_pmd_shared(ptep))
flags |= PM_MMAP_EXCLUSIVE;
- if (huge_pte_uffd_wp(pte))
+ if (huge_pte_uffd(pte))
flags |= PM_UFFD_WP;
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pte_pfn(pte) +
((addr & ~hmask) >> PAGE_SHIFT);
- } else if (pte_swp_uffd_wp_any(pte)) {
+ } else if (pte_swp_uffd_any(pte)) {
flags |= PM_UFFD_WP;
}
@@ -2441,7 +2441,7 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_PRESENT;
- if (!pte_uffd_wp(pte))
+ if (!pte_uffd(pte))
categories |= PAGE_IS_WRITTEN;
if (p->masks_of_interest & PAGE_IS_FILE) {
@@ -2459,7 +2459,7 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_wp_any(pte))
+ if (!pte_swp_uffd_any(pte))
categories |= PAGE_IS_WRITTEN;
entry = softleaf_from_pte(pte);
@@ -2484,13 +2484,13 @@ static void make_uffd_wp_pte(struct vm_area_struct *vma,
pte_t old_pte;
old_pte = ptep_modify_prot_start(vma, addr, pte);
- ptent = pte_mkuffd_wp(old_pte);
+ ptent = pte_mkuffd(old_pte);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
} else if (pte_none(ptent)) {
set_pte_at(vma->vm_mm, addr, pte,
make_pte_marker(PTE_MARKER_UFFD_WP));
} else {
- ptent = pte_swp_mkuffd_wp(ptent);
+ ptent = pte_swp_mkuffd(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
}
}
@@ -2509,7 +2509,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
struct page *page;
categories |= PAGE_IS_PRESENT;
- if (!pmd_uffd_wp(pmd))
+ if (!pmd_uffd(pmd))
categories |= PAGE_IS_WRITTEN;
if (p->masks_of_interest & PAGE_IS_FILE) {
@@ -2524,7 +2524,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
categories |= PAGE_IS_SOFT_DIRTY;
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pmd_swp_uffd_wp(pmd))
+ if (!pmd_swp_uffd(pmd))
categories |= PAGE_IS_WRITTEN;
if (pmd_swp_soft_dirty(pmd))
categories |= PAGE_IS_SOFT_DIRTY;
@@ -2548,10 +2548,10 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
if (pmd_present(pmd)) {
old = pmdp_invalidate_ad(vma, addr, pmdp);
- pmd = pmd_mkuffd_wp(old);
+ pmd = pmd_mkuffd(old);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
} else if (pmd_is_migration_entry(pmd)) {
- pmd = pmd_swp_mkuffd_wp(pmd);
+ pmd = pmd_swp_mkuffd(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
}
@@ -2573,7 +2573,7 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
if (pte_present(pte)) {
categories |= PAGE_IS_PRESENT;
- if (!huge_pte_uffd_wp(pte))
+ if (!huge_pte_uffd(pte))
categories |= PAGE_IS_WRITTEN;
if (!PageAnon(pte_page(pte)))
categories |= PAGE_IS_FILE;
@@ -2584,7 +2584,7 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_wp_any(pte))
+ if (!pte_swp_uffd_any(pte))
categories |= PAGE_IS_WRITTEN;
if (pte_swp_soft_dirty(pte))
categories |= PAGE_IS_SOFT_DIRTY;
@@ -2612,12 +2612,12 @@ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
if (softleaf_is_migration(entry)) {
set_huge_pte_at(vma->vm_mm, addr, ptep,
- pte_swp_mkuffd_wp(ptent), psize);
+ pte_swp_mkuffd(ptent), psize);
} else {
pte_t old_pte, new_pte;
old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
- new_pte = huge_pte_mkuffd_wp(old_pte);
+ new_pte = huge_pte_mkuffd(old_pte);
huge_ptep_modify_prot_commit(vma, addr, ptep, old_pte, new_pte);
}
}
@@ -2850,8 +2850,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
pte_t ptent = ptep_get(pte);
- if ((pte_present(ptent) && pte_uffd_wp(ptent)) ||
- pte_swp_uffd_wp_any(ptent))
+ if ((pte_present(ptent) && pte_uffd(ptent)) ||
+ pte_swp_uffd_any(ptent))
continue;
make_uffd_wp_pte(vma, addr, pte, ptent);
if (!flush_end)
@@ -2868,8 +2868,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long next = addr + PAGE_SIZE;
pte_t ptent = ptep_get(pte);
- if ((pte_present(ptent) && pte_uffd_wp(ptent)) ||
- pte_swp_uffd_wp_any(ptent))
+ if ((pte_present(ptent) && pte_uffd(ptent)) ||
+ pte_swp_uffd_any(ptent))
continue;
ret = pagemap_scan_output(p->cur_vma_category | PAGE_IS_WRITTEN,
p, addr, &next);
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index e1a2e1b7c8e7..635c41cc3479 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -37,24 +37,24 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
}
-#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
-static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD
+static inline pte_t huge_pte_mkuffd(pte_t pte)
{
- return huge_pte_wrprotect(pte_mkuffd_wp(pte));
+ return huge_pte_wrprotect(pte_mkuffd(pte));
}
#endif
-#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
-static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD
+static inline pte_t huge_pte_clear_uffd(pte_t pte)
{
- return pte_clear_uffd_wp(pte);
+ return pte_clear_uffd(pte);
}
#endif
-#ifndef __HAVE_ARCH_HUGE_PTE_UFFD_WP
-static inline int huge_pte_uffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_UFFD
+static inline int huge_pte_uffd(pte_t pte)
{
- return pte_uffd_wp(pte);
+ return pte_uffd(pte);
}
#endif
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 0d85791efdf7..30e88fc1de2f 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,79 +2,79 @@
#define _ASM_GENERIC_PGTABLE_UFFD_H
/*
- * Some platforms can customize the uffd-wp bit, making it unavailable
+ * Some platforms can customize the uffd PTE bit, making it unavailable
* even if the architecture provides the resource.
* Adding this API allows architectures to add their own checks for the
* devices on which the kernel is running.
* Note: When overriding it, please make sure the
* CONFIG_HAVE_ARCH_USERFAULTFD_WP is part of this macro.
*/
-#ifndef pgtable_supports_uffd_wp
-#define pgtable_supports_uffd_wp() IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP)
+#ifndef pgtable_supports_uffd
+#define pgtable_supports_uffd() IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP)
#endif
static inline bool uffd_supports_wp_marker(void)
{
- return pgtable_supports_uffd_wp() && IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP);
+ return pgtable_supports_uffd() && IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP);
}
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static __always_inline int pte_uffd_wp(pte_t pte)
+static __always_inline int pte_uffd(pte_t pte)
{
return 0;
}
-static __always_inline int pmd_uffd_wp(pmd_t pmd)
+static __always_inline int pmd_uffd(pmd_t pmd)
{
return 0;
}
-static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+static __always_inline pte_t pte_mkuffd(pte_t pte)
{
return pte;
}
-static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static __always_inline pmd_t pmd_mkuffd(pmd_t pmd)
{
return pmd;
}
-static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+static __always_inline pte_t pte_clear_uffd(pte_t pte)
{
return pte;
}
-static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static __always_inline pmd_t pmd_clear_uffd(pmd_t pmd)
{
return pmd;
}
-static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static __always_inline pte_t pte_swp_mkuffd(pte_t pte)
{
return pte;
}
-static __always_inline int pte_swp_uffd_wp(pte_t pte)
+static __always_inline int pte_swp_uffd(pte_t pte)
{
return 0;
}
-static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static __always_inline pte_t pte_swp_clear_uffd(pte_t pte)
{
return pte;
}
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
{
return pmd;
}
-static inline int pmd_swp_uffd_wp(pmd_t pmd)
+static inline int pmd_swp_uffd(pmd_t pmd)
{
return 0;
}
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
{
return pmd;
}
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 992cd8bd8ed0..2ce2f37ac883 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -100,8 +100,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
if (pmd_swp_soft_dirty(pmd))
pmd = pmd_swp_clear_soft_dirty(pmd);
- if (pmd_swp_uffd_wp(pmd))
- pmd = pmd_swp_clear_uffd_wp(pmd);
+ if (pmd_swp_uffd(pmd))
+ pmd = pmd_swp_clear_uffd(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
/* Temporary until swp_entry_t eliminated. */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f0..2811caf4188d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -600,14 +600,14 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
return false;
/* A uffd-wp wr-protected normal pte */
- if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ if (unlikely(pte_present(pteval) && pte_uffd(pteval)))
arm_uffd_pte = true;
/*
* A uffd-wp wr-protected swap pte. Note: this should even cover an
* existing pte marker with uffd-wp bit set.
*/
- if (unlikely(pte_swp_uffd_wp_any(pteval)))
+ if (unlikely(pte_swp_uffd_any(pteval)))
arm_uffd_pte = true;
if (unlikely(arm_uffd_pte)) {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8cfc966eae48..15c6440e38dd 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -73,8 +73,8 @@ static inline pte_t pte_swp_clear_flags(pte_t pte)
pte = pte_swp_clear_exclusive(pte);
if (pte_swp_soft_dirty(pte))
pte = pte_swp_clear_soft_dirty(pte);
- if (pte_swp_uffd_wp(pte))
- pte = pte_swp_clear_uffd_wp(pte);
+ if (pte_swp_uffd(pte))
+ pte = pte_swp_clear_uffd(pte);
return pte;
}
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3ec8e1071673..f4cf5763f92c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -211,13 +211,13 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
- return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+ return userfaultfd_wp(vma) && pte_uffd(pte);
}
static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
pmd_t pmd)
{
- return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+ return userfaultfd_wp(vma) && pmd_uffd(pmd);
}
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
@@ -272,10 +272,10 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
}
/*
- * Returns true if this is a swap pte and was uffd-wp wr-protected in either
- * forms (pte marker or a normal swap pte), false otherwise.
+ * Returns true if this swap pte carries uffd-tracked state in either
+ * form (pte marker or a normal swap pte), false otherwise.
*/
-static inline bool pte_swp_uffd_wp_any(pte_t pte)
+static inline bool pte_swp_uffd_any(pte_t pte)
{
if (!uffd_supports_wp_marker())
return false;
@@ -283,7 +283,7 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte)
if (pte_present(pte))
return false;
- if (pte_swp_uffd_wp(pte))
+ if (pte_swp_uffd(pte))
return true;
if (pte_is_uffd_wp_marker(pte))
@@ -424,7 +424,7 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
* Returns true if this is a swap pte and was uffd-wp wr-protected in either
* forms (pte marker or a normal swap pte), false otherwise.
*/
-static inline bool pte_swp_uffd_wp_any(pte_t pte)
+static inline bool pte_swp_uffd_any(pte_t pte)
{
return false;
}
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 291fae364c62..5a48c5406cce 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -16,7 +16,7 @@
EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
- EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
+ EM( SCAN_PTE_UFFD, "pte_uffd_wp") \
EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
EM( SCAN_PAGE_NULL, "page_null") \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7c895b1d366..d43c2255f47d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1909,8 +1909,8 @@ static void copy_huge_non_present_pmd(
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
- if (pmd_swp_uffd_wp(*src_pmd))
- pmd = pmd_swp_mkuffd_wp(pmd);
+ if (pmd_swp_uffd(*src_pmd))
+ pmd = pmd_swp_mkuffd(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
} else if (softleaf_is_device_private(entry)) {
/*
@@ -1923,8 +1923,8 @@ static void copy_huge_non_present_pmd(
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
- if (pmd_swp_uffd_wp(*src_pmd))
- pmd = pmd_swp_mkuffd_wp(pmd);
+ if (pmd_swp_uffd(*src_pmd))
+ pmd = pmd_swp_mkuffd(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
@@ -1944,7 +1944,7 @@ static void copy_huge_non_present_pmd(
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
if (!userfaultfd_wp(dst_vma))
- pmd = pmd_swp_clear_uffd_wp(pmd);
+ pmd = pmd_swp_clear_uffd(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
}
@@ -2040,7 +2040,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
pmdp_set_wrprotect(src_mm, addr, src_pmd);
if (!userfaultfd_wp(dst_vma))
- pmd = pmd_clear_uffd_wp(pmd);
+ pmd = pmd_clear_uffd(pmd);
pmd = pmd_wrprotect(pmd);
set_pmd:
pmd = pmd_mkold(pmd);
@@ -2581,9 +2581,9 @@ static pmd_t clear_uffd_wp_pmd(pmd_t pmd)
if (pmd_none(pmd))
return pmd;
if (pmd_present(pmd))
- pmd = pmd_clear_uffd_wp(pmd);
+ pmd = pmd_clear_uffd(pmd);
else
- pmd = pmd_swp_clear_uffd_wp(pmd);
+ pmd = pmd_swp_clear_uffd(pmd);
return pmd;
}
@@ -2663,16 +2663,16 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
} else if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
newpmd = swp_entry_to_pmd(entry);
- if (pmd_swp_uffd_wp(*pmd))
- newpmd = pmd_swp_mkuffd_wp(newpmd);
+ if (pmd_swp_uffd(*pmd))
+ newpmd = pmd_swp_mkuffd(newpmd);
} else {
newpmd = *pmd;
}
if (uffd_wp)
- newpmd = pmd_swp_mkuffd_wp(newpmd);
+ newpmd = pmd_swp_mkuffd(newpmd);
else if (uffd_wp_resolve)
- newpmd = pmd_swp_clear_uffd_wp(newpmd);
+ newpmd = pmd_swp_clear_uffd(newpmd);
if (!pmd_same(*pmd, newpmd))
set_pmd_at(mm, addr, pmd, newpmd);
}
@@ -2753,14 +2753,14 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
entry = pmd_modify(oldpmd, newprot);
if (uffd_wp)
- entry = pmd_mkuffd_wp(entry);
+ entry = pmd_mkuffd(entry);
else if (uffd_wp_resolve)
/*
* Leave the write bit to be handled by PF interrupt
* handler, then things like COW could be properly
* handled.
*/
- entry = pmd_clear_uffd_wp(entry);
+ entry = pmd_clear_uffd(entry);
/* See change_pte_range(). */
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
@@ -3103,8 +3103,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
entry = pfn_pte(zero_pfn(addr), vma->vm_page_prot);
entry = pte_mkspecial(entry);
- if (pmd_uffd_wp(old_pmd))
- entry = pte_mkuffd_wp(entry);
+ if (pmd_uffd(old_pmd))
+ entry = pte_mkuffd(entry);
VM_BUG_ON(!pte_none(ptep_get(pte)));
set_pte_at(mm, addr, pte, entry);
pte++;
@@ -3188,7 +3188,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
folio = page_folio(page);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
- uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ uffd_wp = pmd_swp_uffd(old_pmd);
write = softleaf_is_migration_write(entry);
if (PageAnon(page))
@@ -3204,7 +3204,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
folio = page_folio(page);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
- uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ uffd_wp = pmd_swp_uffd(old_pmd);
write = softleaf_is_device_private_write(entry);
anon_exclusive = PageAnonExclusive(page);
@@ -3261,7 +3261,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
- uffd_wp = pmd_uffd_wp(old_pmd);
+ uffd_wp = pmd_uffd(old_pmd);
VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
@@ -3332,7 +3332,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
+ entry = pte_swp_mkuffd(entry);
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3359,7 +3359,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
+ entry = pte_swp_mkuffd(entry);
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3377,7 +3377,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
for (i = 0; i < HPAGE_PMD_NR; i++)
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
@@ -5018,8 +5018,8 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
- if (pmd_uffd_wp(pmdval))
- pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ if (pmd_uffd(pmdval))
+ pmdswp = pmd_swp_mkuffd(pmdswp);
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
folio_remove_rmap_pmd(folio, page, vma);
folio_put(folio);
@@ -5049,8 +5049,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mksoft_dirty(pmde);
if (softleaf_is_migration_write(entry))
pmde = pmd_mkwrite(pmde, vma);
- if (pmd_swp_uffd_wp(*pvmw->pmd))
- pmde = pmd_mkuffd_wp(pmde);
+ if (pmd_swp_uffd(*pvmw->pmd))
+ pmde = pmd_mkuffd(pmde);
if (!softleaf_is_migration_young(entry))
pmde = pmd_mkold(pmde);
/* NOTE: this may contain setting soft-dirty on some archs */
@@ -5070,8 +5070,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_swp_mksoft_dirty(pmde);
- if (pmd_swp_uffd_wp(*pvmw->pmd))
- pmde = pmd_swp_mkuffd_wp(pmde);
+ if (pmd_swp_uffd(*pvmw->pmd))
+ pmde = pmd_swp_mkuffd(pmde);
}
if (folio_test_anon(folio)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 571212b80835..d0c81a056ae2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4843,8 +4843,8 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long add
__folio_mark_uptodate(new_folio);
hugetlb_add_new_anon_rmap(new_folio, vma, addr);
- if (userfaultfd_wp(vma) && huge_pte_uffd_wp(old))
- newpte = huge_pte_mkuffd_wp(newpte);
+ if (userfaultfd_wp(vma) && huge_pte_uffd(old))
+ newpte = huge_pte_mkuffd(newpte);
set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz);
hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
folio_set_hugetlb_migratable(new_folio);
@@ -4918,10 +4918,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = softleaf_from_pte(entry);
if (unlikely(softleaf_is_hwpoison(softleaf))) {
if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd_wp(entry);
+ entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(softleaf_is_migration(softleaf))) {
- bool uffd_wp = pte_swp_uffd_wp(entry);
+ bool uffd = pte_swp_uffd(entry);
if (!softleaf_is_migration_read(softleaf) && cow) {
/*
@@ -4931,12 +4931,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = make_readable_migration_entry(
swp_offset(softleaf));
entry = swp_entry_to_pte(softleaf);
- if (userfaultfd_wp(src_vma) && uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
+ if (userfaultfd_wp(src_vma) && uffd)
+ entry = pte_swp_mkuffd(entry);
set_huge_pte_at(src, addr, src_pte, entry, sz);
}
if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd_wp(entry);
+ entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(pte_is_marker(entry))) {
const pte_marker marker = copy_pte_marker(softleaf, dst_vma);
@@ -5013,7 +5013,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd_wp(entry);
+ entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
hugetlb_count_add(npages, dst);
@@ -5061,9 +5061,9 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
} else {
if (need_clear_uffd_wp) {
if (pte_present(pte))
- pte = huge_pte_clear_uffd_wp(pte);
+ pte = huge_pte_clear_uffd(pte);
else
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_uffd(pte);
}
set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
}
@@ -5197,7 +5197,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
* drop the uffd-wp bit in this zap, then replace the
* pte with a marker.
*/
- if (pte_swp_uffd_wp_any(pte) &&
+ if (pte_swp_uffd_any(pte) &&
!(zap_flags & ZAP_FLAG_DROP_MARKER))
set_huge_pte_at(mm, address, ptep,
make_pte_marker(PTE_MARKER_UFFD_WP),
@@ -5233,7 +5233,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
folio_mark_dirty(folio);
/* Leave a uffd-wp pte marker if needed */
- if (huge_pte_uffd_wp(pte) &&
+ if (huge_pte_uffd(pte) &&
!(zap_flags & ZAP_FLAG_DROP_MARKER))
set_huge_pte_at(mm, address, ptep,
make_pte_marker(PTE_MARKER_UFFD_WP),
@@ -5437,7 +5437,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
* can trigger this, because hugetlb_fault() will always resolve
* uffd-wp bit first.
*/
- if (!unshare && huge_pte_uffd_wp(pte))
+ if (!unshare && huge_pte_uffd(pte))
return 0;
/* Let's take out MAP_SHARED mappings first. */
@@ -5581,8 +5581,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
huge_ptep_clear_flush(vma, vmf->address, vmf->pte);
hugetlb_remove_rmap(old_folio);
hugetlb_add_new_anon_rmap(new_folio, vma, vmf->address);
- if (huge_pte_uffd_wp(pte))
- newpte = huge_pte_mkuffd_wp(newpte);
+ if (huge_pte_uffd(pte))
+ newpte = huge_pte_mkuffd(newpte);
set_huge_pte_at(mm, vmf->address, vmf->pte, newpte,
huge_page_size(h));
folio_set_hugetlb_migratable(new_folio);
@@ -5860,7 +5860,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
* if populated.
*/
if (unlikely(pte_is_uffd_wp_marker(vmf->orig_pte)))
- new_pte = huge_pte_mkuffd_wp(new_pte);
+ new_pte = huge_pte_mkuffd(new_pte);
set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h));
hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -6058,7 +6058,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_ptl;
/* Handle userfault-wp first, before trying to lock more pages */
- if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(mm, vmf.address, vmf.pte)) &&
+ if (userfaultfd_wp(vma) && huge_pte_uffd(huge_ptep_get(mm, vmf.address, vmf.pte)) &&
(flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
if (!userfaultfd_wp_async(vma)) {
spin_unlock(vmf.ptl);
@@ -6067,7 +6067,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_userfault(&vmf, VM_UFFD_WP);
}
- vmf.orig_pte = huge_pte_clear_uffd_wp(vmf.orig_pte);
+ vmf.orig_pte = huge_pte_clear_uffd(vmf.orig_pte);
set_huge_pte_at(mm, vmf.address, vmf.pte, vmf.orig_pte,
huge_page_size(hstate_vma(vma)));
/* Fallthrough to CoW */
@@ -6352,7 +6352,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
_dst_pte = pte_mkyoung(_dst_pte);
if (wp_enabled)
- _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+ _dst_pte = huge_pte_mkuffd(_dst_pte);
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, size);
@@ -6476,9 +6476,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
}
if (uffd_wp)
- newpte = pte_swp_mkuffd_wp(newpte);
+ newpte = pte_swp_mkuffd(newpte);
else if (uffd_wp_resolve)
- newpte = pte_swp_clear_uffd_wp(newpte);
+ newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(pte, newpte))
set_huge_pte_at(mm, address, ptep, newpte, psize);
} else if (unlikely(pte_is_marker(pte))) {
@@ -6499,9 +6499,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_modify(old_pte, newprot);
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (uffd_wp)
- pte = huge_pte_mkuffd_wp(pte);
+ pte = huge_pte_mkuffd(pte);
else if (uffd_wp_resolve)
- pte = huge_pte_clear_uffd_wp(pte);
+ pte = huge_pte_clear_uffd(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
diff --git a/mm/internal.h b/mm/internal.h
index 5602393054f3..9325eefbea6a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,8 +412,8 @@ static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
new = pte_swp_mksoft_dirty(new);
if (pte_swp_exclusive(pte))
new = pte_swp_mkexclusive(new);
- if (pte_swp_uffd_wp(pte))
- new = pte_swp_mkuffd_wp(new);
+ if (pte_swp_uffd(pte))
+ new = pte_swp_mkuffd(new);
return new;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4549a020bf73..afa218be15de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -37,7 +37,7 @@ enum scan_result {
SCAN_EXCEED_SWAP_PTE,
SCAN_EXCEED_SHARED_PTE,
SCAN_PTE_NON_PRESENT,
- SCAN_PTE_UFFD_WP,
+ SCAN_PTE_UFFD,
SCAN_PTE_MAPPED_HUGEPAGE,
SCAN_LACK_REFERENCED_PAGE,
SCAN_PAGE_NULL,
@@ -712,8 +712,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_PTE_NON_PRESENT;
goto out;
}
- if (pte_uffd_wp(pteval)) {
- result = SCAN_PTE_UFFD_WP;
+ if (pte_uffd(pteval)) {
+ result = SCAN_PTE_UFFD;
goto out;
}
page = vm_normal_page(vma, addr, pteval);
@@ -1566,7 +1566,7 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
case SCAN_PAGE_NULL:
case SCAN_DEL_PAGE_LRU:
case SCAN_PTE_NON_PRESENT:
- case SCAN_PTE_UFFD_WP:
+ case SCAN_PTE_UFFD:
case SCAN_ALLOC_HUGE_PAGE_FAIL:
case SCAN_PAGE_LAZYFREE:
goto next_order;
@@ -1666,15 +1666,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
/*
* Always be strict with uffd-wp
* enabled swap entries. Please see
- * comment below for pte_uffd_wp().
+ * comment below for pte_uffd().
*/
- if (pte_swp_uffd_wp_any(pteval)) {
- result = SCAN_PTE_UFFD_WP;
+ if (pte_swp_uffd_any(pteval)) {
+ result = SCAN_PTE_UFFD;
goto out_unmap;
}
continue;
}
- if (pte_uffd_wp(pteval)) {
+ if (pte_uffd(pteval)) {
/*
* Don't collapse the page if any of the small
* PTEs are armed with uffd write protection.
@@ -1684,7 +1684,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
* userfault messages that falls outside of
* the registered range. So, just be simple.
*/
- result = SCAN_PTE_UFFD_WP;
+ result = SCAN_PTE_UFFD;
goto out_unmap;
}
@@ -1897,7 +1897,7 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
if (userfaultfd_wp(vma))
- return SCAN_PTE_UFFD_WP;
+ return SCAN_PTE_UFFD;
folio = filemap_lock_folio(vma->vm_file->f_mapping,
linear_page_index(vma, haddr));
@@ -3244,7 +3244,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
/* Whitelisted set of results where continuing OK */
case SCAN_NO_PTE_TABLE:
case SCAN_PTE_NON_PRESENT:
- case SCAN_PTE_UFFD_WP:
+ case SCAN_PTE_UFFD:
case SCAN_LACK_REFERENCED_PAGE:
case SCAN_PAGE_NULL:
case SCAN_PAGE_COUNT:
diff --git a/mm/memory.c b/mm/memory.c
index 7c020995eafc..c4fd5cb4a08f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -893,8 +893,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(orig_pte))
- pte = pte_mkuffd_wp(pte);
+ if (pte_swp_uffd(orig_pte))
+ pte = pte_mkuffd(pte);
if ((vma->vm_flags & VM_WRITE) &&
can_change_pte_writable(vma, address, pte)) {
@@ -984,8 +984,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = softleaf_to_pte(entry);
if (pte_swp_soft_dirty(orig_pte))
pte = pte_swp_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(orig_pte))
- pte = pte_swp_mkuffd_wp(pte);
+ if (pte_swp_uffd(orig_pte))
+ pte = pte_swp_mkuffd(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (softleaf_is_device_private(entry)) {
@@ -1018,8 +1018,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
entry = make_readable_device_private_entry(
swp_offset(entry));
pte = swp_entry_to_pte(entry);
- if (pte_swp_uffd_wp(orig_pte))
- pte = pte_swp_mkuffd_wp(pte);
+ if (pte_swp_uffd(orig_pte))
+ pte = pte_swp_mkuffd(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (softleaf_is_device_exclusive(entry)) {
@@ -1042,7 +1042,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return 0;
}
if (!userfaultfd_wp(dst_vma))
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_uffd(pte);
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
}
@@ -1090,7 +1090,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte)))
/* Uffd-wp needs to be delivered to dest pte as well */
- pte = pte_mkuffd_wp(pte);
+ pte = pte_mkuffd(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
return 0;
}
@@ -1113,7 +1113,7 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
pte = pte_mkold(pte);
if (!userfaultfd_wp(dst_vma))
- pte = pte_clear_uffd_wp(pte);
+ pte = pte_clear_uffd(pte);
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
}
@@ -3925,8 +3925,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (unlikely(unshare)) {
if (pte_soft_dirty(vmf->orig_pte))
entry = pte_mksoft_dirty(entry);
- if (pte_uffd_wp(vmf->orig_pte))
- entry = pte_mkuffd_wp(entry);
+ if (pte_uffd(vmf->orig_pte))
+ entry = pte_mkuffd(entry);
} else {
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
@@ -4261,7 +4261,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
* etc.) because we're only removing the uffd-wp bit,
* which is completely invisible to the user.
*/
- pte = pte_clear_uffd_wp(ptep_get(vmf->pte));
+ pte = pte_clear_uffd(ptep_get(vmf->pte));
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
/*
@@ -5038,8 +5038,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte = mk_pte(page, vma->vm_page_prot);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(vmf->orig_pte))
- pte = pte_mkuffd_wp(pte);
+ if (pte_swp_uffd(vmf->orig_pte))
+ pte = pte_mkuffd(pte);
/*
* Same logic as in do_wp_page(); however, optimize for pages that are
@@ -5255,7 +5255,7 @@ void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry), vma);
if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
folio_ref_add(folio, nr_pages - 1);
folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
@@ -5322,7 +5322,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_MISSING);
}
if (vmf_orig_pte_uffd_wp(vmf))
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
/* No need to invalidate - it was non-present before */
@@ -5572,7 +5572,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
else if (pte_write(entry) && folio_test_dirty(folio))
entry = pte_mkdirty(entry);
if (unlikely(vmf_orig_pte_uffd_wp(vmf)))
- entry = pte_mkuffd_wp(entry);
+ entry = pte_mkuffd(entry);
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
VM_BUG_ON_FOLIO(nr != 1, folio);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0c6a0ab6ecce..4bdb5be7afbf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -326,8 +326,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
if (pte_swp_soft_dirty(old_pte))
newpte = pte_mksoft_dirty(newpte);
- if (pte_swp_uffd_wp(old_pte))
- newpte = pte_mkuffd_wp(newpte);
+ if (pte_swp_uffd(old_pte))
+ newpte = pte_mkuffd(newpte);
set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
@@ -391,8 +391,8 @@ static bool remove_migration_pte(struct folio *folio,
if (softleaf_is_migration_write(entry))
pte = pte_mkwrite(pte, vma);
- else if (pte_swp_uffd_wp(old_pte))
- pte = pte_mkuffd_wp(pte);
+ else if (pte_swp_uffd(old_pte))
+ pte = pte_mkuffd(pte);
if (folio_test_anon(folio) && !softleaf_is_migration_read(entry))
rmap_flags |= RMAP_EXCLUSIVE;
@@ -407,8 +407,8 @@ static bool remove_migration_pte(struct folio *folio,
pte = softleaf_to_pte(entry);
if (pte_swp_soft_dirty(old_pte))
pte = pte_swp_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(old_pte))
- pte = pte_swp_mkuffd_wp(pte);
+ if (pte_swp_uffd(old_pte))
+ pte = pte_swp_mkuffd(pte);
}
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 554754eb26ff..17da1bab0248 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -445,13 +445,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pte))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_uffd(pte))
+ swp_pte = pte_swp_mkuffd(swp_pte);
} else {
if (pte_swp_soft_dirty(pte))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pte))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_swp_uffd(pte))
+ swp_pte = pte_swp_mkuffd(swp_pte);
}
set_pte_at(mm, addr, ptep, swp_pte);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9cbf932b028c..8340c8b228c6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -240,8 +240,8 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
*/
entry = make_readable_device_private_entry(swp_offset(entry));
newpte = swp_entry_to_pte(entry);
- if (pte_swp_uffd_wp(oldpte))
- newpte = pte_swp_mkuffd_wp(newpte);
+ if (pte_swp_uffd(oldpte))
+ newpte = pte_swp_mkuffd(newpte);
} else if (softleaf_is_marker(entry)) {
/*
* Ignore error swap entries unconditionally,
@@ -266,9 +266,9 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
}
if (uffd_wp)
- newpte = pte_swp_mkuffd_wp(newpte);
+ newpte = pte_swp_mkuffd(newpte);
else if (uffd_wp_resolve)
- newpte = pte_swp_clear_uffd_wp(newpte);
+ newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(oldpte, newpte)) {
set_pte_at(vma->vm_mm, addr, pte, newpte);
@@ -290,9 +290,9 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
ptent = pte_modify(oldpte, newprot);
if (uffd_wp)
- ptent = pte_mkuffd_wp(ptent);
+ ptent = pte_mkuffd(ptent);
else if (uffd_wp_resolve)
- ptent = pte_clear_uffd_wp(ptent);
+ ptent = pte_clear_uffd(ptent);
/*
* In some writable, shared mappings, we might want
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832..12732a5c547e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -297,9 +297,9 @@ static int move_ptes(struct pagetable_move_control *pmc,
else {
if (need_clear_uffd_wp) {
if (pte_present(pte))
- pte = pte_clear_uffd_wp(pte);
+ pte = pte_clear_uffd(pte);
else
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_uffd(pte);
}
set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
}
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 53a8997ec043..3fb995e5d40d 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -188,8 +188,8 @@ static inline bool softleaf_cached_writable(softleaf_t entry)
static void page_table_check_pte_flags(pte_t pte)
{
if (pte_present(pte)) {
- WARN_ON_ONCE(pte_uffd_wp(pte) && pte_write(pte));
- } else if (pte_swp_uffd_wp(pte)) {
+ WARN_ON_ONCE(pte_uffd(pte) && pte_write(pte));
+ } else if (pte_swp_uffd(pte)) {
const softleaf_t entry = softleaf_from_pte(pte);
WARN_ON_ONCE(softleaf_cached_writable(entry));
@@ -216,9 +216,9 @@ EXPORT_SYMBOL(__page_table_check_ptes_set);
static inline void page_table_check_pmd_flags(pmd_t pmd)
{
if (pmd_present(pmd)) {
- if (pmd_uffd_wp(pmd))
+ if (pmd_uffd(pmd))
WARN_ON_ONCE(pmd_write(pmd));
- } else if (pmd_swp_uffd_wp(pmd)) {
+ } else if (pmd_swp_uffd(pmd)) {
const softleaf_t entry = softleaf_from_pmd(pmd);
WARN_ON_ONCE(softleaf_cached_writable(entry));
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c77d5dc06e9..546bc1cf9391 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2318,13 +2318,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (likely(pte_present(pteval))) {
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
} else {
if (pte_swp_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_swp_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
}
set_pte_at(mm, address, pvmw.pte, swp_pte);
} else {
@@ -2692,14 +2692,14 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
} else {
swp_pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (pte_swp_uffd(pteval))
+ swp_pte = pte_swp_mkuffd(swp_pte);
}
if (folio_test_hugetlb(folio))
set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3d126602a1e..15fdca2da1f7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2557,8 +2557,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
if (pte_swp_soft_dirty(old_pte))
new_pte = pte_mksoft_dirty(new_pte);
- if (pte_swp_uffd_wp(old_pte))
- new_pte = pte_mkuffd_wp(new_pte);
+ if (pte_swp_uffd(old_pte))
+ new_pte = pte_mkuffd(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6d2a1c67019..9d74be69873a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -394,7 +394,7 @@ static int mfill_atomic_install_pte(pmd_t *dst_pmd,
if (writable)
_dst_pte = pte_mkwrite(_dst_pte, dst_vma);
if (flags & MFILL_ATOMIC_WP)
- _dst_pte = pte_mkuffd_wp(_dst_pte);
+ _dst_pte = pte_mkuffd(_dst_pte);
ret = -EAGAIN;
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
@@ -3591,7 +3591,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
vm_flags |= VM_UFFD_MISSING;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
- if (!pgtable_supports_uffd_wp())
+ if (!pgtable_supports_uffd())
goto out;
vm_flags |= VM_UFFD_WP;
@@ -4301,7 +4301,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
uffdio_api.features &=
~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
#endif
- if (!pgtable_supports_uffd_wp())
+ if (!pgtable_supports_uffd())
uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
if (!uffd_supports_wp_marker()) {
--
2.54.0
* [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (6 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 07/18] mm: rename uffd-wp PTE accessors " Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-29 7:24 ` Lorenzo Stoakes
2026-05-26 13:04 ` [PATCH v5 09/18] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau
` (9 subsequent siblings)
17 siblings, 1 reply; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Preparatory patch for userfaultfd read-write protection (RWP). RWP
extends userfaultfd protection from plain write-protection (WP) to
full read-write protection: accesses to an RWP-protected range --
reads as well as writes -- trap through userfaultfd.
Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and
userfaultfd_protected() helpers, and wire up the smaps "ur" entry and
the trace-flag table the rest of the series will use. The flag is
gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the
UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and
every downstream check folds to dead code.
Nothing sets or queries the flag yet.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
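Note (illustration only, not part of the commit): the "aliases VM_NONE"
point in the commit message can be sketched in user space as below. All
demo_* and DEMO_* names are hypothetical stand-ins, the WP bit position is
an assumption, and DEMO_USERFAULTFD_RWP only models
CONFIG_USERFAULTFD_RWP. Bit 43 is the one reserved by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not the kernel definitions. */
#define DEMO_VM_NONE		UINT64_C(0)
#define DEMO_VM_UFFD_WP		(UINT64_C(1) << 12)	/* assumed position */

/* Build with -DDEMO_USERFAULTFD_RWP to model CONFIG_USERFAULTFD_RWP=y. */
#ifdef DEMO_USERFAULTFD_RWP
#define DEMO_VM_UFFD_RWP	(UINT64_C(1) << 43)
#else
#define DEMO_VM_UFFD_RWP	DEMO_VM_NONE
#endif

struct demo_vma {
	uint64_t vm_flags;
};

static inline bool demo_userfaultfd_wp(const struct demo_vma *vma)
{
	return vma->vm_flags & DEMO_VM_UFFD_WP;
}

static inline bool demo_userfaultfd_rwp(const struct demo_vma *vma)
{
	/* With the flag aliased to 0 the mask is empty: always false. */
	return vma->vm_flags & DEMO_VM_UFFD_RWP;
}

static inline bool demo_userfaultfd_protected(const struct demo_vma *vma)
{
	return demo_userfaultfd_wp(vma) || demo_userfaultfd_rwp(vma);
}
```

Without the define, demo_userfaultfd_rwp() reduces to a constant false, so
any branch guarded by it can be dropped by the compiler. That is the sense
in which downstream checks fold to dead code until the Kconfig symbol
exists.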
Documentation/filesystems/proc.rst | 1 +
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 28 +++++++++++++++++----------
include/linux/userfaultfd_k.h | 31 +++++++++++++++++++++++++-----
include/trace/events/mmflags.h | 7 +++++++
5 files changed, 55 insertions(+), 15 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index db6167befb7b..db28207c5290 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -607,6 +607,7 @@ encoded manner. The codes are the following:
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
ui userfaultfd minor fault
+ ur userfaultfd read-write-protect tracking
ss shadow/guarded control stack page
sl sealed
lf lock on fault pages
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1e5f6ee8a3b6..974c5f4aa533 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1237,6 +1237,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_USERFAULTFD_RWP
+ [ilog2(VM_UFFD_RWP)] = "ur",
+#endif
#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
[ilog2(VM_SHADOW_STACK)] = "ss",
#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 71b11945e4fc..6499cfb61dc4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -362,6 +362,7 @@ enum {
#endif
DECLARE_VMA_BIT(UFFD_MINOR, 41),
DECLARE_VMA_BIT(SEALED, 42),
+ DECLARE_VMA_BIT(UFFD_RWP, 43),
/* Flags that reuse flags above. */
DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0),
DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1),
@@ -505,6 +506,11 @@ enum {
#else
#define VM_UFFD_MINOR VM_NONE
#endif
+#ifdef CONFIG_USERFAULTFD_RWP
+#define VM_UFFD_RWP INIT_VM_FLAG(UFFD_RWP)
+#else
+#define VM_UFFD_RWP VM_NONE
+#endif
#ifdef CONFIG_64BIT
#define VM_ALLOW_ANY_UNCACHED INIT_VM_FLAG(ALLOW_ANY_UNCACHED)
#define VM_SEALED INIT_VM_FLAG(SEALED)
@@ -642,22 +648,24 @@ enum {
* reconsistuted upon page fault, so necessitate page table copying upon fork.
*
* Note that these flags should be compared with the DESTINATION VMA not the
- * source, as VM_UFFD_WP may not be propagated to destination, while all other
- * flags will be.
+ * source: VM_UFFD_WP and VM_UFFD_RWP may be cleared on the destination
+ * (dup_userfaultfd() -> userfaultfd_reset_ctx() when the parent context did
+ * not negotiate UFFD_FEATURE_EVENT_FORK), while all other flags propagate.
*
* VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot be
* reasonably reconstructed on page fault.
*
* VM_UFFD_WP - Encodes metadata about an installed uffd
- * write protect handler, which cannot be
- * reconstructed on page fault.
+ * VM_UFFD_RWP write- or read-write-protect handler, which
+ * cannot be reconstructed on page fault.
*
- * We always copy pgtables when dst_vma has uffd-wp
- * enabled even if it's file-backed
- * (e.g. shmem). Because when uffd-wp is enabled,
- * pgtable contains uffd-wp protection information,
- * that's something we can't retrieve from page cache,
- * and skip copying will lose those info.
+ * We always copy pgtables when dst_vma has the
+ * uffd PTE bit in use, even if it's file-backed
+ * (e.g. shmem). When the uffd bit is in use, the
+ * pgtable holds protection information that
+ * cannot be retrieved from the page cache, so
+ * skipping the copy would lose that
+ * information.
*
* VM_MAYBE_GUARD - Could contain page guard region markers which
* by design are a property of the page tables
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index f4cf5763f92c..0aef628514df 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -21,10 +21,11 @@
#include <linux/hugetlb_inline.h>
/* The set of all possible UFFD-related VM flags. */
-#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
+#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \
+ VM_UFFD_WP | VM_UFFD_RWP)
#define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT, \
- VMA_UFFD_MINOR_BIT)
+ VMA_UFFD_MINOR_BIT, VMA_UFFD_RWP_BIT)
/*
* CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -178,7 +179,7 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
*/
static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
{
- return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+ return vma->vm_flags & (VM_UFFD_MINOR | VM_UFFD_WP | VM_UFFD_RWP);
}
/*
@@ -208,6 +209,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_MINOR;
}
+static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_RWP;
+}
+
+static inline bool userfaultfd_protected(struct vm_area_struct *vma)
+{
+ return userfaultfd_wp(vma) || userfaultfd_rwp(vma);
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
@@ -328,6 +339,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
+{
+ return false;
+}
+
+static inline bool userfaultfd_protected(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
@@ -421,8 +442,8 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
}
/*
- * Returns true if this is a swap pte and was uffd-wp wr-protected in either
- * forms (pte marker or a normal swap pte), false otherwise.
+ * Returns true if this swap pte carries uffd-tracked state in either
+ * form (pte marker or a normal swap pte), false otherwise.
*/
static inline bool pte_swp_uffd_any(pte_t pte)
{
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a6e5a44c9b42..bfface3d0203 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -194,6 +194,12 @@ IF_HAVE_PG_ARCH_3(arch_3)
# define IF_HAVE_UFFD_MINOR(flag, name)
#endif
+#ifdef CONFIG_USERFAULTFD_RWP
+# define IF_HAVE_UFFD_RWP(flag, name) {flag, name},
+#else
+# define IF_HAVE_UFFD_RWP(flag, name)
+#endif
+
#if defined(CONFIG_64BIT) || defined(CONFIG_PPC32)
# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
#else
@@ -215,6 +221,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \
{VM_PFNMAP, "pfnmap" }, \
{VM_MAYBE_GUARD, "maybe_guard" }, \
{VM_UFFD_WP, "uffd_wp" }, \
+IF_HAVE_UFFD_RWP(VM_UFFD_RWP, "uffd_rwp" ) \
{VM_LOCKED, "locked" }, \
{VM_IO, "io" }, \
{VM_SEQ_READ, "seqread" }, \
--
2.54.0
* [PATCH v5 09/18] mm: add MM_CP_UFFD_RWP change_protection() flag
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (7 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-29 1:19 ` SeongJae Park
2026-05-26 13:04 ` [PATCH v5 10/18] mm: preserve RWP marker across PTE rewrites Kiryl Shutsemau
` (8 subsequent siblings)
17 siblings, 1 reply; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Preparatory patch. Add the change_protection() primitive that
userfaultfd RWP will use.
An RWP-protected PTE is PAGE_NONE with the uffd PTE bit set. The
PROT_NONE half makes the CPU fault on any access; the uffd bit
distinguishes an RWP fault from a plain mprotect(PROT_NONE) or NUMA
hinting fault. MM_CP_UFFD_WP and MM_CP_UFFD_RWP share the same PTE
bit, so the two cannot be used together on the same range.
Two new change_protection() flags:
MM_CP_UFFD_RWP install PAGE_NONE and set the uffd bit
MM_CP_UFFD_RWP_RESOLVE restore vma->vm_page_prot, clear the uffd bit
Both are wired through change_pte_range(), change_huge_pmd(), and
hugetlb_change_protection() so anon, shmem, THP, and hugetlb all
share the same semantics.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
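Note (illustration only, not part of the commit): the flag rules described
in the commit message (protect and resolve are mutually exclusive, and WP
cannot be mixed with RWP) can be sketched as a standalone predicate. The
DEMO_* names are hypothetical. The RWP and WP_RESOLVE bit values mirror the
mm.h hunk in this patch; the value used for DEMO_CP_UFFD_WP is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_CP_UFFD_WP			(1UL << 2)	/* assumed value */
#define DEMO_CP_UFFD_WP_RESOLVE		(1UL << 3)
#define DEMO_CP_UFFD_WP_ALL		(DEMO_CP_UFFD_WP | \
					 DEMO_CP_UFFD_WP_RESOLVE)
#define DEMO_CP_UFFD_RWP		(1UL << 4)
#define DEMO_CP_UFFD_RWP_RESOLVE	(1UL << 5)
#define DEMO_CP_UFFD_RWP_ALL		(DEMO_CP_UFFD_RWP | \
					 DEMO_CP_UFFD_RWP_RESOLVE)

/*
 * Returns true for combinations that the check added to
 * change_protection() would warn about and turn into a no-op.
 */
static inline bool demo_cp_flags_invalid(unsigned long cp_flags)
{
	unsigned long wp = cp_flags & DEMO_CP_UFFD_WP_ALL;
	unsigned long rwp = cp_flags & DEMO_CP_UFFD_RWP_ALL;

	return wp == DEMO_CP_UFFD_WP_ALL ||	/* WP protect + resolve */
	       rwp == DEMO_CP_UFFD_RWP_ALL ||	/* RWP protect + resolve */
	       (wp && rwp);			/* WP mixed with RWP */
}
```

A caller passing only DEMO_CP_UFFD_RWP or only DEMO_CP_UFFD_RWP_RESOLVE is
accepted; any mix with the WP flags is rejected because both modes share
the single uffd PTE bit.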
include/linux/mm.h | 5 ++++
include/linux/userfaultfd_k.h | 1 -
mm/huge_memory.c | 30 +++++++++++++----------
mm/hugetlb.c | 25 ++++++++++++++-----
mm/mprotect.c | 46 +++++++++++++++++++++++++++--------
5 files changed, 77 insertions(+), 30 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6499cfb61dc4..f79801816f32 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3297,6 +3297,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+/* Whether this change is for uffd RWP */
+#define MM_CP_UFFD_RWP (1UL << 4) /* do rwp */
+#define MM_CP_UFFD_RWP_RESOLVE (1UL << 5) /* resolve rwp */
+#define MM_CP_UFFD_RWP_ALL (MM_CP_UFFD_RWP | \
+ MM_CP_UFFD_RWP_RESOLVE)
bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0aef628514df..564eb2aac321 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -361,7 +361,6 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
return false;
}
-
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d43c2255f47d..40c65bf2d6dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2640,8 +2640,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
}
static void change_non_present_huge_pmd(struct mm_struct *mm,
- unsigned long addr, pmd_t *pmd, bool uffd_wp,
- bool uffd_wp_resolve)
+ unsigned long addr, pmd_t *pmd, bool uffd_prot,
+ bool uffd_prot_resolve)
{
softleaf_t entry = softleaf_from_pmd(*pmd);
const struct folio *folio = softleaf_to_folio(entry);
@@ -2669,9 +2669,9 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
newpmd = *pmd;
}
- if (uffd_wp)
+ if (uffd_prot)
newpmd = pmd_swp_mkuffd(newpmd);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
newpmd = pmd_swp_clear_uffd(newpmd);
if (!pmd_same(*pmd, newpmd))
set_pmd_at(mm, addr, pmd, newpmd);
@@ -2692,8 +2692,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spinlock_t *ptl;
pmd_t oldpmd, entry;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
- bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
- bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ bool uffd_prot = cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP);
+ bool uffd_prot_resolve = cp_flags &
+ (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE);
int ret = 1;
tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
@@ -2706,11 +2707,17 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 0;
if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) {
- change_non_present_huge_pmd(mm, addr, pmd, uffd_wp,
- uffd_wp_resolve);
+ change_non_present_huge_pmd(mm, addr, pmd, uffd_prot,
+ uffd_prot_resolve);
goto unlock;
}
+ /* Already in the desired state */
+ if (prot_numa && pmd_protnone(*pmd))
+ goto unlock;
+ if ((cp_flags & MM_CP_UFFD_RWP) && pmd_protnone(*pmd) && pmd_uffd(*pmd))
+ goto unlock;
+
if (prot_numa) {
/*
@@ -2721,9 +2728,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(*pmd))
goto unlock;
- if (pmd_protnone(*pmd))
- goto unlock;
-
if (!folio_can_map_prot_numa(pmd_folio(*pmd), vma,
vma_is_single_threaded_private(vma)))
goto unlock;
@@ -2752,9 +2756,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
entry = pmd_modify(oldpmd, newprot);
- if (uffd_wp)
+ if (uffd_prot)
entry = pmd_mkuffd(entry);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
/*
* Leave the write bit to be handled by PF interrupt
* handler, then things like COW could be properly
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0c81a056ae2..4d75b69d4272 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6395,6 +6395,8 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long last_addr_mask;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ bool uffd_rwp = cp_flags & MM_CP_UFFD_RWP;
+ bool uffd_rwp_resolve = cp_flags & MM_CP_UFFD_RWP_RESOLVE;
struct mmu_gather tlb;
/*
@@ -6420,6 +6422,11 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
ptep = hugetlb_walk(vma, address, psize);
if (!ptep) {
+ /*
+ * uffd_wp installs a pte marker on the unpopulated
+ * entry; uffd_rwp does not install markers so the
+ * allocation is unnecessary for it.
+ */
if (!uffd_wp) {
address |= last_addr_mask;
continue;
@@ -6441,7 +6448,8 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
* shouldn't happen at all. Warn about it if it
* happened due to some reason.
*/
- WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve ||
+ uffd_rwp || uffd_rwp_resolve);
pages++;
spin_unlock(ptl);
address |= last_addr_mask;
@@ -6475,9 +6483,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pages++;
}
- if (uffd_wp)
+ if (uffd_wp || uffd_rwp)
newpte = pte_swp_mkuffd(newpte);
- else if (uffd_wp_resolve)
+ else if (uffd_wp_resolve || uffd_rwp_resolve)
newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(pte, newpte))
set_huge_pte_at(mm, address, ptep, newpte, psize);
@@ -6488,19 +6496,24 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
* pte_marker_uffd_wp()==true implies !poison
* because they're mutual exclusive.
*/
- if (pte_is_uffd_wp_marker(pte) && uffd_wp_resolve)
+ if (pte_is_uffd_wp_marker(pte) &&
+ (uffd_wp_resolve || uffd_rwp_resolve))
/* Safe to modify directly (non-present->none). */
huge_pte_clear(mm, address, ptep, psize);
} else {
pte_t old_pte;
unsigned int shift = huge_page_shift(hstate_vma(vma));
+ /* Already protnone with uffd bit set? Nothing to do. */
+ if (uffd_rwp && pte_protnone(pte) && huge_pte_uffd(pte))
+ goto next;
+
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = huge_pte_modify(old_pte, newprot);
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
- if (uffd_wp)
+ if (uffd_wp || uffd_rwp)
pte = huge_pte_mkuffd(pte);
- else if (uffd_wp_resolve)
+ else if (uffd_wp_resolve || uffd_rwp_resolve)
pte = huge_pte_clear_uffd(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8340c8b228c6..4a6b35482aee 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -214,8 +214,9 @@ static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_stru
static long change_softleaf_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags)
{
- const bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
- const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ const bool uffd_prot = cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP);
+ const bool uffd_prot_resolve = cp_flags &
+ (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE);
softleaf_t entry = softleaf_from_pte(oldpte);
pte_t newpte;
@@ -256,7 +257,7 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
* to unprotect it, drop it; the next page
* fault will trigger without uffd trapping.
*/
- if (uffd_wp_resolve) {
+ if (uffd_prot_resolve) {
pte_clear(vma->vm_mm, addr, pte);
return 1;
}
@@ -265,9 +266,9 @@ static long change_softleaf_pte(struct vm_area_struct *vma,
newpte = oldpte;
}
- if (uffd_wp)
+ if (uffd_prot)
newpte = pte_swp_mkuffd(newpte);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
newpte = pte_swp_clear_uffd(newpte);
if (!pte_same(oldpte, newpte)) {
@@ -282,16 +283,17 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
int nr_ptes, unsigned long end, pgprot_t newprot,
struct folio *folio, struct page *page, unsigned long cp_flags)
{
- const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
- const bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ const bool uffd_prot = cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP);
+ const bool uffd_prot_resolve = cp_flags &
+ (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE);
pte_t ptent, oldpte;
oldpte = modify_prot_start_ptes(vma, addr, ptep, nr_ptes);
ptent = pte_modify(oldpte, newprot);
- if (uffd_wp)
+ if (uffd_prot)
ptent = pte_mkuffd(ptent);
- else if (uffd_wp_resolve)
+ else if (uffd_prot_resolve)
ptent = pte_clear_uffd(ptent);
/*
@@ -325,6 +327,7 @@ static long change_pte_range(struct mmu_gather *tlb,
long pages = 0;
bool is_private_single_threaded;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+ bool uffd_rwp = cp_flags & MM_CP_UFFD_RWP;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
int nr_ptes;
@@ -350,6 +353,14 @@ static long change_pte_range(struct mmu_gather *tlb,
/* Already in the desired state. */
if (prot_numa && pte_protnone(oldpte))
continue;
+ /*
+ * RWP-protected PTEs carry _PAGE_UFFD as a marker on
+ * top of PROT_NONE. Skip only entries already in that
+ * exact state; plain PROT_NONE from mprotect() still needs
+ * to be promoted so future faults can be distinguished.
+ */
+ if (uffd_rwp && pte_protnone(oldpte) && pte_uffd(oldpte))
+ continue;
page = vm_normal_page(vma, addr, oldpte);
if (page)
@@ -358,6 +369,8 @@ static long change_pte_range(struct mmu_gather *tlb,
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
+ * Skip this filter for uffd RWP which
+ * must set protnone regardless of NUMA placement.
*/
if (prot_numa &&
!folio_can_map_prot_numa(folio, vma,
@@ -667,7 +680,16 @@ long change_protection(struct mmu_gather *tlb,
pgprot_t newprot = vma->vm_page_prot;
long pages;
- BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
+ /*
+ * MM_CP_UFFD_{WP,RWP} and _RESOLVE are mutually exclusive within one
+ * change, and WP and RWP cannot mix. Miswired callers get a warn and
+ * a no-op; userspace cannot reach this state.
+ */
+ if (WARN_ON_ONCE((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL ||
+ (cp_flags & MM_CP_UFFD_RWP_ALL) == MM_CP_UFFD_RWP_ALL ||
+ ((cp_flags & MM_CP_UFFD_WP_ALL) &&
+ (cp_flags & MM_CP_UFFD_RWP_ALL))))
+ return 0;
#ifdef CONFIG_NUMA_BALANCING
/*
@@ -681,6 +703,10 @@ long change_protection(struct mmu_gather *tlb,
WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
#endif
+ if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE) &&
+ (cp_flags & MM_CP_UFFD_RWP))
+ newprot = PAGE_NONE;
+
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot,
cp_flags);
--
2.54.0
* [PATCH v5 10/18] mm: preserve RWP marker across PTE rewrites
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (8 preceding siblings ...)
2026-05-26 13:04 ` [PATCH v5 09/18] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau
@ 2026-05-26 13:04 ` Kiryl Shutsemau
2026-05-26 13:04 ` [PATCH v5 11/18] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Kiryl Shutsemau
` (7 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
The uffd PTE bit must survive any kernel path that rewrites a PTE
on a VM_UFFD_RWP VMA, otherwise the marker that carries PAGE_NONE
semantics is silently dropped and the next access leaks past RWP
tracking. Wire the preservation through every path that rewrites a
VM_UFFD_RWP PTE.
Swap and device-exclusive: do_swap_page(), restore_exclusive_pte(),
and unuse_pte() (swapoff()) re-apply PAGE_NONE when the swap PTE
carries the uffd bit and the VMA has VM_UFFD_RWP.
Migration: remove_migration_pte() and remove_migration_pmd() do the
same after the migration entry is replaced with a real PTE/PMD.
Fork: __copy_present_ptes(), copy_present_page(), copy_nonpresent_pte(),
copy_huge_pmd(), copy_huge_non_present_pmd(), and
copy_hugetlb_page_range() keep the uffd bit on the child when the
destination VMA has VM_UFFD_RWP, matching the existing VM_UFFD_WP
handling. Add VM_UFFD_RWP to VM_COPY_ON_FORK so the flag itself
propagates.
mprotect(): change_pte_range() and change_huge_pmd() restore PAGE_NONE
after pte_modify()/pmd_modify() have recomputed the base protection
from a (possibly user-changed) vm_page_prot. pte_modify() preserves
_PAGE_UFFD, so the bit stays; we just have to force PAGE_NONE back
on top.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
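Note (illustration only, not part of the commit): a toy model of the two
rules the commit message describes. The demo_* types and helpers are
hypothetical and collapse a PTE to "which protection" plus "uffd bit"; real
PTEs carry far more state. The first helper models a rewrite on the same
VMA (swap-in, migration, mprotect()), the second models a fork copy.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_prot {
	DEMO_PROT_NONE,		/* stands in for PAGE_NONE */
	DEMO_PROT_VMA,		/* stands in for vma->vm_page_prot */
};

struct demo_pte {
	enum demo_prot prot;
	bool uffd;		/* stands in for the uffd PTE bit */
};

/*
 * Rewrite on the same VMA: the base protection is recomputed from the
 * VMA, the uffd bit is carried over, and PAGE_NONE is forced back when
 * the VMA is RWP-registered and the bit is set.
 */
static inline struct demo_pte demo_rewrite_pte(struct demo_pte old,
					       bool vma_rwp)
{
	struct demo_pte out = { .prot = DEMO_PROT_VMA, .uffd = old.uffd };

	if (vma_rwp && out.uffd)
		out.prot = DEMO_PROT_NONE;
	return out;
}

/*
 * Fork copy: if the destination VMA is not uffd-protected, restore an
 * accessible protection for RWP-armed entries and drop the uffd bit.
 */
static inline struct demo_pte demo_copy_pte(struct demo_pte src,
					    bool src_rwp, bool dst_protected)
{
	struct demo_pte dst = src;

	if (!dst_protected) {
		if (src_rwp && dst.uffd)
			dst.prot = DEMO_PROT_VMA;
		dst.uffd = false;
	}
	return dst;
}
```

The ordering in demo_copy_pte() matters in the same way as in the hunks
below: the uffd bit has to be tested before it is cleared, otherwise the
child would keep a PAGE_NONE entry with no marker left to explain it.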
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 47 +++++++++++++++++++++++++++++++++++++----
mm/hugetlb.c | 52 ++++++++++++++++++++++++++++++++++++++--------
mm/memory.c | 49 ++++++++++++++++++++++++++++++++++++-------
mm/migrate.c | 8 +++++++
mm/mprotect.c | 10 +++++++++
mm/mremap.c | 13 ++++++++++--
mm/swapfile.c | 5 +++++
mm/userfaultfd.c | 17 +++++++++++++++
9 files changed, 181 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f79801816f32..9e62946af654 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -672,7 +672,8 @@ enum {
* only and thus cannot be reconstructed on page
* fault.
*/
-#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_MAYBE_GUARD)
+#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_UFFD_RWP | \
+ VM_MAYBE_GUARD)
/*
* mapping from the currently active vm_flags protection bits (the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40c65bf2d6dc..6417d883d2e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1943,7 +1943,7 @@ static void copy_huge_non_present_pmd(
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
pmd = pmd_swp_clear_uffd(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
}
@@ -2038,9 +2038,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
out_zero_page:
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- pmdp_set_wrprotect(src_mm, addr, src_pmd);
- if (!userfaultfd_wp(dst_vma))
+
+ /* See __copy_present_ptes(): restore accessible protection. */
+ if (!userfaultfd_protected(dst_vma)) {
+ if (userfaultfd_rwp(src_vma) && pmd_uffd(pmd))
+ pmd = pmd_modify(pmd, dst_vma->vm_page_prot);
pmd = pmd_clear_uffd(pmd);
+ }
+
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_wrprotect(pmd);
set_pmd:
pmd = pmd_mkold(pmd);
@@ -2626,8 +2632,16 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
pmd = move_soft_dirty_pmd(pmd);
- if (vma_has_uffd_without_event_remap(vma))
+ if (vma_has_uffd_without_event_remap(vma)) {
+ /*
+ * See __copy_present_ptes(): normalise RWP PMDs so
+ * the destination starts accessible instead of taking
+ * a numa-hinting fault on first access.
+ */
+ if (pmd_present(pmd) && userfaultfd_rwp(vma))
+ pmd = pmd_modify(pmd, vma->vm_page_prot);
pmd = clear_uffd_wp_pmd(pmd);
+ }
set_pmd_at(mm, new_addr, new_pmd, pmd);
if (force_flush)
flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
@@ -2766,6 +2780,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
entry = pmd_clear_uffd(entry);
+ /* See change_pte_range(): preserve RWP protection across mprotect() */
+ if (userfaultfd_rwp(vma) && pmd_uffd(entry))
+ entry = pmd_modify(entry, PAGE_NONE);
+
/* See change_pte_range(). */
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
can_change_pmd_writable(vma, addr, entry))
@@ -2933,6 +2951,13 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
_dst_pmd = move_soft_dirty_pmd(src_pmdval);
_dst_pmd = clear_uffd_wp_pmd(_dst_pmd);
}
+
+ /* Re-arm RWP on the moved PMD if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma)) {
+ _dst_pmd = pmd_modify(_dst_pmd, PAGE_NONE);
+ _dst_pmd = pmd_mkuffd(_dst_pmd);
+ }
+
set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
@@ -3109,6 +3134,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
entry = pte_mkspecial(entry);
if (pmd_uffd(old_pmd))
entry = pte_mkuffd(entry);
+
+ /* Restore PAGE_NONE so an RWP marker keeps trapping */
+ if (userfaultfd_rwp(vma) && pmd_uffd(old_pmd))
+ entry = pte_modify(entry, PAGE_NONE);
+
VM_BUG_ON(!pte_none(ptep_get(pte)));
set_pte_at(mm, addr, pte, entry);
pte++;
@@ -3383,6 +3413,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (uffd_wp)
entry = pte_mkuffd(entry);
+ /* Restore PAGE_NONE so an RWP marker keeps trapping */
+ if (userfaultfd_rwp(vma) && uffd_wp)
+ entry = pte_modify(entry, PAGE_NONE);
+
for (i = 0; i < HPAGE_PMD_NR; i++)
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
@@ -5055,6 +5089,11 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mkwrite(pmde, vma);
if (pmd_swp_uffd(*pvmw->pmd))
pmde = pmd_mkuffd(pmde);
+
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pmd_swp_uffd(*pvmw->pmd) && userfaultfd_rwp(vma))
+ pmde = pmd_modify(pmde, PAGE_NONE);
+
if (!softleaf_is_migration_young(entry))
pmde = pmd_mkold(pmde);
/* NOTE: this may contain setting soft-dirty on some archs */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4d75b69d4272..0d8d39cd8888 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4843,8 +4843,16 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long add
__folio_mark_uptodate(new_folio);
hugetlb_add_new_anon_rmap(new_folio, vma, addr);
- if (userfaultfd_wp(vma) && huge_pte_uffd(old))
+ if (userfaultfd_protected(vma) && huge_pte_uffd(old)) {
newpte = huge_pte_mkuffd(newpte);
+ /* Restore PAGE_NONE so the RWP marker keeps trapping. */
+ if (userfaultfd_rwp(vma)) {
+ unsigned int shift = huge_page_shift(hstate_vma(vma));
+
+ newpte = huge_pte_modify(newpte, PAGE_NONE);
+ newpte = arch_make_huge_pte(newpte, shift, vma->vm_flags);
+ }
+ }
set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz);
hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
folio_set_hugetlb_migratable(new_folio);
@@ -4917,7 +4925,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = softleaf_from_pte(entry);
if (unlikely(softleaf_is_hwpoison(softleaf))) {
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(softleaf_is_migration(softleaf))) {
@@ -4931,11 +4939,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
softleaf = make_readable_migration_entry(
swp_offset(softleaf));
entry = swp_entry_to_pte(softleaf);
- if (userfaultfd_wp(src_vma) && uffd)
+ if (userfaultfd_protected(src_vma) && uffd)
entry = pte_swp_mkuffd(entry);
set_huge_pte_at(src, addr, src_pte, entry, sz);
}
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
entry = huge_pte_clear_uffd(entry);
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
} else if (unlikely(pte_is_marker(entry))) {
@@ -5000,6 +5008,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
goto next;
}
+ /* See __copy_present_ptes(): restore accessible protection. */
+ if (!userfaultfd_protected(dst_vma)) {
+ if (userfaultfd_rwp(src_vma) && huge_pte_uffd(entry)) {
+ entry = huge_pte_modify(entry, dst_vma->vm_page_prot);
+ entry = arch_make_huge_pte(entry, huge_page_shift(h),
+ dst_vma->vm_flags);
+ }
+ entry = huge_pte_clear_uffd(entry);
+ }
+
if (cow) {
/*
* No need to notify as we are downgrading page
@@ -5012,9 +5030,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
entry = huge_pte_wrprotect(entry);
}
- if (!userfaultfd_wp(dst_vma))
- entry = huge_pte_clear_uffd(entry);
-
set_huge_pte_at(dst, addr, dst_pte, entry, sz);
hugetlb_count_add(npages, dst);
}
@@ -5060,10 +5075,22 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
huge_pte_clear(mm, new_addr, dst_pte, sz);
} else {
if (need_clear_uffd_wp) {
- if (pte_present(pte))
+ if (pte_present(pte)) {
+ /*
+ * See __copy_present_ptes(): normalise RWP
+ * PTEs so the destination starts accessible
+ * instead of taking a numa-hinting fault on
+ * first access.
+ */
+ if (userfaultfd_rwp(vma)) {
+ pte = huge_pte_modify(pte, vma->vm_page_prot);
+ pte = arch_make_huge_pte(pte, huge_page_shift(h),
+ vma->vm_flags);
+ }
pte = huge_pte_clear_uffd(pte);
- else
+ } else {
pte = pte_swp_clear_uffd(pte);
+ }
}
set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
}
@@ -6515,6 +6542,13 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_mkuffd(pte);
else if (uffd_wp_resolve || uffd_rwp_resolve)
pte = huge_pte_clear_uffd(pte);
+
+ /* Preserve RWP protection across mprotect() */
+ if (userfaultfd_rwp(vma) && huge_pte_uffd(pte)) {
+ pte = huge_pte_modify(pte, PAGE_NONE);
+ pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ }
+
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
diff --git a/mm/memory.c b/mm/memory.c
index c4fd5cb4a08f..06473285c0dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -896,6 +896,10 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
if (pte_swp_uffd(orig_pte))
pte = pte_mkuffd(pte);
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(orig_pte) && userfaultfd_rwp(vma))
+ pte = pte_modify(pte, PAGE_NONE);
+
if ((vma->vm_flags & VM_WRITE) &&
can_change_pte_writable(vma, address, pte)) {
if (folio_test_dirty(folio))
@@ -1041,7 +1045,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
make_pte_marker(marker));
return 0;
}
- if (!userfaultfd_wp(dst_vma))
+ if (!userfaultfd_protected(dst_vma))
pte = pte_swp_clear_uffd(pte);
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
@@ -1088,9 +1092,13 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* All done, just insert the new page copy in the child */
pte = folio_mk_pte(new_folio, dst_vma->vm_page_prot);
pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
- if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte)))
- /* Uffd-wp needs to be delivered to dest pte as well */
+ if (userfaultfd_protected(dst_vma) && pte_uffd(ptep_get(src_pte))) {
+ /* The uffd bit needs to be delivered to the dest pte as well */
pte = pte_mkuffd(pte);
+ /* Restore PAGE_NONE so the RWP marker keeps trapping */
+ if (userfaultfd_rwp(dst_vma))
+ pte = pte_modify(pte, PAGE_NONE);
+ }
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
return 0;
}
@@ -1100,9 +1108,31 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
pte_t pte, unsigned long addr, int nr)
{
struct mm_struct *src_mm = src_vma->vm_mm;
+ bool writable;
+
+ /*
+ * Snapshot writability before the RWP-disarm rewrite below: when the
+ * child is not RWP-armed, pte_modify(pte, dst_vma->vm_page_prot) can
+ * silently drop _PAGE_RW from a resolved (no-marker) writable PTE,
+ * so a later pte_write(pte) check would skip the COW wrprotect and
+ * leave the parent writable over a folio shared with the child.
+ */
+ writable = pte_write(pte);
+
+ /*
+ * Child is not RWP-armed: restore accessible protection so the
+ * inherited PAGE_NONE does not cost a fault on first read. Gate on
+ * pte_uffd(pte) so unrelated PAGE_NONE markers (e.g. NUMA balancing)
+ * are not normalised away.
+ */
+ if (!userfaultfd_protected(dst_vma)) {
+ if (userfaultfd_rwp(src_vma) && pte_uffd(pte))
+ pte = pte_modify(pte, dst_vma->vm_page_prot);
+ pte = pte_clear_uffd(pte);
+ }
/* If it's a COW mapping, write protect it both processes. */
- if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
+ if (is_cow_mapping(src_vma->vm_flags) && writable) {
wrprotect_ptes(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}
@@ -1112,9 +1142,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
- if (!userfaultfd_wp(dst_vma))
- pte = pte_clear_uffd(pte);
-
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
}
@@ -5041,6 +5068,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (pte_swp_uffd(vmf->orig_pte))
pte = pte_mkuffd(pte);
+ /*
+ * A page reclaimed while RWP-protected carries the uffd bit on
+ * its swap entry. Re-apply PAGE_NONE on swap-in so the first access
+ * still traps as an RWP fault. pte_modify() preserves _PAGE_UFFD.
+ */
+ if (pte_swp_uffd(vmf->orig_pte) && userfaultfd_rwp(vma))
+ pte = pte_modify(pte, PAGE_NONE);
+
/*
* Same logic as in do_wp_page(); however, optimize for pages that are
* certainly not shared either because we just allocated them without
diff --git a/mm/migrate.c b/mm/migrate.c
index 4bdb5be7afbf..8d7fd0b056b6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -329,6 +329,10 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
if (pte_swp_uffd(old_pte))
newpte = pte_mkuffd(newpte);
+ /* See remove_migration_pte(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(old_pte) && userfaultfd_rwp(pvmw->vma))
+ newpte = pte_modify(newpte, PAGE_NONE);
+
set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
@@ -394,6 +398,10 @@ static bool remove_migration_pte(struct folio *folio,
else if (pte_swp_uffd(old_pte))
pte = pte_mkuffd(pte);
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(old_pte) && userfaultfd_rwp(vma))
+ pte = pte_modify(pte, PAGE_NONE);
+
if (folio_test_anon(folio) && !softleaf_is_migration_read(entry))
rmap_flags |= RMAP_EXCLUSIVE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4a6b35482aee..e0b5fe7c66b2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -296,6 +296,16 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
else if (uffd_prot_resolve)
ptent = pte_clear_uffd(ptent);
+ /*
+ * The uffd bit on a VM_UFFD_RWP VMA carries PROT_NONE
+ * semantics. If mprotect() or NUMA hinting changed the
+ * base protection, restore PAGE_NONE so the PTE still
+ * traps on any access. pte_modify() preserves
+ * _PAGE_UFFD.
+ */
+ if (userfaultfd_rwp(vma) && pte_uffd(ptent))
+ ptent = pte_modify(ptent, PAGE_NONE);
+
/*
* In some writable, shared mappings, we might want
* to catch actual write access -- see
diff --git a/mm/mremap.c b/mm/mremap.c
index 12732a5c547e..8a46ec5831c8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -296,10 +296,19 @@ static int move_ptes(struct pagetable_move_control *pmc,
pte_clear(mm, new_addr, new_ptep);
else {
if (need_clear_uffd_wp) {
- if (pte_present(pte))
+ if (pte_present(pte)) {
+ /*
+ * See __copy_present_ptes(): normalise
+ * RWP PTEs so the destination starts
+ * accessible instead of taking a
+ * numa-hinting fault on first access.
+ */
+ if (userfaultfd_rwp(vma) && pte_uffd(pte))
+ pte = pte_modify(pte, vma->vm_page_prot);
pte = pte_clear_uffd(pte);
- else
+ } else {
pte = pte_swp_clear_uffd(pte);
+ }
}
set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 15fdca2da1f7..27cc299ead9b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2559,6 +2559,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mksoft_dirty(new_pte);
if (pte_swp_uffd(old_pte))
new_pte = pte_mkuffd(new_pte);
+
+ /* See do_swap_page(): restore PAGE_NONE for RWP */
+ if (pte_swp_uffd(old_pte) && userfaultfd_rwp(vma))
+ new_pte = pte_modify(new_pte, PAGE_NONE);
+
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9d74be69873a..e30878e4e00b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1285,6 +1285,13 @@ static long move_present_ptes(struct mm_struct *mm,
if (pte_dirty(orig_src_pte))
orig_dst_pte = pte_mkdirty(orig_dst_pte);
orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+
+ /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma)) {
+ orig_dst_pte = pte_modify(orig_dst_pte, PAGE_NONE);
+ orig_dst_pte = pte_mkuffd(orig_dst_pte);
+ }
+
set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
src_addr += PAGE_SIZE;
@@ -1366,6 +1373,9 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
if (pgtable_supports_soft_dirty())
orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
+ /* Re-arm RWP on the moved swap entry if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma))
+ orig_src_pte = pte_swp_mkuffd(orig_src_pte);
set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
double_pt_unlock(dst_ptl, src_ptl);
@@ -1392,6 +1402,13 @@ static int move_zeropage_pte(struct mm_struct *mm,
zero_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr),
dst_vma->vm_page_prot));
+
+ /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */
+ if (userfaultfd_rwp(dst_vma)) {
+ zero_pte = pte_modify(zero_pte, PAGE_NONE);
+ zero_pte = pte_mkuffd(zero_pte);
+ }
+
ptep_clear_flush(src_vma, src_addr, src_pte);
set_pte_at(mm, dst_addr, dst_pte, zero_pte);
double_pt_unlock(dst_ptl, src_ptl);
--
2.54.0
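
The hunks above repeat two rules: a restore path (swap-in, migration
removal, swapoff) re-applies PAGE_NONE only when the swap entry carried
the uffd bit and the VMA is RWP-registered, and the fork/mremap path
makes a PTE accessible again only when it is an RWP marker, so NUMA
protnone PTEs are left alone. A minimal sketch of those two rules as a
toy model follows. struct toy_pte and both helpers are hypothetical
names invented for illustration; nothing here is kernel code.

```c
#include <stdbool.h>

/* Toy PTE with only the bits this sketch reasons about (not a kernel type). */
struct toy_pte {
	bool uffd;      /* stands in for the uffd PTE bit */
	bool protnone;  /* true when the PTE has PAGE_NONE protection */
};

/*
 * Restore rule (swap-in, migration removal): carry the uffd bit over from
 * the swap entry, and re-apply PAGE_NONE only for an RWP-registered VMA.
 */
static struct toy_pte toy_restore(bool swp_uffd, bool vma_is_rwp)
{
	struct toy_pte pte = { .uffd = false, .protnone = false };

	if (swp_uffd)
		pte.uffd = true;
	if (swp_uffd && vma_is_rwp)
		pte.protnone = true;
	return pte;
}

/*
 * Copy rule for a destination VMA that is not uffd-protected: make the PTE
 * accessible again only if it is an RWP marker (uffd bit set), then drop
 * the uffd bit. A protnone PTE without the bit is left untouched.
 */
static struct toy_pte toy_copy_to_unprotected_child(struct toy_pte pte,
						    bool src_is_rwp)
{
	if (src_is_rwp && pte.uffd)
		pte.protnone = false;
	pte.uffd = false;
	return pte;
}
```

In this model a NUMA-style PTE (protnone set, uffd clear) passes through
toy_copy_to_unprotected_child() unchanged, which is the behaviour the
pte_uffd() gate is meant to preserve.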
* [PATCH v5 11/18] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
From: Kiryl Shutsemau @ 2026-05-26 13:04 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Three mm paths outside the fault handler gate on the uffd PTE bit
today: khugepaged (skip collapse on ranges carrying markers), rmap
(cap unmap batching), and GUP (force a fault through
gup_can_follow_protnone). Extend each to treat VM_UFFD_RWP the same
as VM_UFFD_WP; otherwise per-PTE RWP state is silently destroyed or
bypassed.
khugepaged: try_collapse_pte_mapped_thp() and
file_backed_vma_is_retractable() already refuse to collapse or
retract page tables on ranges carrying the uffd PTE bit. Broaden the
VMA predicate from userfaultfd_wp() to userfaultfd_protected() so
VM_UFFD_RWP ranges get the same protection. hpage_collapse_scan_pmd()
needs no change — its existing pte_uffd() check already catches an
RWP PTE because it carries the uffd bit.
rmap: folio_unmap_pte_batch() caps batching at 1 for VM_UFFD_RWP so
the restore path handles each PTE with its own marker.
GUP: gup_can_follow_protnone() forces a fault on VM_UFFD_RWP VMAs
regardless of FOLL_HONOR_NUMA_FAULT. RWP uses protnone as an
access-tracking marker, not for NUMA hinting, so any GUP — read or
write — must go through the userfaultfd fault path.
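
The GUP decision described above can be sketched as a small standalone
model. The flag values and the toy_ names are assumptions made up for
this example; the two checks after the RWP test paraphrase the existing
upstream logic as I understand it and are not part of this patch.

```c
#include <stdbool.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define TOY_VM_ACCESSIBLE          (1u << 0)  /* VMA grants read, write or exec */
#define TOY_VM_UFFD_RWP            (1u << 1)
#define TOY_FOLL_HONOR_NUMA_FAULT  (1u << 0)

/* Returns true when GUP may follow a protnone PTE without faulting. */
static bool toy_gup_can_follow_protnone(unsigned int vm_flags,
					unsigned int foll_flags)
{
	/* RWP: protnone is an access-tracking marker, always take the fault. */
	if (vm_flags & TOY_VM_UFFD_RWP)
		return false;

	/* Caller does not care about NUMA hinting faults. */
	if (!(foll_flags & TOY_FOLL_HONOR_NUMA_FAULT))
		return true;

	/* NUMA hinting does not apply to inaccessible (PROT_NONE) VMAs. */
	return !(vm_flags & TOY_VM_ACCESSIBLE);
}
```

The point of putting the RWP test first is that it wins regardless of
FOLL_HONOR_NUMA_FAULT, matching the "read or write" wording above.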
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm.h | 10 +++++++++-
mm/khugepaged.c | 18 +++++++++++-------
mm/rmap.c | 2 +-
3 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9e62946af654..87db714e1364 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4611,11 +4611,19 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
/*
* Indicates whether GUP can follow a PROT_NONE mapped page, or whether
- * a (NUMA hinting) fault is required.
+ * a (NUMA hinting or userfaultfd RWP) fault is required.
*/
static inline bool gup_can_follow_protnone(const struct vm_area_struct *vma,
unsigned int flags)
{
+ /*
+ * VM_UFFD_RWP uses protnone as an access-tracking marker, not for
+ * NUMA hinting. GUP must always take a fault so the access is
+ * delivered to userfaultfd, regardless of FOLL_HONOR_NUMA_FAULT.
+ */
+ if (vma->vm_flags & VM_UFFD_RWP)
+ return false;
+
/*
* If callers don't want to honor NUMA hinting faults, no need to
* determine if we would actually have to trigger a NUMA hinting fault.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index afa218be15de..4f3fedcd75cf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1895,8 +1895,11 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
return SCAN_VMA_CHECK;
- /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
- if (userfaultfd_wp(vma))
+ /*
+ * Keep pmd pgtable while the uffd bit is in use; see comment in
+ * retract_page_tables().
+ */
+ if (userfaultfd_protected(vma))
return SCAN_PTE_UFFD;
folio = filemap_lock_folio(vma->vm_file->f_mapping,
@@ -2109,13 +2112,14 @@ static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
return false;
/*
- * When a vma is registered with uffd-wp, we cannot recycle
+ * When a vma is registered with uffd-wp or RWP, we cannot recycle
* the page table because there may be pte markers installed.
- * Other vmas can still have the same file mapped hugely, but
- * skip this one: it will always be mapped in small page size
- * for uffd-wp registered ranges.
+ * VM_UFFD_RWP ranges similarly rely on per-PTE uffd state
+ * and cannot be recycled to a shared PMD. Other vmas can still
+ * have the same file mapped hugely, but skip this one: it will
+ * always be mapped in small page size for these registrations.
*/
- if (userfaultfd_wp(vma))
+ if (userfaultfd_protected(vma))
return false;
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 546bc1cf9391..9fb733489898 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1965,7 +1965,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
if (pte_unused(pte))
return 1;
- if (userfaultfd_wp(vma))
+ if (userfaultfd_protected(vma))
return 1;
/*
--
2.54.0
* [PATCH v5 12/18] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Add the userspace interface for read-write protection tracking:
  - UFFDIO_REGISTER_MODE_RWP   register a range for RWP tracking
  - UFFD_FEATURE_RWP           capability bit
  - UFFDIO_RWPROTECT           install / remove RWP on a range
Introduce CONFIG_USERFAULTFD_RWP, auto-selected on 64-bit kernels with
ARCH_HAS_PTE_PROTNONE and HAVE_ARCH_USERFAULTFD_WP. The symbol gates
VM_UFFD_RWP (previously aliased to VM_NONE) and the smaps/trace-flag
hooks added in the preparatory patches; without it the UAPI bits added
here have nothing to drive and would be unreachable.
Registration sets VM_UFFD_RWP on the VMA. Combining MODE_WP with
MODE_RWP is rejected because both modes claim the uffd PTE bit.
UFFDIO_RWPROTECT is the bidirectional counterpart of
UFFDIO_WRITEPROTECT:
  - MODE_RWP    change_protection() with MM_CP_UFFD_RWP
                installs PAGE_NONE and sets the uffd bit on
                present PTEs
  - !MODE_RWP   change_protection() with MM_CP_UFFD_RWP_RESOLVE
                restores vma->vm_page_prot and clears the bit
userfaultfd_clear_vma() runs the same resolve pass on unregister so
RWP state cannot outlive the uffd.
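
As a rough illustration of the UFFDIO_RWPROTECT mode word, the sketch
below mirrors the checks that userfaultfd_rwprotect() in this patch
applies before acting: unknown bits are rejected, MODE_RWP combined
with MODE_DONTWAKE is rejected, and waiters are woken only on a resolve
without DONTWAKE. The SKETCH_ macros copy the bit values from the UAPI
hunk; the function names are hypothetical and exist only for this
example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the uffdio_rwprotect additions in this patch. */
#define SKETCH_RWPROTECT_MODE_RWP       ((uint64_t)1 << 0)
#define SKETCH_RWPROTECT_MODE_DONTWAKE  ((uint64_t)1 << 1)

/* Returns 0 if the mode word would be accepted, -EINVAL otherwise. */
static int sketch_check_rwprotect_mode(uint64_t mode)
{
	const uint64_t known = SKETCH_RWPROTECT_MODE_RWP |
			       SKETCH_RWPROTECT_MODE_DONTWAKE;

	if (mode & ~known)
		return -EINVAL;		/* unknown bits set */
	if ((mode & SKETCH_RWPROTECT_MODE_RWP) &&
	    (mode & SKETCH_RWPROTECT_MODE_DONTWAKE))
		return -EINVAL;		/* DONTWAKE is only valid on resolve */
	return 0;
}

/* Waiters are woken only when resolving and DONTWAKE was not requested. */
static bool sketch_wakes_waiters(uint64_t mode)
{
	return !(mode & SKETCH_RWPROTECT_MODE_RWP) &&
	       !(mode & SKETCH_RWPROTECT_MODE_DONTWAKE);
}
```

A VMM-side wrapper could run the same check before issuing the ioctl to
fail fast, though the kernel remains the authority on what it accepts.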
Re-registering a range must not drop a mode that installs per-PTE
markers (WP or RWP); doing so returns -EBUSY. This also closes a
pre-existing window where re-registering without MODE_WP would strand
uffd-wp markers: before, those caused extra write-faults but were
otherwise benign; with RWP preservation in place, a subsequent
mprotect() on a VM_UFFD_RWP VMA would silently promote the stale
markers to RWP.
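
The re-registration rule reduces to one mask expression. The sketch
below models it with made-up flag values (the TOY_ macros are not the
kernel's bit assignments) and hypothetical helper names; the -EBUSY
return follows the description above.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define TOY_VM_UFFD_MISSING  (1u << 0)
#define TOY_VM_UFFD_WP       (1u << 1)
#define TOY_VM_UFFD_RWP      (1u << 2)

/* WP and RWP both claim the uffd PTE bit, so requesting both is rejected. */
static bool toy_modes_conflict(unsigned int new_flags)
{
	return (new_flags & TOY_VM_UFFD_WP) && (new_flags & TOY_VM_UFFD_RWP);
}

/*
 * Re-registering on the same uffd context must keep every marker-installing
 * mode that is already set on the VMA. Returns 0 or -EBUSY.
 */
static int toy_reregister_check(unsigned int cur_flags,
				unsigned int new_flags, bool same_ctx)
{
	const unsigned int marker_modes = TOY_VM_UFFD_WP | TOY_VM_UFFD_RWP;

	if (same_ctx && (cur_flags & marker_modes & ~new_flags))
		return -EBUSY;
	return 0;
}
```

Adding a mode (for example MISSING on top of an existing WP) passes the
check; only dropping WP or RWP is refused, which is why userspace has to
unregister first.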
The feature is not yet advertised. UFFDIO_REGISTER_MODE_RWP,
UFFD_FEATURE_RWP, and _UFFDIO_RWPROTECT are intentionally absent from
UFFD_API_REGISTER_MODES, UFFD_API_FEATURES, and UFFD_API_RANGE_IOCTLS,
so UFFDIO_API masks them out and the register-mode validator rejects
the bit. The follow-up patch adds fault dispatch and exposes the UAPI.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/admin-guide/mm/userfaultfd.rst | 10 +
include/linux/userfaultfd_k.h | 2 +
include/uapi/linux/userfaultfd.h | 19 ++
mm/Kconfig | 9 +
mm/userfaultfd.c | 189 ++++++++++++++++++-
5 files changed, 226 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index e5cc8848dcb3..1e533639fd50 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -131,6 +131,16 @@ userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types (e.g. anonymous memory vs. shmem vs.
hugetlbfs), or all types of intercepted faults.
+.. note::
+
+ Re-registering an already-registered range must not drop any of the
+ modes that install per-PTE markers — currently
+ ``UFFDIO_REGISTER_MODE_WP`` and ``UFFDIO_REGISTER_MODE_RWP``. Doing
+ so would strand markers with no flag to describe them, so the call
+ is rejected with ``-EBUSY``; userspace must issue
+ ``UFFDIO_UNREGISTER`` first. This differs from older kernels, which
+ silently replaced the mode bits on re-registration.
+
Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
memory from the ``userfaultfd`` registered range). This means a userfault
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 564eb2aac321..28fc44733302 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -150,6 +150,8 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at
extern long uffd_wp_range(struct vm_area_struct *vma,
unsigned long start, unsigned long len, bool enable_wp);
+extern int mrwprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
+ unsigned long len, bool enable_rwp);
/* move_pages */
void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2841e4ea8f2c..7b78aa3b5318 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -79,6 +79,7 @@
#define _UFFDIO_WRITEPROTECT (0x06)
#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_POISON (0x08)
+#define _UFFDIO_RWPROTECT (0x09)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -103,6 +104,8 @@
struct uffdio_continue)
#define UFFDIO_POISON _IOWR(UFFDIO, _UFFDIO_POISON, \
struct uffdio_poison)
+#define UFFDIO_RWPROTECT _IOWR(UFFDIO, _UFFDIO_RWPROTECT, \
+ struct uffdio_rwprotect)
/* read() structure */
struct uffd_msg {
@@ -158,6 +161,7 @@ struct uffd_msg {
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
#define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */
#define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */
+#define UFFD_PAGEFAULT_FLAG_RWP (1<<3) /* If reason is VM_UFFD_RWP */
struct uffdio_api {
/* userland asks for an API number and the features to enable */
@@ -230,6 +234,11 @@ struct uffdio_api {
*
* UFFD_FEATURE_MOVE indicates that the kernel supports moving an
* existing page contents from userspace.
+ *
+ * UFFD_FEATURE_RWP indicates that the kernel supports
+ * UFFDIO_REGISTER_MODE_RWP for read-write protection tracking.
+ * Pages are made inaccessible via UFFDIO_RWPROTECT and faults
+ * are delivered when the pages are re-accessed.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -248,6 +257,7 @@ struct uffdio_api {
#define UFFD_FEATURE_POISON (1<<14)
#define UFFD_FEATURE_WP_ASYNC (1<<15)
#define UFFD_FEATURE_MOVE (1<<16)
+#define UFFD_FEATURE_RWP (1<<17)
__u64 features;
__u64 ioctls;
@@ -263,6 +273,7 @@ struct uffdio_register {
#define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
#define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
#define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2)
+#define UFFDIO_REGISTER_MODE_RWP ((__u64)1<<3)
__u64 mode;
/*
@@ -356,6 +367,14 @@ struct uffdio_poison {
__s64 updated;
};
+struct uffdio_rwprotect {
+ struct uffdio_range range;
+ /* !RWP means undo RWP-protection */
+#define UFFDIO_RWPROTECT_MODE_RWP ((__u64)1<<0)
+#define UFFDIO_RWPROTECT_MODE_DONTWAKE ((__u64)1<<1)
+ __u64 mode;
+};
+
struct uffdio_move {
__u64 dst;
__u64 src;
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..fac01bcfc0d1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1333,6 +1333,15 @@ config HAVE_ARCH_USERFAULTFD_MINOR
help
Arch has userfaultfd minor fault support
+config USERFAULTFD_RWP
+ def_bool y
+ depends on 64BIT && ARCH_HAS_PTE_PROTNONE && HAVE_ARCH_USERFAULTFD_WP
+ help
+ Userfaultfd read-write protection (UFFDIO_RWPROTECT) delivers a
+ userfaultfd notification on every access -- read or write -- to a
+ protected range, letting userspace observe the working set of a
+ process.
+
menuconfig USERFAULTFD
bool "Enable userfaultfd() system call"
depends on MMU
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e30878e4e00b..c07e3232a01a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1157,6 +1157,75 @@ static int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
return err;
}
+int mrwprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
+ unsigned long len, bool enable_rwp)
+{
+ struct mm_struct *dst_mm = ctx->mm;
+ unsigned long end = start + len;
+ struct vm_area_struct *dst_vma;
+ unsigned int mm_cp_flags;
+ struct mmu_gather tlb;
+ bool found = false;
+ VMA_ITERATOR(vmi, dst_mm, start);
+
+ VM_WARN_ON_ONCE(start & ~PAGE_MASK);
+ VM_WARN_ON_ONCE(len & ~PAGE_MASK);
+ VM_WARN_ON_ONCE(start + len <= start);
+
+ guard(mmap_read_lock)(dst_mm);
+ guard(rwsem_read)(&ctx->map_changing_lock);
+
+ if (atomic_read(&ctx->mmap_changing))
+ return -EAGAIN;
+
+ if (enable_rwp)
+ mm_cp_flags = MM_CP_UFFD_RWP;
+ else
+ mm_cp_flags = MM_CP_UFFD_RWP_RESOLVE;
+
+ /*
+ * Pre-scan the range: validate every spanned VMA before applying
+ * any change_protection() so a partial failure cannot leave the
+ * process with only a prefix of the range re-protected.
+ */
+ for_each_vma_range(vmi, dst_vma, end) {
+ if (!userfaultfd_rwp(dst_vma))
+ return -ENOENT;
+
+ if (is_vm_hugetlb_page(dst_vma)) {
+ unsigned long page_mask;
+
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ return -EINVAL;
+ }
+ found = true;
+ }
+ if (!found)
+ return -ENOENT;
+
+ vma_iter_set(&vmi, start);
+ tlb_gather_mmu(&tlb, dst_mm);
+ for_each_vma_range(vmi, dst_vma, end) {
+ unsigned long vma_start = max(dst_vma->vm_start, start);
+ unsigned long vma_end = min(dst_vma->vm_end, end);
+ unsigned int flags = mm_cp_flags;
+
+ /*
+ * On resolve, try to upgrade writability per-VMA --
+ * MM_CP_TRY_CHANGE_WRITABLE WARNs in
+ * maybe_change_pte_writable() if the VMA is not VM_WRITE,
+ * and RWP can be registered on PROT_READ-only mappings.
+ */
+ if (!enable_rwp && vma_wants_manual_pte_write_upgrade(dst_vma))
+ flags |= MM_CP_TRY_CHANGE_WRITABLE;
+
+ change_protection(&tlb, dst_vma, vma_start, vma_end, flags);
+ }
+ tlb_finish_mmu(&tlb);
+
+ return 0;
+}
void double_pt_lock(spinlock_t *ptl1,
spinlock_t *ptl2)
@@ -2145,6 +2214,15 @@ static bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
!vma_is_anonymous(vma))
return false;
+ /*
+ * RWP uses protnone as an access-tracking marker. PROT_NONE VMAs
+ * have vm_page_prot == PAGE_NONE, so RWP resolution can't make a
+ * page accessible -- the next access would fault again. Reject up
+ * front instead of letting FOLL_FORCE loop on protnone+uffd PTEs.
+ */
+ if ((vm_flags & VM_UFFD_RWP) && !vma_is_accessible(vma))
+ return false;
+
return ops->can_userfault(vma, vm_flags);
}
@@ -2197,9 +2275,22 @@ static struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
if (start == vma->vm_start && end == vma->vm_end)
give_up_on_oom = true;
- /* Reset ptes for the whole vma range if wr-protected */
- if (userfaultfd_wp(vma))
- uffd_wp_range(vma, start, end - start, false);
+ /* Clear the uffd bit and/or restore protnone PTEs */
+ if (userfaultfd_protected(vma)) {
+ unsigned int mm_cp_flags = 0;
+ struct mmu_gather tlb;
+
+ if (userfaultfd_wp(vma))
+ mm_cp_flags |= MM_CP_UFFD_WP_RESOLVE;
+ if (userfaultfd_rwp(vma))
+ mm_cp_flags |= MM_CP_UFFD_RWP_RESOLVE;
+ if (vma_wants_manual_pte_write_upgrade(vma))
+ mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm);
+ change_protection(&tlb, vma, start, end, mm_cp_flags);
+ tlb_finish_mmu(&tlb);
+ }
ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
&new_vma_flags, NULL_VM_UFFD_CTX,
@@ -2248,6 +2339,14 @@ static int userfaultfd_register_range(struct userfaultfd_ctx *ctx,
vma_test_all_mask(vma, vma_flags))
goto skip;
+ /*
+ * Pre-scan in userfaultfd_register() already rejected mode
+ * switches that would drop VM_UFFD_WP or VM_UFFD_RWP, so a
+ * stray bit here is a bug.
+ */
+ VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx == ctx &&
+ vma->vm_flags & (VM_UFFD_WP | VM_UFFD_RWP) & ~vm_flags);
+
if (vma->vm_start > start)
start = vma->vm_start;
vma_end = min(end, vma->vm_end);
@@ -2514,6 +2613,8 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
if (reason & VM_UFFD_WP)
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+ if (reason & VM_UFFD_RWP)
+ msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_RWP;
if (reason & VM_UFFD_MINOR)
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
if (features & UFFD_FEATURE_THREAD_ID)
@@ -3613,6 +3714,22 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vm_flags |= VM_UFFD_WP;
}
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_RWP) {
+ if (!pgtable_supports_uffd() || VM_UFFD_RWP == VM_NONE)
+ goto out;
+ if (!(ctx->features & UFFD_FEATURE_RWP))
+ goto out;
+ vm_flags |= VM_UFFD_RWP;
+ }
+
+ /*
+ * WP and RWP share the uffd PTE bit and cannot coexist in the same
+ * VMA: the bit would carry ambiguous semantics. Reject the
+ * combination up front.
+ */
+ if ((vm_flags & VM_UFFD_WP) && (vm_flags & VM_UFFD_RWP))
+ goto out;
+
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) {
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
goto out;
@@ -3706,6 +3823,16 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
cur->vm_userfaultfd_ctx.ctx != ctx)
goto out_unlock;
+ /*
+ * Mode switches that drop VM_UFFD_WP or VM_UFFD_RWP would
+ * leave PTE markers without the flag that describes them;
+ * subsequent mprotect() would then promote stale markers
+ * into the other mode. Require an unregister first.
+ */
+ if (cur->vm_userfaultfd_ctx.ctx == ctx &&
+ cur->vm_flags & (VM_UFFD_WP | VM_UFFD_RWP) & ~vm_flags)
+ goto out_unlock;
+
/*
* Note vmas containing huge pages
*/
@@ -3739,6 +3866,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
+ /* RWPROTECT is only supported for RWP ranges */
+ if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_RWP))
+ ioctls_out &= ~((__u64)1 << _UFFDIO_RWPROTECT);
+
/*
* Now that we scanned all vmas we can already tell
* userland which ioctls methods are guaranteed to
@@ -4086,6 +4217,55 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
return ret;
}
+static int userfaultfd_rwprotect(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ int ret;
+ struct uffdio_rwprotect uffdio_rwp;
+ struct userfaultfd_wake_range range;
+ bool mode_rwp, mode_dontwake;
+
+ if (atomic_read(&ctx->mmap_changing))
+ return -EAGAIN;
+
+ if (copy_from_user(&uffdio_rwp, (void __user *)arg,
+ sizeof(uffdio_rwp)))
+ return -EFAULT;
+
+ ret = validate_range(ctx->mm, uffdio_rwp.range.start,
+ uffdio_rwp.range.len);
+ if (ret)
+ return ret;
+
+ if (uffdio_rwp.mode & ~(UFFDIO_RWPROTECT_MODE_DONTWAKE |
+ UFFDIO_RWPROTECT_MODE_RWP))
+ return -EINVAL;
+
+ mode_rwp = uffdio_rwp.mode & UFFDIO_RWPROTECT_MODE_RWP;
+ mode_dontwake = uffdio_rwp.mode & UFFDIO_RWPROTECT_MODE_DONTWAKE;
+
+ if (mode_rwp && mode_dontwake)
+ return -EINVAL;
+
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mrwprotect_range(ctx, uffdio_rwp.range.start,
+ uffdio_rwp.range.len, mode_rwp);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
+ if (ret)
+ return ret;
+
+ if (!mode_rwp && !mode_dontwake) {
+ range.start = uffdio_rwp.range.start;
+ range.len = uffdio_rwp.range.len;
+ wake_userfault(ctx, &range);
+ }
+ return ret;
+}
+
static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
{
__s64 ret;
@@ -4392,6 +4572,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_POISON:
ret = userfaultfd_poison(ctx, arg);
break;
+ case UFFDIO_RWPROTECT:
+ ret = userfaultfd_rwprotect(ctx, arg);
+ break;
}
return ret;
}
--
2.54.0
* [PATCH v5 13/18] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Wire the fault side of read-write protection tracking and turn the
userspace interface on.
An RWP-protected PTE is PAGE_NONE with the uffd bit set. The
PROT_NONE triggers a fault on any access; the uffd bit distinguishes
it from plain mprotect(PROT_NONE) or NUMA hinting.
Fault dispatch, per level:
  PTE      handle_pte_fault()   -> do_uffd_rwp()
  PMD      __handle_mm_fault()  -> do_huge_pmd_uffd_rwp()
  hugetlb  hugetlb_fault()      -> hugetlb_handle_userfault()
The PTE and PMD branches gate on userfaultfd_pte_rwp() /
userfaultfd_huge_pmd_rwp() (VM_UFFD_RWP plus the uffd bit) and fall
through to do_numa_page() / do_huge_pmd_numa_page() otherwise. Each RWP
path delivers a UFFD_PAGEFAULT_FLAG_RWP message through
handle_userfault(); the handler resolves it by issuing UFFDIO_RWPROTECT
with MODE_RWP clear.
userfaultfd_must_wait() and userfaultfd_huge_must_wait() add matching
protnone+uffd waiters so sync-mode fault handlers block correctly.
Expose the UAPI:
  UFFDIO_REGISTER_MODE_RWP  -> UFFD_API_REGISTER_MODES
  UFFD_FEATURE_RWP          -> UFFD_API_FEATURES
  _UFFDIO_RWPROTECT         -> UFFD_API_RANGE_IOCTLS,
                               UFFD_API_RANGE_IOCTLS_BASIC
UFFD_FEATURE_RWP is masked out at UFFDIO_API time when PROT_NONE is
not available, when VM_UFFD_RWP aliases VM_NONE (32-bit), or when
pgtable_supports_uffd() reports the uffd PTE bit unusable at runtime,
so userspace never sees an advertised-but-broken feature.
Works on anonymous, shmem, and hugetlb memory.
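For illustration only, the PTE-level dispatch rule described above can be modelled in plain userspace C. This is a hedged sketch, not kernel code: struct pte_model, struct vma_model, enum fault_route and classify_fault() are hypothetical names, and the booleans merely stand in for pte_present(), pte_protnone(), pte_uffd(), vma_is_accessible() and userfaultfd_rwp().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel predicates named in the patch. */
struct pte_model {
	bool present;   /* models pte_present() */
	bool protnone;  /* models pte_protnone() */
	bool uffd;      /* models pte_uffd() */
};

struct vma_model {
	bool accessible; /* models vma_is_accessible() */
	bool uffd_rwp;   /* models userfaultfd_rwp(), i.e. VM_UFFD_RWP */
};

enum fault_route {
	ROUTE_OTHER,     /* not a protnone fault; handled elsewhere */
	ROUTE_NUMA_HINT, /* would go to do_numa_page() */
	ROUTE_UFFD_RWP,  /* would go to do_uffd_rwp() */
};

/* Decide where a fault on this PTE is routed, per the rule above. */
static enum fault_route classify_fault(struct vma_model vma,
				       struct pte_model pte)
{
	if (!pte.present || !pte.protnone || !vma.accessible)
		return ROUTE_OTHER;
	/* RWP needs both the VMA flag and the uffd marker on the PTE. */
	if (vma.uffd_rwp && pte.uffd)
		return ROUTE_UFFD_RWP;
	/* Plain protnone on an accessible VMA is NUMA hinting. */
	return ROUTE_NUMA_HINT;
}
```

The point of the model is the middle case: a protnone PTE without the uffd bit inside a VM_UFFD_RWP VMA still routes to NUMA hinting, so the two users of PROT_NONE do not interfere.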
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/huge_mm.h | 7 +++++++
include/linux/userfaultfd_k.h | 24 ++++++++++++++++++++++++
include/uapi/linux/userfaultfd.h | 12 ++++++++----
mm/huge_memory.c | 5 +++++
mm/hugetlb.c | 11 +++++++++++
mm/memory.c | 29 +++++++++++++++++++++++++++--
mm/userfaultfd.c | 32 ++++++++++++++++++++++++++++++--
7 files changed, 112 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index edece3e26985..fe48d76957fb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -529,6 +529,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf);
+
vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
extern struct folio *huge_zero_folio;
@@ -716,6 +718,11 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
return NULL;
}
+static inline vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf)
+{
+ return 0;
+}
+
static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
return 0;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 28fc44733302..332fad1560ec 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -233,6 +233,18 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
return userfaultfd_wp(vma) && pmd_uffd(pmd);
}
+static inline bool userfaultfd_pte_rwp(struct vm_area_struct *vma,
+ pte_t pte)
+{
+ return userfaultfd_rwp(vma) && pte_uffd(pte);
+}
+
+static inline bool userfaultfd_huge_pmd_rwp(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+ return userfaultfd_rwp(vma) && pmd_uffd(pmd);
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return vma->vm_flags & __VM_UFFD_FLAGS;
@@ -363,6 +375,18 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
return false;
}
+static inline bool userfaultfd_pte_rwp(struct vm_area_struct *vma,
+ pte_t pte)
+{
+ return false;
+}
+
+static inline bool userfaultfd_huge_pmd_rwp(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+ return false;
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return false;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7b78aa3b5318..d803e76d47ad 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -25,7 +25,8 @@
#define UFFD_API ((__u64)0xAA)
#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
UFFDIO_REGISTER_MODE_WP | \
- UFFDIO_REGISTER_MODE_MINOR)
+ UFFDIO_REGISTER_MODE_MINOR | \
+ UFFDIO_REGISTER_MODE_RWP)
#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
UFFD_FEATURE_EVENT_FORK | \
UFFD_FEATURE_EVENT_REMAP | \
@@ -42,7 +43,8 @@
UFFD_FEATURE_WP_UNPOPULATED | \
UFFD_FEATURE_POISON | \
UFFD_FEATURE_WP_ASYNC | \
- UFFD_FEATURE_MOVE)
+ UFFD_FEATURE_MOVE | \
+ UFFD_FEATURE_RWP)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -54,13 +56,15 @@
(__u64)1 << _UFFDIO_MOVE | \
(__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE | \
- (__u64)1 << _UFFDIO_POISON)
+ (__u64)1 << _UFFDIO_POISON | \
+ (__u64)1 << _UFFDIO_RWPROTECT)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE | \
- (__u64)1 << _UFFDIO_POISON)
+ (__u64)1 << _UFFDIO_POISON | \
+ (__u64)1 << _UFFDIO_RWPROTECT)
/*
* Valid ioctl command number range with this API is from 0x00 to
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6417d883d2e4..72cb44332004 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2289,6 +2289,11 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
return pmd_dirty(pmd);
}
+vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf)
+{
+ return handle_userfault(vmf, VM_UFFD_RWP);
+}
+
/* NUMA hinting page fault entry point for trans huge pmds */
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0d8d39cd8888..d4da39d698b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6062,6 +6062,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mutex;
}
+ /*
+ * Protnone hugetlb PTEs with the uffd bit are used by
+ * userfaultfd RWP for access tracking. Plain PROT_NONE (without the
+ * marker) is not an RWP fault and is not expected on hugetlb (no
+ * NUMA hinting), so let normal hugetlb fault handling proceed.
+ */
+ if (pte_protnone(vmf.orig_pte) && vma_is_accessible(vma) &&
+ userfaultfd_rwp(vma) && huge_pte_uffd(vmf.orig_pte)) {
+ return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP);
+ }
+
/*
* If we are going to COW/unshare the mapping later, we examine the
* pending reservations for this page now. This will ensure that any
diff --git a/mm/memory.c b/mm/memory.c
index 06473285c0dc..111fdae14120 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6122,6 +6122,14 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
if (!pte_present(ptent) || !pte_protnone(ptent))
continue;
+ /*
+ * RWP-armed PTEs are also protnone but carry _PAGE_UFFD as a
+ * marker. Leave them alone -- rewriting to vm_page_prot would
+ * stop the RWP trap.
+ */
+ if (pte_uffd(ptent))
+ continue;
+
if (pfn_folio(pte_pfn(ptent)) != folio)
continue;
@@ -6137,6 +6145,12 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
}
}
+static vm_fault_t do_uffd_rwp(struct vm_fault *vmf)
+{
+ pte_unmap(vmf->pte);
+ return handle_userfault(vmf, VM_UFFD_RWP);
+}
+
static vm_fault_t do_numa_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -6412,8 +6426,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
- if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+ /*
+ * RWP-protected PTEs are protnone plus the uffd bit. On a
+ * VM_UFFD_RWP VMA, a protnone PTE without the uffd bit is
+ * NUMA hinting and must still fall through to do_numa_page().
+ */
+ if (userfaultfd_pte_rwp(vmf->vma, vmf->orig_pte))
+ return do_uffd_rwp(vmf);
return do_numa_page(vmf);
+ }
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
@@ -6527,8 +6549,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
- if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+ if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+ if (userfaultfd_huge_pmd_rwp(vma, vmf.orig_pmd))
+ return do_huge_pmd_uffd_rwp(&vmf);
return do_huge_pmd_numa_page(&vmf);
+ }
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
!pmd_write(vmf.orig_pmd)) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c07e3232a01a..db3707b9d977 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2668,6 +2668,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
*/
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
return true;
+ /*
+ * PTE is still RW-protected (protnone with uffd bit), wait for
+ * resolution. Plain PROT_NONE without the marker is not an RWP fault.
+ */
+ if (pte_protnone(pte) && huge_pte_uffd(pte) && (reason & VM_UFFD_RWP))
+ return true;
return false;
}
@@ -2728,8 +2734,14 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pmd_present(_pmd))
return false;
- if (pmd_trans_huge(_pmd))
- return !pmd_write(_pmd) && (reason & VM_UFFD_WP);
+ if (pmd_trans_huge(_pmd)) {
+ if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+ return true;
+ if (pmd_protnone(_pmd) && pmd_uffd(_pmd) &&
+ (reason & VM_UFFD_RWP))
+ return true;
+ return false;
+ }
pte = pte_offset_map(pmd, address);
if (!pte)
@@ -2765,6 +2777,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (!pte_write(ptent) && (reason & VM_UFFD_WP))
goto out;
+ /*
+ * PTE is still RW-protected (protnone with uffd bit), wait for
+ * userspace to resolve. Plain PROT_NONE without the marker is not
+ * an RWP fault.
+ */
+ if (pte_protnone(ptent) && pte_uffd(ptent) && (reason & VM_UFFD_RWP))
+ goto out;
ret = false;
out:
@@ -4506,6 +4525,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
}
+ /*
+ * RWP needs both PROT_NONE support and the uffd-wp PTE bit. The
+ * VM_UFFD_RWP check covers compile-time unavailability; the
+ * pgtable_supports_uffd() check covers runtime (e.g. riscv
+ * without the SVRSW60T59B extension) where the PTE bit is declared
+ * but not actually usable.
+ */
+ if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
+ uffdio_api.features &= ~UFFD_FEATURE_RWP;
ret = -EINVAL;
if (features & ~uffdio_api.features)
--
2.54.0
* [PATCH v5 14/18] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
PAGEMAP_SCAN already reports PAGE_IS_WRITTEN from the inverted uffd
PTE bit, targeting the UFFDIO_WRITEPROTECT workflow. UFFDIO_RWPROTECT
reuses the same PTE bit as a marker for read-write protection, but
"has been written" and "has been accessed" are distinct semantic
signals — they happen to share one PTE bit today only because the two
implementations share infrastructure.
Give RWP its own pagemap category so the UAPI does not conflate them:
  PAGE_IS_WRITTEN   reported on VM_UFFD_WP VMAs,  !pte_uffd(pte)
  PAGE_IS_ACCESSED  reported on VM_UFFD_RWP VMAs, !pte_uffd(pte)
Both still read the same PTE bit today, but each is scoped to the VMA
whose registered mode makes the bit meaningful. If a future
implementation moves RWP to a separate PTE bit, only PAGE_IS_ACCESSED
switches over.
This is a UAPI narrowing. Outside VM_UFFD_WP VMAs the uffd bit is
always clear, so PAGEMAP_SCAN used to flag PAGE_IS_WRITTEN on every
present PTE there — a meaningless duplicate of PAGE_IS_PRESENT. Now
PAGE_IS_WRITTEN fires only inside VM_UFFD_WP VMAs.
pagemap_hugetlb_category() now takes the vma like its PTE/PMD peers.
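As a rough illustration of the scoping rule for a present PTE, the following userspace model mirrors the decision. It is a sketch under assumptions, not the kernel function: present_pte_categories() and the SK_-prefixed constants are hypothetical names, with bit values copied from include/uapi/linux/fs.h as extended by this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions mirror the PAGE_IS_* UAPI flags; SK_ names are local. */
#define SK_PAGE_IS_WRITTEN   (UINT64_C(1) << 1)
#define SK_PAGE_IS_PRESENT   (UINT64_C(1) << 3)
#define SK_PAGE_IS_ACCESSED  (UINT64_C(1) << 9)

/*
 * Categories for a present PTE. vma_wp / vma_rwp model userfaultfd_wp()
 * and userfaultfd_rwp(); pte_uffd models pte_uffd().
 */
static uint64_t present_pte_categories(bool vma_wp, bool vma_rwp,
				       bool pte_uffd)
{
	uint64_t categories = SK_PAGE_IS_PRESENT;

	if (!pte_uffd) {
		/* A clear bit only means something in a registered VMA. */
		if (vma_wp)
			categories |= SK_PAGE_IS_WRITTEN;
		if (vma_rwp)
			categories |= SK_PAGE_IS_ACCESSED;
	}
	return categories;
}
```

With neither mode registered the function returns only the present bit, which is the narrowing described above.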
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/admin-guide/mm/pagemap.rst | 13 +++-
fs/proc/task_mmu.c | 75 ++++++++++++++++++------
include/uapi/linux/fs.h | 1 +
tools/include/uapi/linux/fs.h | 1 +
4 files changed, 69 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index c57e61b5d8aa..ffa690a171c8 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -19,8 +19,11 @@ There are four components to pagemap:
* Bit 55 pte is soft-dirty (see
Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2)
- * Bit 57 pte is uffd-wp write-protected (since 5.13) (see
- Documentation/admin-guide/mm/userfaultfd.rst)
+ * Bit 57 pte is tracked by userfaultfd (since 5.13) — in a
+ ``VM_UFFD_WP`` VMA this indicates a write-protected PTE; in a
+ ``VM_UFFD_RWP`` VMA it indicates an RWP-protected PTE. WP and
+ RWP are mutually exclusive per VMA, so the meaning is
+ unambiguous. See Documentation/admin-guide/mm/userfaultfd.rst.
* Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page)
* Bits 59-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
@@ -244,7 +247,8 @@ in this IOCTL:
Following flags about pages are currently supported:
- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
-- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
+- ``PAGE_IS_WRITTEN`` - Page in a ``UFFDIO_REGISTER_MODE_WP`` VMA has been
+ written to since it was write-protected. Only reported inside such VMAs.
- ``PAGE_IS_FILE`` - Page is file backed
- ``PAGE_IS_PRESENT`` - Page is present in the memory
- ``PAGE_IS_SWAPPED`` - Page is in swapped
@@ -252,6 +256,9 @@ Following flags about pages are currently supported:
- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
- ``PAGE_IS_GUARD`` - Page is a part of a guard region
+- ``PAGE_IS_ACCESSED`` - Page in a ``UFFDIO_REGISTER_MODE_RWP`` VMA has been
+ accessed since RWP was applied. Only reported inside such VMAs. See
+ Documentation/admin-guide/mm/userfaultfd.rst for the RWP workflow.
The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 974c5f4aa533..0db29c3a8639 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2284,7 +2284,7 @@ static const struct mm_walk_ops pagemap_ops = {
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped
- * Bit 57 pte is uffd-wp write-protected
+ * Bit 57 pte is tracked by userfaultfd (uffd-wp or RWP)
* Bit 58 pte is a guard region
* Bits 59-60 zero
* Bit 61 page is file-page or shared-anon
@@ -2419,7 +2419,7 @@ static int pagemap_release(struct inode *inode, struct file *file)
PAGE_IS_FILE | PAGE_IS_PRESENT | \
PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY | \
- PAGE_IS_GUARD)
+ PAGE_IS_GUARD | PAGE_IS_ACCESSED)
#define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
struct pagemap_scan_private {
@@ -2444,8 +2444,12 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_PRESENT;
- if (!pte_uffd(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!pte_uffd(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (p->masks_of_interest & PAGE_IS_FILE) {
page = vm_normal_page(vma, addr, pte);
@@ -2462,8 +2466,12 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
categories = PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_any(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!pte_swp_uffd_any(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
entry = softleaf_from_pte(pte);
if (softleaf_is_guard_marker(entry))
@@ -2512,8 +2520,12 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
struct page *page;
categories |= PAGE_IS_PRESENT;
- if (!pmd_uffd(pmd))
- categories |= PAGE_IS_WRITTEN;
+ if (!pmd_uffd(pmd)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (p->masks_of_interest & PAGE_IS_FILE) {
page = vm_normal_page_pmd(vma, addr, pmd);
@@ -2527,8 +2539,12 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
categories |= PAGE_IS_SOFT_DIRTY;
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pmd_swp_uffd(pmd))
- categories |= PAGE_IS_WRITTEN;
+ if (!pmd_swp_uffd(pmd)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (pmd_swp_soft_dirty(pmd))
categories |= PAGE_IS_SOFT_DIRTY;
@@ -2561,7 +2577,8 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#ifdef CONFIG_HUGETLB_PAGE
-static unsigned long pagemap_hugetlb_category(pte_t pte)
+static unsigned long pagemap_hugetlb_category(struct vm_area_struct *vma,
+ pte_t pte)
{
unsigned long categories = PAGE_IS_HUGE;
@@ -2576,8 +2593,12 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
if (pte_present(pte)) {
categories |= PAGE_IS_PRESENT;
- if (!huge_pte_uffd(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!huge_pte_uffd(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (!PageAnon(pte_page(pte)))
categories |= PAGE_IS_FILE;
if (is_zero_pfn(pte_pfn(pte)))
@@ -2587,8 +2608,12 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
} else {
categories |= PAGE_IS_SWAPPED;
- if (!pte_swp_uffd_any(pte))
- categories |= PAGE_IS_WRITTEN;
+ if (!pte_swp_uffd_any(pte)) {
+ if (userfaultfd_wp(vma))
+ categories |= PAGE_IS_WRITTEN;
+ if (userfaultfd_rwp(vma))
+ categories |= PAGE_IS_ACCESSED;
+ }
if (pte_swp_soft_dirty(pte))
categories |= PAGE_IS_SOFT_DIRTY;
}
@@ -2677,6 +2702,18 @@ static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
bool wp_allowed = userfaultfd_wp_async(vma) &&
userfaultfd_wp_use_markers(vma);
+ /*
+ * PM_SCAN_WP_MATCHING is the atomic read-and-reset flavour of the
+ * scan and is implemented for the WP marker only. Silently skip
+ * VM_UFFD_RWP VMAs, matching the convention used below for VMAs
+ * that lack the WP-async capability. Returning -EINVAL here would
+ * abort the walk after preceding VMAs had already been mutated,
+ * destroying the atomic read-and-reset guarantee. Re-arming RWP
+ * is done with UFFDIO_RWPROTECT(MODE_RWP).
+ */
+ if (userfaultfd_rwp(vma) && (p->arg.flags & PM_SCAN_WP_MATCHING))
+ return 1;
+
if (!wp_allowed) {
/* User requested explicit failure over wp-async capability */
if (p->arg.flags & PM_SCAN_CHECK_WPASYNC)
@@ -2864,7 +2901,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
goto flush_and_return;
}
- if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
+ if (userfaultfd_wp(vma) && !p->arg.category_anyof_mask &&
+ !p->arg.category_inverted &&
p->arg.category_mask == PAGE_IS_WRITTEN &&
p->arg.return_mask == PAGE_IS_WRITTEN) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
@@ -2939,7 +2977,8 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
/* Go the short route when not write-protecting pages. */
pte = huge_ptep_get(walk->mm, start, ptep);
- categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+ categories = p->cur_vma_category |
+ pagemap_hugetlb_category(vma, pte);
if (!pagemap_scan_is_interesting_page(categories, p))
return 0;
@@ -2951,7 +2990,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
pte = huge_ptep_get(walk->mm, start, ptep);
- categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+ categories = p->cur_vma_category | pagemap_hugetlb_category(vma, pte);
if (!pagemap_scan_is_interesting_page(categories, p))
goto out_unlock;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 13f71202845e..c4aeaa0c31c7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -455,6 +455,7 @@ typedef int __bitwise __kernel_rwf_t;
#define PAGE_IS_HUGE (1 << 6)
#define PAGE_IS_SOFT_DIRTY (1 << 7)
#define PAGE_IS_GUARD (1 << 8)
+#define PAGE_IS_ACCESSED (1 << 9)
/*
* struct page_region - Page region with flags
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index 24ddf7bc4f25..f0a26309b6d5 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -364,6 +364,7 @@ typedef int __bitwise __kernel_rwf_t;
#define PAGE_IS_HUGE (1 << 6)
#define PAGE_IS_SOFT_DIRTY (1 << 7)
#define PAGE_IS_GUARD (1 << 8)
+#define PAGE_IS_ACCESSED (1 << 9)
/*
* struct page_region - Page region with flags
--
2.54.0
* [PATCH v5 15/18] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Sync RWP delivers a message and blocks the faulting thread until the
handler resolves the fault. For working-set tracking the VMM does not
need the message: it just needs to know, at scan time, which pages
were touched. Async RWP serves that use case — the kernel restores
access in-place and the faulting thread continues without blocking.
The VMM reconstructs the access pattern after the fact via
PAGEMAP_SCAN: pages whose uffd bit is still set (inverted
PAGE_IS_ACCESSED) were not re-accessed since the last RWP cycle.
Note that async resolution upgrades writable private anon PTEs via
pte_mkwrite() when can_change_pte_writable() allows, mirroring
do_numa_page(). Without the upgrade, a write to an RWP'd writable page
would take a second (COW) fault right after the RWP fault was resolved.
UFFD_FEATURE_RWP_ASYNC requires UFFD_FEATURE_RWP.
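To make the "inverted PAGE_IS_ACCESSED" selection concrete, here is a small userspace model of the PAGEMAP_SCAN matching rule as I understand it. It is a sketch under assumptions: struct scan_filter is a hypothetical subset of struct pm_scan_arg, page_matches() and cold_page_filter() are illustrative names, and the SK_ constants copy the PAGE_IS_* bit values used by this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions mirror the PAGE_IS_* UAPI flags; SK_ names are local. */
#define SK_PAGE_IS_PRESENT   (UINT64_C(1) << 3)
#define SK_PAGE_IS_ACCESSED  (UINT64_C(1) << 9)

/* Hypothetical subset of struct pm_scan_arg: only the matching fields. */
struct scan_filter {
	uint64_t category_inverted;
	uint64_t category_mask;
	uint64_t category_anyof_mask;
};

/* Model of the filter: invert selected bits, then require the mask. */
static bool page_matches(uint64_t categories, const struct scan_filter *f)
{
	categories ^= f->category_inverted;
	if ((categories & f->category_mask) != f->category_mask)
		return false;
	if (f->category_anyof_mask &&
	    !(categories & f->category_anyof_mask))
		return false;
	return true;
}

/* Filter for cold pages: present, and NOT accessed since the last arm. */
static struct scan_filter cold_page_filter(void)
{
	struct scan_filter f = {
		.category_inverted = SK_PAGE_IS_ACCESSED,
		.category_mask = SK_PAGE_IS_PRESENT | SK_PAGE_IS_ACCESSED,
		.category_anyof_mask = 0,
	};
	return f;
}
```

A VMM would presumably pass the same inverted and mask values in a real struct pm_scan_arg and then re-arm the range with UFFDIO_RWPROTECT; that part is left out here because it depends on headers from this series.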
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/userfaultfd_k.h | 6 ++++++
include/uapi/linux/userfaultfd.h | 11 ++++++++++-
mm/huge_memory.c | 25 ++++++++++++++++++++++++-
mm/hugetlb.c | 32 +++++++++++++++++++++++++++++++-
mm/memory.c | 27 +++++++++++++++++++++++++--
mm/userfaultfd.c | 19 ++++++++++++++++++-
6 files changed, 114 insertions(+), 6 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 332fad1560ec..87386b79049e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -278,6 +278,7 @@ extern void userfaultfd_unmap_complete(struct mm_struct *mm,
struct list_head *uf);
extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
+extern bool userfaultfd_rwp_async(struct vm_area_struct *vma);
static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
{
@@ -456,6 +457,11 @@ static inline bool userfaultfd_wp_async(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_rwp_async(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
{
return false;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index d803e76d47ad..c10f08f8a618 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -44,7 +44,8 @@
UFFD_FEATURE_POISON | \
UFFD_FEATURE_WP_ASYNC | \
UFFD_FEATURE_MOVE | \
- UFFD_FEATURE_RWP)
+ UFFD_FEATURE_RWP | \
+ UFFD_FEATURE_RWP_ASYNC)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -243,6 +244,13 @@ struct uffdio_api {
* UFFDIO_REGISTER_MODE_RWP for read-write protection tracking.
* Pages are made inaccessible via UFFDIO_RWPROTECT and faults
* are delivered when the pages are re-accessed.
+ *
+ * UFFD_FEATURE_RWP_ASYNC indicates asynchronous mode for
+ * UFFDIO_REGISTER_MODE_RWP. When set, faults on read-write
+ * protected pages are auto-resolved by the kernel (PTE
+ * permissions restored immediately) without delivering a message
+ * to the userfaultfd handler. Use PAGEMAP_SCAN with inverted
+ * PAGE_IS_ACCESSED to find pages that were not re-accessed.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -262,6 +270,7 @@ struct uffdio_api {
#define UFFD_FEATURE_WP_ASYNC (1<<15)
#define UFFD_FEATURE_MOVE (1<<16)
#define UFFD_FEATURE_RWP (1<<17)
+#define UFFD_FEATURE_RWP_ASYNC (1<<18)
__u64 features;
__u64 ioctls;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72cb44332004..8f120452d995 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2291,7 +2291,30 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf)
{
- return handle_userfault(vmf, VM_UFFD_RWP);
+ struct vm_area_struct *vma = vmf->vma;
+ pmd_t pmd;
+
+ if (!userfaultfd_rwp_async(vma))
+ return handle_userfault(vmf, VM_UFFD_RWP);
+
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+ pmd = pmd_modify(vmf->orig_pmd, vma->vm_page_prot);
+ /* pmd_modify() preserves _PAGE_UFFD; drop it on resolution */
+ pmd = pmd_clear_uffd(pmd);
+ pmd = pmd_mkyoung(pmd);
+ if (!pmd_write(pmd) &&
+ vma_wants_manual_pte_write_upgrade(vma) &&
+ can_change_pmd_writable(vma, vmf->address, pmd))
+ pmd = pmd_mkwrite(pmd, vma);
+ set_pmd_at(vma->vm_mm, vmf->address & HPAGE_PMD_MASK,
+ vmf->pmd, pmd);
+ update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+ spin_unlock(vmf->ptl);
+ return 0;
}
/* NUMA hinting page fault entry point for trans huge pmds */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d4da39d698b8..9da52d95b3fb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6070,7 +6070,37 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (pte_protnone(vmf.orig_pte) && vma_is_accessible(vma) &&
userfaultfd_rwp(vma) && huge_pte_uffd(vmf.orig_pte)) {
- return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP);
+ spinlock_t *ptl;
+ pte_t pte;
+
+ /* Sync: drop hugetlb locks before blocking in handle_userfault() */
+ if (!userfaultfd_rwp_async(vma))
+ return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP);
+
+ ptl = huge_pte_lock(h, mm, vmf.pte);
+ pte = huge_ptep_get(mm, vmf.address, vmf.pte);
+ if (pte_protnone(pte) && huge_pte_uffd(pte)) {
+ unsigned int shift = huge_page_shift(h);
+
+ pte = huge_pte_modify(pte, vma->vm_page_prot);
+ pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ /* huge_pte_modify() preserves _PAGE_UFFD; drop it on resolution */
+ pte = huge_pte_clear_uffd(pte);
+ pte = pte_mkyoung(pte);
+ /*
+ * Unlike do_uffd_rwp(), do not upgrade to writable
+ * here. Hugetlb lacks a can_change_huge_pte_writable()
+ * equivalent, so a write access will take a separate
+ * COW fault — acceptable for the rare private hugetlb
+ * case.
+ */
+ set_huge_pte_at(mm, vmf.address, vmf.pte, pte,
+ huge_page_size(h));
+ update_mmu_cache(vma, vmf.address, vmf.pte);
+ }
+ spin_unlock(ptl);
+ ret = 0;
+ goto out_mutex;
}
/*
diff --git a/mm/memory.c b/mm/memory.c
index 111fdae14120..5f56dcc2f265 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6147,8 +6147,31 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
static vm_fault_t do_uffd_rwp(struct vm_fault *vmf)
{
- pte_unmap(vmf->pte);
- return handle_userfault(vmf, VM_UFFD_RWP);
+ pte_t pte;
+
+ if (!userfaultfd_rwp_async(vmf->vma)) {
+ /* Sync mode: unmap PTE and deliver to userfaultfd handler */
+ pte_unmap(vmf->pte);
+ return handle_userfault(vmf, VM_UFFD_RWP);
+ }
+
+ spin_lock(vmf->ptl);
+ if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+ pte = pte_modify(vmf->orig_pte, vmf->vma->vm_page_prot);
+ /* pte_modify() preserves _PAGE_UFFD; drop it on resolution */
+ pte = pte_clear_uffd(pte);
+ pte = pte_mkyoung(pte);
+ if (!pte_write(pte) &&
+ vma_wants_manual_pte_write_upgrade(vmf->vma) &&
+ can_change_pte_writable(vmf->vma, vmf->address, pte))
+ pte = pte_mkwrite(pte, vmf->vma);
+ set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);
+ update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}
static vm_fault_t do_numa_page(struct vm_fault *vmf)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index db3707b9d977..f40bf473a6f6 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2487,6 +2487,11 @@ static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
}
+static bool userfaultfd_rwp_async_ctx(struct userfaultfd_ctx *ctx)
+{
+ return ctx && (ctx->features & UFFD_FEATURE_RWP_ASYNC);
+}
+
/*
* Whether WP_UNPOPULATED is enabled on the uffd context. It is only
* meaningful when userfaultfd_wp()==true on the vma and when it's
@@ -4408,6 +4413,11 @@ bool userfaultfd_wp_async(struct vm_area_struct *vma)
return userfaultfd_wp_async_ctx(vma->vm_userfaultfd_ctx.ctx);
}
+bool userfaultfd_rwp_async(struct vm_area_struct *vma)
+{
+ return userfaultfd_rwp_async_ctx(vma->vm_userfaultfd_ctx.ctx);
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -4511,6 +4521,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
if (features & UFFD_FEATURE_WP_ASYNC)
features |= UFFD_FEATURE_WP_UNPOPULATED;
+ ret = -EINVAL;
+ /* RWP_ASYNC requires RWP */
+ if ((features & UFFD_FEATURE_RWP_ASYNC) &&
+ !(features & UFFD_FEATURE_RWP))
+ goto err_out;
+
/* report all available features and ioctls to userland */
uffdio_api.features = UFFD_API_FEATURES;
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
@@ -4533,7 +4549,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
* but not actually usable.
*/
if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
- uffdio_api.features &= ~UFFD_FEATURE_RWP;
+ uffdio_api.features &=
+ ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC);
ret = -EINVAL;
if (features & ~uffdio_api.features)
--
2.54.0
* [PATCH v5 16/18] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Add an ioctl to toggle async mode at runtime without re-registering
the userfaultfd. This allows a VMM to switch between sync and async
RWP modes on-the-fly -- for example, starting in async mode for
working set scanning, then switching to sync mode to intercept faults
during page eviction.
UFFDIO_SET_MODE takes an enable/disable bitmask of UFFD_FEATURE_*
flags. Only UFFD_FEATURE_RWP_ASYNC is toggleable today; the ioctl
rejects any other bit with -EINVAL. Enabling RWP_ASYNC also requires
RWP to have been negotiated at UFFDIO_API time, mirroring the
UFFDIO_API invariant.
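
The argument checks described above can be modelled in plain C. This is
only a userspace sketch, not the kernel code: the FEATURE_* bit values
are placeholders chosen for illustration (the real UFFD_FEATURE_* values
live in the uapi header), set_mode_check() is a hypothetical helper
name, and the per-arch availability mask applied by the real ioctl is
left out.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder bit values for illustration only, not the uapi values. */
#define FEATURE_RWP		(UINT64_C(1) << 0)
#define FEATURE_RWP_ASYNC	(UINT64_C(1) << 1)
#define FEATURE_TOGGLEABLE	FEATURE_RWP_ASYNC

/*
 * Model of the UFFDIO_SET_MODE argument checks (hypothetical helper).
 * On success the updated feature word is stored in *features and 0 is
 * returned; on failure -EINVAL is returned and *features is untouched.
 */
static int set_mode_check(uint64_t *features, uint64_t enable,
			  uint64_t disable)
{
	/* A bit set in both masks is contradictory. */
	if (enable & disable)
		return -EINVAL;

	/* Only toggleable bits may appear in either mask. */
	if ((enable | disable) & ~FEATURE_TOGGLEABLE)
		return -EINVAL;

	/* RWP_ASYNC can only be enabled if RWP was negotiated. */
	if ((enable & FEATURE_RWP_ASYNC) && !(*features & FEATURE_RWP))
		return -EINVAL;

	*features = (*features | enable) & ~disable;
	return 0;
}
```

Disabling RWP_ASYNC is always accepted by this model, including on a
context that never had it set; only the enable direction depends on RWP.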
Fault-path readers of ctx->features run under mmap_read_lock or a
per-VMA lock; the RMW takes mmap_write_lock and calls
vma_start_write() on every UFFD-armed VMA, so those readers are fully
excluded. userfaultfd_show_fdinfo(), however, reads ctx->features
without any lock, so the RMW is written as a single WRITE_ONCE and
fdinfo reads it with READ_ONCE. That keeps the lockless observer from
seeing a mid-RMW intermediate and removes the audit burden when new
toggleable bits are added later.
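
The single-store idea can be sketched in userspace C as well. The
assumptions here are loud ones: C11 relaxed atomics stand in for
WRITE_ONCE/READ_ONCE (they are similar in intent but not identical), the
writer is assumed to be serialised by some external lock, and the names
update_features() and read_features() are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t features;

/*
 * Writer side (assumed to run under an external lock): compute the new
 * value in a local first, then publish it with exactly one store, so a
 * lockless reader sees either the old word or the new word and never a
 * half-applied enable/disable step.
 */
static void update_features(uint32_t enable, uint32_t disable)
{
	uint32_t cur = atomic_load_explicit(&features, memory_order_relaxed);
	uint32_t next = (cur | enable) & ~disable;

	atomic_store_explicit(&features, next, memory_order_relaxed);
}

/* Lockless reader side: one load of the whole feature word. */
static uint32_t read_features(void)
{
	return atomic_load_explicit(&features, memory_order_relaxed);
}
```

Writing the update as two separate stores (set the enable bits, then
clear the disable bits) would expose the intermediate word to such a
reader, which is what the single store avoids.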
When switching to async, pending sync waiters are woken so they retry
and auto-resolve under the new mode.
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/uapi/linux/userfaultfd.h | 14 +++
mm/userfaultfd.c | 150 +++++++++++++++++++++++++------
2 files changed, 136 insertions(+), 28 deletions(-)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c10f08f8a618..cea11aad6b54 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -49,6 +49,7 @@
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
+ (__u64)1 << _UFFDIO_SET_MODE | \
(__u64)1 << _UFFDIO_API)
#define UFFD_API_RANGE_IOCTLS \
((__u64)1 << _UFFDIO_WAKE | \
@@ -85,6 +86,7 @@
#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_POISON (0x08)
#define _UFFDIO_RWPROTECT (0x09)
+#define _UFFDIO_SET_MODE (0x0A)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -111,6 +113,8 @@
struct uffdio_poison)
#define UFFDIO_RWPROTECT _IOWR(UFFDIO, _UFFDIO_RWPROTECT, \
struct uffdio_rwprotect)
+#define UFFDIO_SET_MODE _IOW(UFFDIO, _UFFDIO_SET_MODE, \
+ struct uffdio_set_mode)
/* read() structure */
struct uffd_msg {
@@ -406,6 +410,16 @@ struct uffdio_move {
__s64 move;
};
+struct uffdio_set_mode {
+ /*
+ * Toggle async mode for features at runtime.
+ * Supported: UFFD_FEATURE_RWP_ASYNC.
+ * Setting a bit in both enable and disable is invalid.
+ */
+ __u64 enable;
+ __u64 disable;
+};
+
/*
* Flags for the userfaultfd(2) system call itself.
*/
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f40bf473a6f6..f172ec14a6c8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2477,19 +2477,29 @@ struct userfaultfd_wake_range {
/* internal indication that UFFD_API ioctl was successfully executed */
#define UFFD_FEATURE_INITIALIZED (1u << 31)
+/*
+ * UFFDIO_SET_MODE updates ctx->features under mmap_write_lock with
+ * WRITE_ONCE; readers that run outside mmap_read_lock or the per-VMA
+ * lock (poll/read_iter/ioctl, fdinfo) must pair with READ_ONCE.
+ */
+static unsigned int userfaultfd_features(struct userfaultfd_ctx *ctx)
+{
+ return READ_ONCE(ctx->features);
+}
+
static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
{
- return ctx->features & UFFD_FEATURE_INITIALIZED;
+ return userfaultfd_features(ctx) & UFFD_FEATURE_INITIALIZED;
}
static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
{
- return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
+ return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_WP_ASYNC);
}
static bool userfaultfd_rwp_async_ctx(struct userfaultfd_ctx *ctx)
{
- return ctx && (ctx->features & UFFD_FEATURE_RWP_ASYNC);
+ return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_RWP_ASYNC);
}
/*
@@ -2504,7 +2514,7 @@ bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
if (!ctx)
return false;
- return ctx->features & UFFD_FEATURE_WP_UNPOPULATED;
+ return userfaultfd_features(ctx) & UFFD_FEATURE_WP_UNPOPULATED;
}
static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
@@ -4290,6 +4300,109 @@ static int userfaultfd_rwprotect(struct userfaultfd_ctx *ctx,
return ret;
}
+/* Subset of UFFD_API_FEATURES actually supported by this kernel/arch */
+static __u64 uffd_api_available_features(void)
+{
+ __u64 f = UFFD_API_FEATURES;
+
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_MINOR))
+ f &= ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
+ if (!pgtable_supports_uffd())
+ f &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ if (!uffd_supports_wp_marker())
+ f &= ~(UFFD_FEATURE_WP_HUGETLBFS_SHMEM |
+ UFFD_FEATURE_WP_UNPOPULATED |
+ UFFD_FEATURE_WP_ASYNC);
+ /*
+ * RWP needs both PROT_NONE support and the uffd PTE bit. The
+ * VM_UFFD_RWP check covers compile-time unavailability; the
+ * pgtable_supports_uffd() check covers runtime (e.g. riscv
+ * without the SVRSW60T59B extension) where the PTE bit is declared
+ * but not actually usable.
+ */
+ if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
+ f &= ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC);
+ return f;
+}
+
+/* Async features that can be toggled at runtime via UFFDIO_SET_MODE */
+#define UFFD_FEATURE_TOGGLEABLE UFFD_FEATURE_RWP_ASYNC
+
+static int userfaultfd_set_mode(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ struct uffdio_set_mode mode;
+ struct mm_struct *mm = ctx->mm;
+
+ if (copy_from_user(&mode, (void __user *)arg, sizeof(mode)))
+ return -EFAULT;
+
+ /* enable and disable must not overlap */
+ if (mode.enable & mode.disable)
+ return -EINVAL;
+
+ /* only toggleable features that this kernel/arch actually supports */
+ if ((mode.enable | mode.disable) &
+ ~(uffd_api_available_features() & UFFD_FEATURE_TOGGLEABLE))
+ return -EINVAL;
+
+ /* RWP_ASYNC can only be enabled on contexts that negotiated RWP */
+ if ((mode.enable & UFFD_FEATURE_RWP_ASYNC) &&
+ !(userfaultfd_features(ctx) & UFFD_FEATURE_RWP))
+ return -EINVAL;
+
+ if (!mmget_not_zero(mm))
+ return -ESRCH;
+
+ /*
+ * Drain in-flight faults before flipping features. mmap_write_lock()
+ * blocks new mmap_read_lock() callers, but per-VMA locked faults
+ * (lock_vma_under_rcu() + FAULT_FLAG_VMA_LOCK) that acquired before
+ * this point keep running. Calling vma_start_write() on each UFFD-
+ * armed VMA waits for those readers to drop, so no in-flight fault
+ * can observe the old features after mmap_write_unlock().
+ */
+ mmap_write_lock(mm);
+ {
+ struct vm_area_struct *vma;
+ VMA_ITERATOR(vmi, mm, 0);
+
+ for_each_vma(vmi, vma) {
+ if (vma->vm_userfaultfd_ctx.ctx == ctx)
+ vma_start_write(vma);
+ }
+ }
+ /*
+ * Single WRITE_ONCE so lockless readers (fdinfo, poll/read_iter
+ * via userfaultfd_is_initialized(), and the userfaultfd_features()
+ * helper used elsewhere) can't observe a mid-RMW intermediate
+ * value. Hot-path readers already serialise through the mmap lock
+ * + vma_start_write() drain above, so their load doesn't need an
+ * annotation.
+ */
+ WRITE_ONCE(ctx->features,
+ (ctx->features | mode.enable) & ~mode.disable);
+ mmap_write_unlock(mm);
+
+ /*
+ * If switching to async, wake threads blocked in handle_userfault().
+ * They will retry the fault and auto-resolve under the new mode.
+ * len=0 means wake all pending faults on this context.
+ */
+ if (mode.enable & UFFD_FEATURE_RWP_ASYNC) {
+ struct userfaultfd_wake_range range = { .len = 0 };
+
+ spin_lock_irq(&ctx->fault_pending_wqh.lock);
+ __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
+ &range);
+ __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
+ spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+ }
+
+ mmput(mm);
+ return 0;
+}
+
static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
{
__s64 ret;
@@ -4528,29 +4641,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
goto err_out;
/* report all available features and ioctls to userland */
- uffdio_api.features = UFFD_API_FEATURES;
-#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
- uffdio_api.features &=
- ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
-#endif
- if (!pgtable_supports_uffd())
- uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
-
- if (!uffd_supports_wp_marker()) {
- uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
- uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
- uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
- }
- /*
- * RWP needs both PROT_NONE support and the uffd-wp PTE bit. The
- * VM_UFFD_RWP check covers compile-time unavailability; the
- * pgtable_supports_uffd() check covers runtime (e.g. riscv
- * without the SVRSW60T59B extension) where the PTE bit is declared
- * but not actually usable.
- */
- if (VM_UFFD_RWP == VM_NONE || !pgtable_supports_uffd())
- uffdio_api.features &=
- ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC);
+ uffdio_api.features = uffd_api_available_features();
ret = -EINVAL;
if (features & ~uffdio_api.features)
@@ -4620,6 +4711,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_RWPROTECT:
ret = userfaultfd_rwprotect(ctx, arg);
break;
+ case UFFDIO_SET_MODE:
+ ret = userfaultfd_set_mode(ctx, arg);
+ break;
}
return ret;
}
@@ -4647,7 +4741,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
* protocols: aa:... bb:...
*/
seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
- pending, total, UFFD_API, ctx->features,
+ pending, total, UFFD_API, userfaultfd_features(ctx),
UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
}
#endif
--
2.54.0
* [PATCH v5 17/18] selftests/mm: add userfaultfd RWP tests
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Coverage for UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT:
rwp-async async mode — touch pages, verify permissions are
auto-restored without a message
rwp-sync sync mode — access blocks, handler resolves via
UFFDIO_RWPROTECT
rwp-pagemap PAGEMAP_SCAN reports still-cold pages via
inverted PAGE_IS_ACCESSED
rwp-mprotect RWP survives mprotect(PROT_NONE) ->
mprotect(PROT_READ|PROT_WRITE) round-trip
rwp-gup GUP walks through a protnone RWP PTE (vmsplice()
into a pipe drives the GUP path)
rwp-async-toggle UFFDIO_SET_MODE flips between sync and async
without re-registering
rwp-close closing the uffd restores page permissions
rwp-fork RWP survives fork() with EVENT_FORK; child's
PTEs keep the uffd bit
rwp-fork-pin RWP survives fork() on an RO-longterm-pinned
anon page (forces copy_present_page()); child
read auto-resolves and clears the bit, proving
PAGE_NONE was in place
rwp-wp-exclusive register with MODE_WP|MODE_RWP returns -EINVAL
All tests run against anon, shmem, shmem-private, hugetlb, and
hugetlb-private memory, except rwp-fork-pin which is anon-only —
copy_present_page() is the private-anon pinned-exclusive fork path.
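
The "inverted PAGE_IS_ACCESSED" selection used by rwp-pagemap follows
the general PAGEMAP_SCAN matching rule, which can be modelled as below.
This is a sketch of that rule as documented for category_mask and
category_inverted, not kernel code: CAT_ACCESSED is a placeholder bit
(the real PAGE_IS_ACCESSED value comes from the patched uapi header) and
page_matches() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration only, not the uapi value. */
#define CAT_ACCESSED	(UINT64_C(1) << 0)

/*
 * A page matches when, after flipping the bits named in 'inverted',
 * every bit in 'mask' is set in its category word.
 */
static bool page_matches(uint64_t categories, uint64_t mask,
			 uint64_t inverted)
{
	return ((categories ^ inverted) & mask) == mask;
}
```

With the mask and the inverted set both equal to the accessed bit, only
pages whose accessed bit is clear match, which is how the test selects
the still-protected (cold) half of the range.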
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 765 +++++++++++++++++++
1 file changed, 765 insertions(+)
diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index a6c14109e818..9f5a5ccf6044 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -7,6 +7,8 @@
#include "uffd-common.h"
+#include <linux/fs.h>
+#include <sys/uio.h>
#include "../../../../mm/gup_test.h"
#ifdef __NR_userfaultfd
@@ -109,6 +111,10 @@ static void uffd_test_skip(const char *message)
static void test_uffd_api(bool use_dev)
{
+ const uint64_t expected_ioctls =
+ BIT_ULL(_UFFDIO_REGISTER) |
+ BIT_ULL(_UFFDIO_UNREGISTER) |
+ BIT_ULL(_UFFDIO_API);
struct uffdio_api uffdio_api;
int uffd;
@@ -148,6 +154,15 @@ static void test_uffd_api(bool use_dev)
goto out;
}
+ /* Verify returned fd-level ioctls bitmask */
+ if ((uffdio_api.ioctls & expected_ioctls) != expected_ioctls) {
+ uffd_test_fail("UFFDIO_API missing expected ioctls: "
+ "got=0x%"PRIx64", expected=0x%"PRIx64,
+ (uint64_t)uffdio_api.ioctls,
+ expected_ioctls);
+ goto out;
+ }
+
/* Test double requests of UFFDIO_API with a random feature set */
uffdio_api.features = BIT_ULL(0);
if (ioctl(uffd, UFFDIO_API, &uffdio_api) == 0) {
@@ -602,6 +617,685 @@ void uffd_minor_collapse_test(uffd_global_test_opts_t *gopts, uffd_test_args_t *
uffd_minor_test_common(gopts, true, false);
}
+static int uffd_register_rwp(int uffd, void *addr, uint64_t len)
+{
+ struct uffdio_register reg = {
+ .range = { .start = (unsigned long)addr, .len = len },
+ .mode = UFFDIO_REGISTER_MODE_RWP,
+ };
+
+ if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
+ return -errno;
+ return 0;
+}
+
+static void rwprotect_range(int uffd, __u64 start, __u64 len, bool protect)
+{
+ struct uffdio_rwprotect rwp = {
+ .range = { .start = start, .len = len },
+ .mode = protect ? UFFDIO_RWPROTECT_MODE_RWP : 0,
+ };
+
+ if (ioctl(uffd, UFFDIO_RWPROTECT, &rwp))
+ err("UFFDIO_RWPROTECT failed");
+}
+
+static void set_async_mode(int uffd, bool enable)
+{
+ struct uffdio_set_mode mode = { };
+
+ if (enable)
+ mode.enable = UFFD_FEATURE_RWP_ASYNC;
+ else
+ mode.disable = UFFD_FEATURE_RWP_ASYNC;
+
+ if (ioctl(uffd, UFFDIO_SET_MODE, &mode))
+ err("UFFDIO_SET_MODE failed");
+}
+
+/*
+ * Test async RWP faults on anonymous memory.
+ * Populate pages, register MODE_RWP with RWP_ASYNC,
+ * RW-protect, re-access, verify content preserved and no faults delivered.
+ */
+static void uffd_rwp_async_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+
+ /* Populate all pages with known content */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ /* Register MODE_RWP */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ /* RW-protect all pages (sets protnone) */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Access all pages — should auto-resolve, no faults */
+ for (p = 0; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+ unsigned char expected = p % 255 + 1;
+
+ if (page[0] != expected) {
+ uffd_test_fail("page %lu content mismatch: %u != %u",
+ p, page[0], expected);
+ return;
+ }
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Fault handler for RWP — unprotect the page via UFFDIO_RWPROTECT.
+ */
+static void uffd_handle_rwp_fault(uffd_global_test_opts_t *gopts,
+ struct uffd_msg *msg,
+ struct uffd_args *uargs)
+{
+ if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_RWP))
+ err("expected RWP fault, got 0x%llx",
+ msg->arg.pagefault.flags);
+
+ rwprotect_range(gopts->uffd, msg->arg.pagefault.address,
+ gopts->page_size, false);
+ uargs->minor_faults++;
+}
+
+/*
+ * Test sync RWP faults on anonymous memory.
+ * Populate pages, register MODE_RWP (sync), RW-protect,
+ * access from worker thread, verify fault delivered, UFFDIO_RWPROTECT resolves.
+ */
+static void uffd_rwp_sync_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ pthread_t uffd_mon;
+ struct uffd_args uargs = { };
+ bool failed = false;
+ char c = '\0';
+ unsigned long p;
+
+ uargs.gopts = gopts;
+ uargs.handle_fault = uffd_handle_rwp_fault;
+
+ /* Populate all pages */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ /* Register MODE_RWP */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ /* RW-protect all pages */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Start fault handler thread */
+ if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+ err("uffd_poll_thread create");
+
+ /* Access all pages — triggers sync RWP faults, handler unprotects */
+ for (p = 0; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+
+ if (page[0] != (p % 255 + 1)) {
+ uffd_test_fail("page %lu content mismatch", p);
+ failed = true;
+ goto out;
+ }
+ }
+
+out:
+ /*
+ * Stop the handler before reading minor_faults: the last fault
+ * resolution rwprotect_range()s before incrementing the counter,
+ * so the main thread can race ahead of the increment.
+ */
+ if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+ err("pipe write");
+ if (pthread_join(uffd_mon, NULL))
+ err("join() failed");
+
+ if (failed)
+ return;
+ if (uargs.minor_faults == 0)
+ uffd_test_fail("expected RWP faults, got 0");
+ else
+ uffd_test_pass();
+}
+
+/*
+ * Test PAGEMAP_SCAN detection of RW-protected (cold) pages.
+ */
+static void uffd_rwp_pagemap_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+ struct page_region regions[16];
+ struct pm_scan_arg pm_arg;
+ int pagemap_fd;
+ long ret;
+
+ /* Need at least 4 pages */
+ if (nr_pages < 4) {
+ uffd_test_skip("need at least 4 pages");
+ return;
+ }
+
+ /* Populate all pages */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+ /* Register and RW-protect */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Touch first half of pages to re-activate them (async auto-resolve) */
+ for (p = 0; p < nr_pages / 2; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page;
+ }
+
+ /* Scan for cold (still RW-protected) pages */
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ err("open pagemap");
+
+ /*
+ * PAGE_IS_ACCESSED is set once the uffd-wp bit has been cleared
+ * (access happened, or the user resolved). Invert it to select
+ * still-protected (cold) pages.
+ */
+ memset(&pm_arg, 0, sizeof(pm_arg));
+ pm_arg.size = sizeof(pm_arg);
+ pm_arg.start = (uint64_t)gopts->area_dst;
+ pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+ pm_arg.vec = (uint64_t)regions;
+ pm_arg.vec_len = ARRAY_SIZE(regions);
+ pm_arg.category_mask = PAGE_IS_ACCESSED;
+ pm_arg.category_inverted = PAGE_IS_ACCESSED;
+ pm_arg.return_mask = PAGE_IS_ACCESSED;
+
+ ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+ close(pagemap_fd);
+
+ if (ret < 0) {
+ uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+ return;
+ }
+
+ /*
+ * The second half of pages should be reported as RW-protected.
+ * They may be coalesced into one region.
+ */
+ if (ret < 1) {
+ uffd_test_fail("expected cold pages, got %ld regions", ret);
+ return;
+ }
+
+ /* Verify the cold region covers the second half */
+ uint64_t cold_start = regions[0].start;
+ uint64_t expected_start = (uint64_t)gopts->area_dst +
+ (nr_pages / 2) * page_size;
+
+ if (cold_start != expected_start) {
+ uffd_test_fail("cold region starts at 0x%lx, expected 0x%lx",
+ (unsigned long)cold_start,
+ (unsigned long)expected_start);
+ return;
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Test that RWP protection survives a mprotect(PROT_NONE) ->
+ * mprotect(PROT_READ|PROT_WRITE) round-trip. The uffd-wp bit on a
+ * VM_UFFD_RWP VMA must continue to carry PROT_NONE semantics after
+ * mprotect() changes the base protection; otherwise accesses would
+ * silently succeed and the pagemap bit would stick without a fault
+ * ever clearing it.
+ */
+static void uffd_rwp_mprotect_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+ struct page_region regions[16];
+ struct pm_scan_arg pm_arg;
+ int pagemap_fd;
+ long ret;
+
+ /* Populate all pages */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+ /* Register and RW-protect the whole range */
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Round-trip mprotect(): PROT_NONE -> PROT_READ|PROT_WRITE */
+ if (mprotect(gopts->area_dst, nr_pages * page_size, PROT_NONE))
+ err("mprotect() PROT_NONE");
+ if (mprotect(gopts->area_dst, nr_pages * page_size,
+ PROT_READ | PROT_WRITE))
+ err("mprotect() PROT_READ|PROT_WRITE");
+
+ /* Touch every page. Async RWP must auto-resolve each fault. */
+ for (p = 0; p < nr_pages; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page;
+ }
+
+ /*
+ * After touching, no page should remain RW-protected. A stuck
+ * uffd-wp bit would mean mprotect() silently dropped PROT_NONE and
+ * the access never faulted.
+ */
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ err("open pagemap");
+
+ memset(&pm_arg, 0, sizeof(pm_arg));
+ pm_arg.size = sizeof(pm_arg);
+ pm_arg.start = (uint64_t)gopts->area_dst;
+ pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+ pm_arg.vec = (uint64_t)regions;
+ pm_arg.vec_len = ARRAY_SIZE(regions);
+ pm_arg.category_mask = PAGE_IS_ACCESSED;
+ pm_arg.category_inverted = PAGE_IS_ACCESSED;
+ pm_arg.return_mask = PAGE_IS_ACCESSED;
+
+ ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+ close(pagemap_fd);
+
+ if (ret < 0) {
+ uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+ return;
+ }
+ if (ret != 0) {
+ uffd_test_fail("expected no cold pages after mprotect()+touch, got %ld regions",
+ ret);
+ return;
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Test that GUP resolves through protnone PTEs (async mode).
+ * vmsplice() into a pipe pins user pages via get_user_pages_fast() --
+ * unlike write(), which goes through copy_from_user() and ordinary
+ * hardware page faults -- so it exercises gup_can_follow_protnone() on
+ * the RW-protected PTE. In async mode the kernel auto-restores
+ * permissions and GUP returns the page.
+ */
+static void uffd_rwp_gup_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ struct iovec iov;
+ char buf;
+ int pipefd[2];
+
+ /* Populate first page with known content */
+ memset(gopts->area_dst, 0xCD, gopts->page_size);
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst, gopts->page_size))
+ err("register failure");
+
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ gopts->page_size, true);
+
+ if (pipe(pipefd))
+ err("pipe");
+
+ /*
+ * One byte's worth of iov is enough to GUP the containing page and
+ * keeps the pipe transfer well under any pipe-capacity limit even on
+ * hugetlb-backed runs.
+ */
+ iov.iov_base = gopts->area_dst;
+ iov.iov_len = 1;
+ if (vmsplice(pipefd[1], &iov, 1, 0) != 1) {
+ uffd_test_fail("vmsplice from RW-protected page failed: %s",
+ strerror(errno));
+ goto out;
+ }
+
+ if (read(pipefd[0], &buf, 1) != 1) {
+ uffd_test_fail("read from pipe failed");
+ goto out;
+ }
+
+ if (buf != (char)0xCD) {
+ uffd_test_fail("content mismatch: got 0x%02x, expected 0xCD",
+ (unsigned char)buf);
+ goto out;
+ }
+
+ uffd_test_pass();
+out:
+ close(pipefd[0]);
+ close(pipefd[1]);
+}
+
+/*
+ * Test runtime toggle between async and sync modes.
+ * Start in async mode (detection), flip to sync (eviction), verify faults
+ * block, resolve them, flip back to async.
+ */
+static void uffd_rwp_async_toggle_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ struct uffd_args uargs = { };
+ pthread_t uffd_mon;
+ char c = '\0';
+ unsigned long p;
+
+ uargs.gopts = gopts;
+ uargs.handle_fault = uffd_handle_rwp_fault;
+
+ /* Populate */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ /* Phase 1: async detection — RW-protect, access first half */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ for (p = 0; p < nr_pages / 2; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page; /* auto-resolves in async mode */
+ }
+
+ /* Phase 2: flip to sync for eviction */
+ set_async_mode(gopts->uffd, false);
+
+ /* Start handler — will receive faults for cold pages */
+ if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+ err("uffd_poll_thread create");
+
+ /* Access second half (cold pages) — should trigger sync faults */
+ for (p = nr_pages / 2; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+ if (page[0] != (p % 255 + 1)) {
+ uffd_test_fail("page %lu content mismatch", p);
+ goto out;
+ }
+ }
+
+ /*
+ * Stop the handler before reading minor_faults: the last fault
+ * resolution rwprotect_range()s before incrementing the counter,
+ * so the main thread can race ahead of the increment. Stopping
+ * here also makes Phase 3 a clean async-only test -- with the
+ * handler still running it would silently resolve any sync fault
+ * the kernel erroneously delivers, masking a regression.
+ */
+ if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+ err("pipe write");
+ if (pthread_join(uffd_mon, NULL))
+ err("join() failed");
+
+ if (uargs.minor_faults == 0) {
+ uffd_test_fail("expected sync faults, got 0");
+ return;
+ }
+
+ /* Phase 3: flip back to async */
+ set_async_mode(gopts->uffd, true);
+
+ /* RW-protect and access again — should auto-resolve */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ for (p = 0; p < nr_pages; p++) {
+ volatile char *page = gopts->area_dst + p * page_size;
+ (void)*page;
+ }
+
+ uffd_test_pass();
+ return;
+out:
+ if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+ err("pipe write");
+ if (pthread_join(uffd_mon, NULL))
+ err("join() failed");
+}
+
+/*
+ * Test that RW-protected pages become accessible after closing uffd.
+ */
+static void uffd_rwp_close_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ unsigned long p;
+
+ /* Populate */
+ for (p = 0; p < nr_pages; p++)
+ memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failure");
+
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ nr_pages * page_size, true);
+
+ /* Close uffd: permissions on protnone PTEs should be restored */
+ close(gopts->uffd);
+ gopts->uffd = -1;
+
+ /* All pages should be accessible with original content */
+ for (p = 0; p < nr_pages; p++) {
+ unsigned char *page = (unsigned char *)gopts->area_dst +
+ p * page_size;
+ unsigned char expected = p % 255 + 1;
+
+ if (page[0] != expected) {
+ uffd_test_fail("page %lu not accessible after close", p);
+ return;
+ }
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * Test that RWP protection is preserved across fork() when
+ * UFFD_FEATURE_EVENT_FORK is enabled. Without preservation, the child's
+ * PTEs would lose the uffd-wp marker and RWP-protected accesses would
+ * silently fall through to do_numa_page().
+ */
+static void uffd_rwp_fork_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ int pagemap_fd;
+ uint64_t value;
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+ nr_pages * page_size))
+ err("register failed");
+
+ /* Populate + RWP-protect */
+ *gopts->area_dst = 1;
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+ page_size, true);
+
+ /* Parent: verify uffd-wp bit is set before fork */
+ pagemap_fd = pagemap_open();
+ value = pagemap_get_entry(pagemap_fd, gopts->area_dst);
+ pagemap_check_wp(value, true);
+
+ /*
+ * Fork with EVENT_FORK: child inherits VM_UFFD_RWP. Child reads
+ * its own pagemap and must still see the uffd-wp bit set.
+ */
+ if (pagemap_test_fork(gopts, true, false)) {
+ uffd_test_fail("RWP marker lost in child after fork");
+ goto out;
+ }
+
+ uffd_test_pass();
+out:
+ close(pagemap_fd);
+}
+
+/*
+ * Test that RWP protection on a pinned anon page is preserved across fork().
+ * Pinning forces copy_present_page() in the child path, which must restore
+ * PAGE_NONE on top of the uffd bit. Using async mode, a read in the child
+ * auto-resolves if — and only if — the PTE was actually protnone+uffd; the
+ * cleared uffd bit afterward proves the fault path ran.
+ */
+static void uffd_rwp_fork_pin_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long page_size = gopts->page_size;
+ fork_event_args fevent_args = { .gopts = gopts, .child_uffd = -1 };
+ pin_args pin_args = {};
+ int pagemap_fd, status;
+ pthread_t fevent_thread;
+ uint64_t value;
+ pid_t child;
+
+ if (uffd_register_rwp(gopts->uffd, gopts->area_dst, page_size))
+ err("register failed");
+
+ /* Populate. */
+ *gopts->area_dst = 1;
+
+ /* RO-longterm pin so fork() takes copy_present_page() for this PTE. */
+ if (pin_pages(&pin_args, gopts->area_dst, page_size)) {
+ uffd_test_skip("Possibly CONFIG_GUP_TEST missing or unprivileged");
+ uffd_unregister(gopts->uffd, gopts->area_dst, page_size);
+ return;
+ }
+
+ /* RWP-protect: PTE is now PAGE_NONE + uffd bit. */
+ rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst, page_size, true);
+
+ pagemap_fd = pagemap_open();
+ value = pagemap_get_entry(pagemap_fd, gopts->area_dst);
+ pagemap_check_wp(value, true);
+
+ /*
+ * UFFD_FEATURE_EVENT_FORK is required so the child inherits
+ * VM_UFFD_RWP and the marker; without it dup_userfaultfd() resets
+ * the child VMA and the test would pass for the wrong reason.
+ * dup_userfaultfd() blocks until the EVENT_FORK message is consumed,
+ * so spawn a reader before the fork().
+ */
+ gopts->ready_for_fork = false;
+ if (pthread_create(&fevent_thread, NULL, fork_event_consumer,
+ &fevent_args))
+ err("pthread_create() for fork event consumer");
+ while (!gopts->ready_for_fork)
+ ; /* Wait for consumer to start polling. */
+
+ child = fork();
+ if (child < 0)
+ err("fork");
+ if (child == 0) {
+ volatile char c;
+ int cfd;
+
+ /*
+ * Read the pinned page. Only reaches the fault path if the
+ * child PTE is protnone + uffd; async mode auto-resolves and
+ * clears the uffd bit. If copy_present_page() dropped
+ * PAGE_NONE, the read would silently succeed and the bit
+ * would still be set.
+ */
+ c = *(volatile char *)gopts->area_dst;
+ (void)c;
+
+ cfd = pagemap_open();
+ value = pagemap_get_entry(cfd, gopts->area_dst);
+ close(cfd);
+ _exit((value & PM_UFFD_WP) ? 1 : 0);
+ }
+ if (waitpid(child, &status, 0) < 0)
+ err("waitpid");
+ if (pthread_join(fevent_thread, NULL))
+ err("pthread_join() for fork event consumer");
+ if (fevent_args.child_uffd >= 0)
+ close(fevent_args.child_uffd);
+
+ unpin_pages(&pin_args);
+ close(pagemap_fd);
+ if (uffd_unregister(gopts->uffd, gopts->area_dst, page_size))
+ err("unregister failed");
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+ uffd_test_fail("RWP not enforced in child after pinned fork");
+ return;
+ }
+
+ uffd_test_pass();
+}
+
+/*
+ * WP and RWP share the uffd-wp PTE bit and cannot coexist in the same VMA.
+ * Registration requesting both modes must be rejected.
+ */
+static void uffd_rwp_wp_exclusive_test(uffd_global_test_opts_t *gopts,
+ uffd_test_args_t *args)
+{
+ unsigned long nr_pages = gopts->nr_pages;
+ unsigned long page_size = gopts->page_size;
+ struct uffdio_register reg = { };
+
+ reg.range.start = (unsigned long)gopts->area_dst;
+ reg.range.len = nr_pages * page_size;
+ reg.mode = UFFDIO_REGISTER_MODE_WP | UFFDIO_REGISTER_MODE_RWP;
+
+ if (ioctl(gopts->uffd, UFFDIO_REGISTER, &reg) == 0) {
+ uffd_test_fail("register with WP|RWP unexpectedly succeeded");
+ return;
+ }
+ if (errno != EINVAL) {
+ uffd_test_fail("register with WP|RWP: expected EINVAL, got %d",
+ errno);
+ return;
+ }
+ uffd_test_pass();
+}
+
static sigjmp_buf jbuf, *sigbuf;
static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -1604,6 +2298,77 @@ uffd_test_case_t uffd_tests[] = {
/* We can't test MADV_COLLAPSE, so try our luck */
.uffd_feature_required = UFFD_FEATURE_MINOR_SHMEM,
},
+ {
+ .name = "rwp-async",
+ .uffd_fn = uffd_rwp_async_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-sync",
+ .uffd_fn = uffd_rwp_sync_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required = UFFD_FEATURE_RWP,
+ },
+ {
+ .name = "rwp-pagemap",
+ .uffd_fn = uffd_rwp_pagemap_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-mprotect",
+ .uffd_fn = uffd_rwp_mprotect_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-gup",
+ .uffd_fn = uffd_rwp_gup_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-async-toggle",
+ .uffd_fn = uffd_rwp_async_toggle_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ },
+ {
+ .name = "rwp-close",
+ .uffd_fn = uffd_rwp_close_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required = UFFD_FEATURE_RWP,
+ },
+ {
+ .name = "rwp-fork",
+ .uffd_fn = uffd_rwp_fork_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_EVENT_FORK,
+ },
+ {
+ .name = "rwp-fork-pin",
+ .uffd_fn = uffd_rwp_fork_pin_test,
+ .mem_targets = MEM_ANON,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC |
+ UFFD_FEATURE_EVENT_FORK,
+ },
+ {
+ .name = "rwp-wp-exclusive",
+ .uffd_fn = uffd_rwp_wp_exclusive_test,
+ .mem_targets = MEM_ALL,
+ .uffd_feature_required =
+ UFFD_FEATURE_RWP |
+ UFFD_FEATURE_PAGEFAULT_FLAG_WP |
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
+ },
{
.name = "sigbus",
.uffd_fn = uffd_sigbus_test,
--
2.54.0
* [PATCH v5 18/18] Documentation/userfaultfd: document RWP working set tracking
2026-05-26 13:04 [PATCH v5 00/18] userfaultfd: working set tracking for VM guest memory Kiryl Shutsemau
` (16 preceding siblings ...)
2026-05-26 13:05 ` [PATCH v5 17/18] selftests/mm: add userfaultfd RWP tests Kiryl Shutsemau
@ 2026-05-26 13:05 ` Kiryl Shutsemau
17 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-26 13:05 UTC (permalink / raw)
To: akpm, rppt, peterx, david
Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
Kiryl Shutsemau (Meta)
From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Add an admin-guide section covering UFFDIO_REGISTER_MODE_RWP:
- sync and async fault models;
- UFFDIO_RWPROTECT semantics;
- UFFD_FEATURE_RWP_ASYNC;
- UFFDIO_SET_MODE runtime mode flips.
It also covers a typical VMM working-set-tracking workflow, from the detection
loop through sync-mode eviction and back to async.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
Documentation/admin-guide/mm/userfaultfd.rst | 238 ++++++++++++++++++-
1 file changed, 232 insertions(+), 6 deletions(-)
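Note on the PAGEMAP_SCAN example in the documentation below: the cold-page
query depends on how the scan filter combines category_inverted with
category_mask. The following is a minimal userspace sketch of that matching
rule, not the kernel code. SKETCH_PAGE_IS_ACCESSED is a placeholder value (the
real PAGE_IS_ACCESSED comes from the patched UAPI header), and the struct and
function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit: the real PAGE_IS_ACCESSED value is defined by the series. */
#define SKETCH_PAGE_IS_ACCESSED (1ULL << 0)

/* Hypothetical subset of struct pm_scan_arg: only the filter fields. */
struct sketch_scan_filter {
	uint64_t category_mask;       /* all of these must match */
	uint64_t category_inverted;   /* these bits are flipped before matching */
	uint64_t category_anyof_mask; /* if non-zero, at least one must match */
};

/* Apply the filter to the category bits of a single page. */
static bool sketch_page_matches(uint64_t categories,
				const struct sketch_scan_filter *f)
{
	/* Inverted categories match when the bit is clear. */
	categories ^= f->category_inverted;
	if ((categories & f->category_mask) != f->category_mask)
		return false;
	if (f->category_anyof_mask && !(categories & f->category_anyof_mask))
		return false;
	return true;
}

/* The cold-page query: mask and inverted both name the accessed bit. */
static bool sketch_page_is_cold(uint64_t categories)
{
	const struct sketch_scan_filter cold = {
		.category_mask = SKETCH_PAGE_IS_ACCESSED,
		.category_inverted = SKETCH_PAGE_IS_ACCESSED,
	};

	return sketch_page_matches(categories, &cold);
}
```

With this filter a page is selected only while its accessed bit is clear, so
the scan result lists the still-protected ranges and skips pages a vCPU has
already touched. Other category bits on the page do not affect the outcome.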
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 1e533639fd50..1db99b5355f7 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -275,16 +275,16 @@ tracking and it can be different in a few ways:
- Dirty information will not get lost if the pte was zapped due to
various reasons (e.g. during split of a shmem transparent huge page).
- - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit
- set; dirty when uffd-wp bit cleared), it has different semantics on
- some of the memory operations. For example: ``MADV_DONTNEED`` on
+ - Due to a reverted meaning of soft-dirty (page clean when the uffd bit
+ is set; dirty when the uffd bit is cleared), it has different semantics
+ on some of the memory operations. For example: ``MADV_DONTNEED`` on
anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as
- dirtying of memory by dropping uffd-wp bit during the procedure.
+ dirtying of memory by dropping the uffd bit during the procedure.
The user app can collect the "written/dirty" status by looking up the
-uffd-wp bit for the pages being interested in /proc/pagemap.
+uffd bit for the pages being interested in /proc/pagemap.
-The page will not be under track of uffd-wp async mode until the page is
+The page will not be under track of userfaultfd-wp async mode until the page is
explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode
flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault
that was tracked by async mode userfaultfd-wp is invalid.
@@ -307,6 +307,232 @@ transparent to the guest, we want that same address range to act as if it was
still poisoned, even though it's on a new physical host which ostensibly
doesn't have a memory error in the exact same spot.
+Read-Write Protection
+---------------------
+
+``UFFDIO_REGISTER_MODE_RWP`` enables read-write protection tracking on a
+memory range. It is similar to (but faster than) ``mprotect(PROT_NONE)``
+combined with a signal handler; unlike ``mprotect(PROT_NONE)``, RWP only
+traps accesses to *present* PTEs, so accesses to unpopulated addresses in a
+protected range fall through to the normal missing-page path. It uses the
+PROT_NONE hinting mechanism (same as NUMA balancing) to make pages
+inaccessible while keeping them resident in memory. RWP works on anonymous,
+shmem, and hugetlbfs memory.
+
+RWP is designed for VM memory managers that need to track the working set
+of guest memory for cold page eviction to tiered or remote storage.
+
+**Setup:**
+
+1. Open a userfaultfd and enable ``UFFD_FEATURE_RWP`` via ``UFFDIO_API``.
+ Optionally request ``UFFD_FEATURE_RWP_ASYNC`` as well — it requires
+ ``UFFD_FEATURE_RWP`` to be set in the same ``UFFDIO_API`` call.
+
+2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_RWP``
+ (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be
+ fetched back from storage).
+
+**Feature availability:**
+
+RWP is built on top of two kernel primitives: a spare PTE bit owned by
+userfaultfd (``CONFIG_HAVE_ARCH_USERFAULTFD_WP``) and architecture support
+for present-but-inaccessible PTEs (``CONFIG_ARCH_HAS_PTE_PROTNONE``). When both
+are available on a 64-bit kernel, the build selects
+``CONFIG_USERFAULTFD_RWP=y`` and the ``VM_UFFD_RWP`` VMA flag becomes
+available.
+
+``UFFD_FEATURE_RWP`` and ``UFFD_FEATURE_RWP_ASYNC`` are masked out of the
+features returned by ``UFFDIO_API`` when the running kernel or architecture
+cannot support them — for example 32-bit kernels (where ``VM_UFFD_RWP`` is
+unavailable), kernels built without ``CONFIG_USERFAULTFD_RWP``, and
+architectures whose ptes cannot carry the uffd bit at runtime (e.g. riscv
+without the ``SVRSW60T59B`` extension). ``UFFDIO_API`` does not fail;
+unsupported bits are simply absent from ``uffdio_api.features`` on return.
+Callers should inspect the returned ``features`` after ``UFFDIO_API`` and
+fall back to another tracking method when RWP is unavailable.
+
+**Protecting and Unprotecting:**
+
+Use ``UFFDIO_RWPROTECT`` to protect or unprotect a range, mirroring the
+``UFFDIO_WRITEPROTECT`` interface::
+
+ struct uffdio_rwprotect rwp = {
+ .range = { .start = addr, .len = len },
+ .mode = UFFDIO_RWPROTECT_MODE_RWP, /* protect */
+ };
+ ioctl(uffd, UFFDIO_RWPROTECT, &rwp);
+
+Setting ``UFFDIO_RWPROTECT_MODE_RWP`` sets PROT_NONE on present PTEs in the
+range. Pages stay resident and their physical frames are preserved — only
+access permissions are removed.
+
+Clearing ``UFFDIO_RWPROTECT_MODE_RWP`` restores normal VMA permissions and
+wakes any faulting threads (unless ``UFFDIO_RWPROTECT_MODE_DONTWAKE`` is set).
+
+**Scope of protection:**
+
+RWP protection is a property of *present* PTEs. ``UFFDIO_RWPROTECT`` only
+affects entries that are already populated. Unpopulated addresses within
+the range remain unpopulated; when first accessed they fault through the
+normal missing path (``do_anonymous_page()``, ``do_swap_page()``,
+``finish_fault()``) and the resulting PTE is not RWP-protected. To observe
+the population itself, co-register the range with
+``UFFDIO_REGISTER_MODE_MISSING``.
+
+Protection is preserved across page reclaim: a page swapped out while
+RWP-protected carries the marker on its swap entry, and swap-in restores
+the PROT_NONE state so the first access after swap-in still faults. The
+same applies to pages temporarily replaced by migration entries.
+
+Operations that drop the PTE entirely — ``MADV_DONTNEED`` on anonymous
+memory, hole-punch on shmem, truncation of a file mapping — also drop the
+RWP marker: the next access re-populates the range without protection.
+Unlike WP (which persists via ``PTE_MARKER_UFFD_WP``), there is no
+persistent RWP marker today. The user needs to re-arm the range with
+``UFFDIO_RWPROTECT`` after any operation that explicitly frees PTEs.
+
+**Fault Handling:**
+
+When a protected page is accessed:
+
+- **Sync mode** (default): The faulting thread blocks and a
+ ``UFFD_PAGEFAULT_FLAG_RWP`` message is delivered to the userfaultfd
+ handler. The handler resolves the fault with ``UFFDIO_RWPROTECT``
+ (clearing ``MODE_RWP``), which restores the PTE permissions and wakes
+ the faulting thread.
+
+- **Async mode** (``UFFD_FEATURE_RWP_ASYNC``): The kernel automatically
+ restores PTE permissions and the thread continues without blocking. No
+ message is delivered to the handler.
+
+**Runtime Mode Switching:**
+
+``UFFDIO_SET_MODE`` toggles ``UFFD_FEATURE_RWP_ASYNC`` at runtime, allowing
+the VMM to switch between lightweight async detection and safe sync
+eviction without re-registering. The toggle takes ``mmap_write_lock()``
+and calls ``vma_start_write()`` on each UFFD-armed VMA, draining
+in-flight per-VMA-locked faults before the new mode takes effect.
+
+**Cold Page Detection with PAGEMAP_SCAN:**
+
+RWP-protected PTEs carry the uffd PTE bit; the fault-resolution path
+clears it. ``PAGEMAP_SCAN`` reports ``PAGE_IS_ACCESSED`` once the bit is
+clear on a ``VM_UFFD_RWP`` VMA, so inverting it efficiently reports the
+still-protected (cold) pages::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .start = guest_mem_start,
+ .end = guest_mem_end,
+ .vec = (uint64_t)regions,
+ .vec_len = regions_len,
+ .category_mask = PAGE_IS_ACCESSED,
+ .category_inverted = PAGE_IS_ACCESSED,
+ .return_mask = PAGE_IS_ACCESSED,
+ };
+ long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
+
+The returned ``page_region`` array contains contiguous cold ranges that can
+then be evicted.
+
+**Cleanup:**
+
+When the userfaultfd is closed or the range is unregistered, all PROT_NONE
+PTEs are automatically restored to their normal VMA permissions. This
+prevents pages from becoming permanently inaccessible.
+
+**VMM Working Set Tracking Workflow:**
+
+A typical VMM lifecycle for cold page eviction to tiered storage follows. Two
+mappings of the same shmem (or hugetlbfs) file are used: ``guest_mem`` is
+the RWP-registered mapping that vCPUs access through, and ``io_mem`` is a
+second, unregistered mapping for VMM-side I/O. Reading ``io_mem`` does not go
+through the RWP-protected PTEs of ``guest_mem``, so the VMM's own ``pwrite()``
+never traps on its own protection::
+
+ /* One-time setup */
+ fd = memfd_create("guest", MFD_CLOEXEC);
+ ftruncate(fd, guest_size);
+ guest_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0); /* vCPU view, RWP-registered */
+ io_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0); /* VMM I/O view, unprotected */
+
+ uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK);
+ struct uffdio_api api = {
+ .api = UFFD_API,
+ .features = UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+ };
+ ioctl(uffd, UFFDIO_API, &api);
+ if (!(api.features & UFFD_FEATURE_RWP))
+ return -1; /* RWP unavailable on this kernel/arch -- fall back. */
+ ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){
+ .range = { (uintptr_t)guest_mem, guest_size },
+ .mode = UFFDIO_REGISTER_MODE_RWP |
+ UFFDIO_REGISTER_MODE_MISSING,
+ });
+
+ /* Tracking loop */
+ while (vm_running) {
+ /* 1. Detection phase (async -- no vCPU stalls) */
+ ioctl(uffd, UFFDIO_RWPROTECT, &(struct uffdio_rwprotect){
+ .range = full_range,
+ .mode = UFFDIO_RWPROTECT_MODE_RWP });
+ sleep(tracking_interval);
+
+ /*
+ * 2. Switch to sync BEFORE scanning. In async mode a vCPU
+ * access between the scan and any eviction step silently
+ * clears the uffd bit, so the scan would already disagree
+ * with the page state by the time eviction begins. Sync mode
+ * blocks vCPU accesses, freezing the cold snapshot for the
+ * rest of the iteration.
+ */
+ ioctl(uffd, UFFDIO_SET_MODE,
+ &(struct uffdio_set_mode){
+ .disable = UFFD_FEATURE_RWP_ASYNC });
+
+ /* 3. Find cold pages (uffd bit still set) */
+ ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){
+ .category_mask = PAGE_IS_ACCESSED,
+ .category_inverted = PAGE_IS_ACCESSED,
+ .return_mask = PAGE_IS_ACCESSED,
+ ...
+ });
+
+ /* 4. Evict cold pages (vCPU faults block on guest_mem) */
+ for each cold range:
+ /* Read from io_mem -- bypasses RWP, no fault. */
+ pwrite(storage_fd, (char *)io_mem + cold_offset,
+ len, cold_offset);
+ /* Drop the page from the shared file. */
+ fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ cold_offset, len);
+ /*
+ * Wake any vCPU blocked on the RWP fault for this range:
+ * fallocate() does not iterate ctx->fault_pending_wqh.
+ */
+ ioctl(uffd, UFFDIO_WAKE, &(struct uffdio_range){
+ .start = (uintptr_t)guest_mem + cold_offset,
+ .len = len });
+
+ /* 5. Resume async tracking */
+ ioctl(uffd, UFFDIO_SET_MODE,
+ &(struct uffdio_set_mode){
+ .enable = UFFD_FEATURE_RWP_ASYNC });
+ }
+
+During step 4, a vCPU that accesses ``guest_mem + cold_offset`` blocks
+with a ``UFFD_PAGEFAULT_FLAG_RWP`` fault while the eviction is in
+progress. After ``fallocate()`` punches the page out and ``UFFDIO_WAKE``
+fires, the vCPU retries the access, faults as ``MISSING``, and the
+handler resolves it with ``UFFDIO_COPY`` from storage.
+
+This workflow targets shmem and hugetlbfs (both support a second,
+unregistered ``io_mem`` mapping over the same fd). Anonymous-memory backings
+need a different inner-loop strategy because the VMM has no way to read the
+page without going through the RWP-protected mapping.
+
QEMU/KVM
========
--
2.54.0
* Re: [PATCH v5 09/18] mm: add MM_CP_UFFD_RWP change_protection() flag
2026-05-26 13:04 ` [PATCH v5 09/18] mm: add MM_CP_UFFD_RWP change_protection() flag Kiryl Shutsemau
@ 2026-05-29 1:19 ` SeongJae Park
0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-29 1:19 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: SeongJae Park, akpm, rppt, peterx, david, ljs, surenb, vbabka,
Liam.Howlett, ziy, corbet, skhan, seanjc, pbonzini, jthoughton,
aarcange, usama.arif, linux-mm, linux-kernel, linux-doc,
linux-kselftest, kvm, kernel-team, Kiryl Shutsemau (Meta)
On Tue, 26 May 2026 14:04:57 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:
> From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
>
> Preparatory patch. Add the change_protection() primitive that
> userfaultfd RWP will use.
>
> An RWP-protected PTE is PAGE_NONE with the uffd PTE bit set. The
> PROT_NONE half makes the CPU fault on any access; the uffd bit
> distinguishes an RWP fault from a plain mprotect(PROT_NONE) or NUMA
> hinting fault. MM_CP_UFFD_WP and MM_CP_UFFD_RWP share the same PTE
> bit, so the two cannot be used together on the same range.
>
> Two new change_protection() flags:
>
> MM_CP_UFFD_RWP install PAGE_NONE and set the uffd bit
> MM_CP_UFFD_RWP_RESOLVE restore vma->vm_page_prot, clear the uffd bit
>
> Both are wired through change_pte_range(), change_huge_pmd(), and
> hugetlb_change_protection() so anon, shmem, THP, and hugetlb all
> share the same semantics.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
> ---
> include/linux/mm.h | 5 ++++
> include/linux/userfaultfd_k.h | 1 -
> mm/huge_memory.c | 30 +++++++++++++----------
> mm/hugetlb.c | 25 ++++++++++++++-----
> mm/mprotect.c | 46 +++++++++++++++++++++++++++--------
> 5 files changed, 77 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6499cfb61dc4..f79801816f32 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3297,6 +3297,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> MM_CP_UFFD_WP_RESOLVE)
> +/* Whether this change is for uffd RWP */
> +#define MM_CP_UFFD_RWP (1UL << 4) /* do rwp */
> +#define MM_CP_UFFD_RWP_RESOLVE (1UL << 5) /* resolve rwp */
> +#define MM_CP_UFFD_RWP_ALL (MM_CP_UFFD_RWP | \
> + MM_CP_UFFD_RWP_RESOLVE)
>
> bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 0aef628514df..564eb2aac321 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -361,7 +361,6 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> return false;
> }
>
> -
Nice cleanup.
Thanks,
SJ
[...]
* Re: [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag
2026-05-26 13:04 ` [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag Kiryl Shutsemau
@ 2026-05-29 7:24 ` Lorenzo Stoakes
2026-05-29 13:07 ` Kiryl Shutsemau
0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2026-05-29 7:24 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team, Kiryl Shutsemau (Meta)
On Tue, May 26, 2026 at 02:04:56PM +0100, Kiryl Shutsemau wrote:
> From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
>
> Preparatory patch for userfaultfd read-write protection (RWP). RWP
> extends userfaultfd protection from plain write-protection (WP) to
> full read-write protection: accesses to an RWP-protected range --
> reads as well as writes -- trap through userfaultfd.
>
> Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and
> userfaultfd_protected() helpers, and wire up the smaps "ur" entry and
> the trace-flag table the rest of the series will use. The flag is
> gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the
> UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and
> every downstream check folds to dead code.
>
> Nothing sets or queries the flag yet.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Assisted-by: Claude:claude-opus-4-6
Hm, if you've just used claude to bounce ideas off, I'm really not sure if
it's necessary to disclose, though I respect your thoroughness for doing so
:)
I guess determining the threshold at which it makes sense to do so is still
a WIP for us in the kernel.
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Reviewed-by: SeongJae Park <sj@kernel.org>
> ---
> Documentation/filesystems/proc.rst | 1 +
> fs/proc/task_mmu.c | 3 +++
> include/linux/mm.h | 28 +++++++++++++++++----------
> include/linux/userfaultfd_k.h | 31 +++++++++++++++++++++++++-----
> include/trace/events/mmflags.h | 7 +++++++
> 5 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167befb7b..db28207c5290 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -607,6 +607,7 @@ encoded manner. The codes are the following:
> um userfaultfd missing tracking
> uw userfaultfd wr-protect tracking
> ui userfaultfd minor fault
> + ur userfaultfd read-write-protect tracking
> ss shadow/guarded control stack page
> sl sealed
> lf lock on fault pages
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 1e5f6ee8a3b6..974c5f4aa533 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1237,6 +1237,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> [ilog2(VM_UFFD_MINOR)] = "ui",
> #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> +#ifdef CONFIG_USERFAULTFD_RWP
> + [ilog2(VM_UFFD_RWP)] = "ur",
> +#endif
> #ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> [ilog2(VM_SHADOW_STACK)] = "ss",
> #endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 71b11945e4fc..6499cfb61dc4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -362,6 +362,7 @@ enum {
> #endif
> DECLARE_VMA_BIT(UFFD_MINOR, 41),
> DECLARE_VMA_BIT(SEALED, 42),
> + DECLARE_VMA_BIT(UFFD_RWP, 43),
I'm guessing CONFIG_USERFAULTFD_RWP is predicated on CONFIG_64BIT?
It's a silly situation and once my VMA flags stuff is done it'll be
eliminated but for now... :)
> /* Flags that reuse flags above. */
> DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0),
> DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1),
> @@ -505,6 +506,11 @@ enum {
> #else
> #define VM_UFFD_MINOR VM_NONE
> #endif
> +#ifdef CONFIG_USERFAULTFD_RWP
> +#define VM_UFFD_RWP INIT_VM_FLAG(UFFD_RWP)
> +#else
> +#define VM_UFFD_RWP VM_NONE
> +#endif
> #ifdef CONFIG_64BIT
> #define VM_ALLOW_ANY_UNCACHED INIT_VM_FLAG(ALLOW_ANY_UNCACHED)
> #define VM_SEALED INIT_VM_FLAG(SEALED)
> @@ -642,22 +648,24 @@ enum {
> * reconsistuted upon page fault, so necessitate page table copying upon fork.
> *
> * Note that these flags should be compared with the DESTINATION VMA not the
> - * source, as VM_UFFD_WP may not be propagated to destination, while all other
> - * flags will be.
> + * source: VM_UFFD_WP and VM_UFFD_RWP may be cleared on the destination
> + * (dup_userfaultfd() -> userfaultfd_reset_ctx() when the parent context did
> + * not negotiate UFFD_FEATURE_EVENT_FORK), while all other flags propagate.
> *
> * VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot be
> * reasonably reconstructed on page fault.
> *
> * VM_UFFD_WP - Encodes metadata about an installed uffd
> - * write protect handler, which cannot be
> - * reconstructed on page fault.
> + * VM_UFFD_RWP write- or read-write-protect handler, which
> + * cannot be reconstructed on page fault.
> *
> - * We always copy pgtables when dst_vma has uffd-wp
> - * enabled even if it's file-backed
> - * (e.g. shmem). Because when uffd-wp is enabled,
> - * pgtable contains uffd-wp protection information,
> - * that's something we can't retrieve from page cache,
> - * and skip copying will lose those info.
> + * We always copy pgtables when dst_vma has the
> + * uffd PTE bit in use even if it's file-backed
> + * (e.g. shmem). Because when the uffd bit is
> + * in use, the pgtable contains the protection
> + * information, that's something we can't
> + * retrieve from page cache, and skip copying
> + * will lose those info.
> *
> * VM_MAYBE_GUARD - Could contain page guard region markers which
> * by design are a property of the page tables
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index f4cf5763f92c..0aef628514df 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -21,10 +21,11 @@
> #include <linux/hugetlb_inline.h>
>
> /* The set of all possible UFFD-related VM flags. */
> -#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
> +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \
> + VM_UFFD_WP | VM_UFFD_RWP)
>
> #define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT, \
> - VMA_UFFD_MINOR_BIT)
> + VMA_UFFD_MINOR_BIT, VMA_UFFD_RWP_BIT)
>
> /*
> * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> @@ -178,7 +179,7 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> */
> static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> + return vma->vm_flags & (VM_UFFD_MINOR | VM_UFFD_WP | VM_UFFD_RWP);
While we're here we might as well switch to using the new API?
Can do:
return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
One unfortunate thing is using bit values means we can't do the VM_NONE
trick, but if !CONFIG_USERFAULTFD_RWP then VMA_UFFD_RWP_BIT wouldn't be set
anyway, same for minor so this should be fine?
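For reference, a minimal userspace sketch of the VM_NONE trick mentioned
above, where a compiled-out flag aliases zero so every test against it folds
to false. All SKETCH_* names, the struct, and the helpers are illustrative
stand-ins rather than the kernel definitions. Only bit 43 is taken from the
patch; the WP bit value is a placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; the real flags live in include/linux/mm.h. */
#define SKETCH_VM_NONE		0ULL
#define SKETCH_VM_UFFD_WP	(1ULL << 12)	/* placeholder bit */

#ifdef SKETCH_CONFIG_USERFAULTFD_RWP
#define SKETCH_VM_UFFD_RWP	(1ULL << 43)	/* bit 43, as in the patch */
#else
#define SKETCH_VM_UFFD_RWP	SKETCH_VM_NONE	/* feature compiled out */
#endif

/* Hypothetical, reduced VMA: only the flags word matters here. */
struct sketch_vma {
	uint64_t vm_flags;
};

static bool sketch_userfaultfd_wp(const struct sketch_vma *vma)
{
	return (vma->vm_flags & SKETCH_VM_UFFD_WP) != 0;
}

/* With the flag aliased to zero this is constant false. */
static bool sketch_userfaultfd_rwp(const struct sketch_vma *vma)
{
	return (vma->vm_flags & SKETCH_VM_UFFD_RWP) != 0;
}

static bool sketch_userfaultfd_protected(const struct sketch_vma *vma)
{
	return sketch_userfaultfd_wp(vma) || sketch_userfaultfd_rwp(vma);
}
```

Built without SKETCH_CONFIG_USERFAULTFD_RWP, sketch_userfaultfd_rwp() returns
false whatever is in vm_flags, which is the dead-code folding the mask-based
form provides and a plain bit number does not.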
> }
>
> /*
> @@ -208,6 +209,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> return vma->vm_flags & VM_UFFD_MINOR;
> }
>
> +static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
> +{
> + return vma->vm_flags & VM_UFFD_RWP;
> +}
Can be:
return vma_test(vma, VMA_UFFD_RWP_BIT);
> +
> +static inline bool userfaultfd_protected(struct vm_area_struct *vma)
> +{
> + return userfaultfd_wp(vma) || userfaultfd_rwp(vma);
> +}
> +
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
> {
> @@ -328,6 +339,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> return false;
> }
>
> +static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
> +{
> + return false;
> +}
> +
> +static inline bool userfaultfd_protected(struct vm_area_struct *vma)
> +{
> + return false;
> +}
> +
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
> {
> @@ -421,8 +442,8 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
> }
>
> /*
> - * Returns true if this is a swap pte and was uffd-wp wr-protected in either
> - * forms (pte marker or a normal swap pte), false otherwise.
> + * Returns true if this swap pte carries uffd-tracked state in either
> + * form (pte marker or a normal swap pte), false otherwise.
> */
> static inline bool pte_swp_uffd_any(pte_t pte)
> {
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index a6e5a44c9b42..bfface3d0203 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -194,6 +194,12 @@ IF_HAVE_PG_ARCH_3(arch_3)
> # define IF_HAVE_UFFD_MINOR(flag, name)
> #endif
>
> +#ifdef CONFIG_USERFAULTFD_RWP
> +# define IF_HAVE_UFFD_RWP(flag, name) {flag, name},
> +#else
> +# define IF_HAVE_UFFD_RWP(flag, name)
> +#endif
> +
> #if defined(CONFIG_64BIT) || defined(CONFIG_PPC32)
> # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
> #else
> @@ -215,6 +221,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \
> {VM_PFNMAP, "pfnmap" }, \
> {VM_MAYBE_GUARD, "maybe_guard" }, \
> {VM_UFFD_WP, "uffd_wp" }, \
> +IF_HAVE_UFFD_RWP(VM_UFFD_RWP, "uffd_rwp" ) \
> {VM_LOCKED, "locked" }, \
> {VM_IO, "io" }, \
> {VM_SEQ_READ, "seqread" }, \
> --
> 2.54.0
>
Cheers, Lorenzo
* Re: [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag
2026-05-29 7:24 ` Lorenzo Stoakes
@ 2026-05-29 13:07 ` Kiryl Shutsemau
2026-05-29 14:00 ` Lorenzo Stoakes
0 siblings, 1 reply; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-29 13:07 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team
On Fri, May 29, 2026 at 08:24:55AM +0100, Lorenzo Stoakes wrote:
> On Tue, May 26, 2026 at 02:04:56PM +0100, Kiryl Shutsemau wrote:
> > From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
> >
> > Preparatory patch for userfaultfd read-write protection (RWP). RWP
> > extends userfaultfd protection from plain write-protection (WP) to
> > full read-write protection: accesses to an RWP-protected range --
> > reads as well as writes -- trap through userfaultfd.
> >
> > Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and
> > userfaultfd_protected() helpers, and wire up the smaps "ur" entry and
> > the trace-flag table the rest of the series will use. The flag is
> > gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the
> > UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and
> > every downstream check folds to dead code.
> >
> > Nothing sets or queries the flag yet.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > Assisted-by: Claude:claude-opus-4-6
>
> Hm, if you've just used claude to bounce ideas off, I'm really not sure if
> it's necessary to disclose, though I respect your thoroughness for doing so
> :)
I've elaborated on how I used Claude in reply to Andrew:
https://lore.kernel.org/all/af5eALk9yO8pPcHv@thinkstation
It is more than bouncing ideas.
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 71b11945e4fc..6499cfb61dc4 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -362,6 +362,7 @@ enum {
> > #endif
> > DECLARE_VMA_BIT(UFFD_MINOR, 41),
> > DECLARE_VMA_BIT(SEALED, 42),
> > + DECLARE_VMA_BIT(UFFD_RWP, 43),
>
> I'm guessing CONFIG_USERFAULTFD_RWP is predicated on CONFIG_64BIT?
Yes:
depends on 64BIT && ARCH_HAS_PTE_PROTNONE && HAVE_ARCH_USERFAULTFD_WP
>
> It's a silly situation and once my VMA flags stuff is done it'll be
> eliminated but for now... :)
Yeah. I actually would appreciate your take on 04/18. It is related.
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index f4cf5763f92c..0aef628514df 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -21,10 +21,11 @@
> > #include <linux/hugetlb_inline.h>
> >
> > /* The set of all possible UFFD-related VM flags. */
> > -#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
> > +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \
> > + VM_UFFD_WP | VM_UFFD_RWP)
> >
> > #define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT, \
> > - VMA_UFFD_MINOR_BIT)
> > + VMA_UFFD_MINOR_BIT, VMA_UFFD_RWP_BIT)
> >
> > /*
> > * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> > @@ -178,7 +179,7 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > */
> > static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> > {
> > - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> > + return vma->vm_flags & (VM_UFFD_MINOR | VM_UFFD_WP | VM_UFFD_RWP);
>
> While we're here we might as well switch to using the new API?
>
> Can do:
>
> return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
>
> One unfortunate thing is using bit values means we can't do the VM_NONE
> trick, but if !CONFIG_USERFAULTFD_RWP then VMA_UFFD_RWP_BIT wouldn't be set
> anyway, same for minor so this should be fine?
I think we need to decide first if the 04/18 direction is right.
We can define VMA_UFFD_RWP_BIT to VMA_NO_BIT if !CONFIG_USERFAULTFD_RWP.
> > }
> >
> > /*
> > @@ -208,6 +209,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> > return vma->vm_flags & VM_UFFD_MINOR;
> > }
> >
> > +static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
> > +{
> > + return vma->vm_flags & VM_UFFD_RWP;
> > +}
>
> Can be:
>
> return vma_test(vma, VMA_UFFD_RWP_BIT);
Yep.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v5 04/18] mm: skip out-of-range bits in mk_vma_flags()
2026-05-26 13:04 ` [PATCH v5 04/18] mm: skip out-of-range bits in mk_vma_flags() Kiryl Shutsemau
@ 2026-05-29 14:00 ` Lorenzo Stoakes
2026-05-29 16:09 ` Kiryl Shutsemau
0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2026-05-29 14:00 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team, Kiryl Shutsemau (Meta), stable
On Tue, May 26, 2026 at 02:04:52PM +0100, Kiryl Shutsemau wrote:
> From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
>
> vma_flags_t is one unsigned long on 32-bit -- NUM_VMA_FLAG_BITS ==
> BITS_PER_LONG by design, so VM_xxx-declared bits sit in the first
> word and hit the single-long fast path. But the bit enum declares
> some bits unconditionally above BITS_PER_LONG (VMA_UFFD_MINOR_BIT
> == 41 today, with VM_UFFD_MINOR == VM_NONE on 32-bit so no VMA
> actually carries the bit).
Yeah ugh.
>
> Passing such a bit to mk_vma_flags() goes through __set_bit(41,
> &one_long) and writes one word past the end. The compiler folds
> the OOB store with wraparound (1UL << (41 % 32) == bit 9) into
> the first word. Bit 9 is already in __VMA_UFFD_FLAGS so the mask
> happens to come out right today, but any high-numbered bit whose
That is... helpful :) but not great that this is the situation, an
oversight, clearly! How I hate 32-bit kernels :)
> mod-BITS_PER_LONG position is otherwise unused would silently OR
> an extra bit into the mask.
>
> Add VMA_NO_BIT and have DECLARE_VMA_BIT() resolve any bitnum out
> of range to it. vma_flags_set_flag() drops negative bit values.
> The ternary collapses at compile time, the runtime check folds
> away when the bit is in range, and the common path is unchanged.
Hmm are you sure it does?
A key design goal was that mk_vma_flags() generates compile-time constants
the same as if the bitmap were constructed independently.
This surely must generate code? Or at least runs a significant risk of it?
Setting a precedent here would probably lead to VMA_NO_BIT being used more
and therefore generating code in more places I think.
And I'm not sure I really want to bend over backwards here to work around
issues with 32-bit kernels when in the long run the intent is that we move
to making these values 64-bit anyway.
Perhaps it would even be wise to just make these values 64-bit already,
ahead of time?
(I did look at this in terms of wanting something like a VMA_NO_BIT so we
could get VM_NONE-like behaviour, but everything I tried generated
code.)
A simple solution that doesn't require changes to the core is to just uglify
userfaultfd_k.h a bit with:
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
#define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT, \
VMA_UFFD_MINOR_BIT)
#else
#define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT)
#endif
But of course that becomes much more horrible with your changes...
Another alternative, which I used for VMA_DROPPABLE, is to add something
like this in mm.h:
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
#define VM_UFFD_MINOR INIT_VM_FLAG(UFFD_MINOR)
+#define VMA_UFFD_MINOR mk_vma_flags(VMA_UFFD_MINOR_BIT)
#else
#define VM_UFFD_MINOR VM_NONE
+#define VMA_UFFD_MINOR EMPTY_VMA_FLAGS
#endif
Then we can do:
#define __VMA_UFFD_FLAGS append_vma_flags(VMA_UFFD_MINOR, VMA_UFFD_MISSING_BIT, \
	VMA_UFFD_WP_BIT)
With your changes in 08/18 on top it'd get hairier, but we could make our
life easier by implementing something like:
static __always_inline vma_flags_t __mk_vma_flags_from_masks(size_t count,
		const vma_flags_t *masks)
{
	vma_flags_t flags = EMPTY_VMA_FLAGS;
	size_t i;
	/* OR each input mask into the accumulated result. */
	for (i = 0; i < count; i++)
		vma_flags_set_mask(&flags, masks[i]);
	return flags;
}
#define mk_vma_flags_from_masks(...) __mk_vma_flags_from_masks( \
	COUNT_ARGS(__VA_ARGS__), (const vma_flags_t []){__VA_ARGS__})
(untested code - you would need to ensure it generated equivalent
constants as VM_xxx would now :)
Then you could get VMA_UFFD_RWP with:
#ifdef CONFIG_USERFAULTFD_RWP
#define VMA_UFFD_RWP mk_vma_flags(VMA_UFFD_RWP_BIT)
#else
#define VMA_UFFD_RWP EMPTY_VMA_FLAGS
#endif
And then:
/* Always available if CONFIG_USERFAULTFD set. */
#define __VMA_UFFD_DEFAULT_FLAGS \
mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT)
#define __VMA_UFFD_FLAGS mk_vma_flags_from_masks(__VMA_UFFD_DEFAULT_FLAGS, \
	VMA_UFFD_MINOR, VMA_UFFD_RWP)
Which is kind of ok-ish, right? I mean it's all a bit fugly obviously.
>
> Bits declared in the enum are now safe to pass to mk_vma_flags()
> regardless of arch.
I mean another issue here is we're then codifying behaviour that's legacy
ahead of time. I really want to avoid that.
>
> Fixes: 9ea35a25d51b ("mm: introduce VMA flags bitmap type")
> Cc: stable@vger.kernel.org
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
> include/linux/mm.h | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0f2612a70fb1..71b11945e4fc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -286,8 +286,17 @@ extern unsigned int kobjsize(const void *objp);
> */
> typedef int __bitwise vma_flag_t;
>
> -#define DECLARE_VMA_BIT(name, bitnum) \
> - VMA_ ## name ## _BIT = ((__force vma_flag_t)bitnum)
> +/*
> + * VMA_NO_BIT means "no bit"; mk_vma_flags() skips it. DECLARE_VMA_BIT()
> + * below uses it for any bit number that doesn't fit in the bitmap, so
> + * callers don't need to track which bits are valid on the current build.
> + */
> +#define VMA_NO_BIT ((__force vma_flag_t)-1)
> +
> +#define DECLARE_VMA_BIT(name, bitnum) \
> + VMA_ ## name ## _BIT = (((bitnum) < NUM_VMA_FLAG_BITS) ? \
> + ((__force vma_flag_t)(bitnum)) : \
> + VMA_NO_BIT)
> #define DECLARE_VMA_BIT_ALIAS(name, aliased) \
> VMA_ ## name ## _BIT = (VMA_ ## aliased ## _BIT)
> enum {
> @@ -1081,6 +1090,8 @@ static __always_inline void vma_flags_set_flag(vma_flags_t *flags,
> {
> unsigned long *bitmap = flags->__vma_flags;
>
> + if ((__force int)bit < 0)
> + return;
> __set_bit((__force int)bit, bitmap);
> }
>
> --
> 2.54.0
>
Either way, I think we should break out any fix like this from the series.
Andrew is, I think, also not a fan of fix patches in the middle of a series
anyway :P
Cheers, Lorenzo
* Re: [PATCH v5 08/18] mm: add VM_UFFD_RWP VMA flag
2026-05-29 13:07 ` Kiryl Shutsemau
@ 2026-05-29 14:00 ` Lorenzo Stoakes
0 siblings, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2026-05-29 14:00 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team
On Fri, May 29, 2026 at 02:07:16PM +0100, Kiryl Shutsemau wrote:
> On Fri, May 29, 2026 at 08:24:55AM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 26, 2026 at 02:04:56PM +0100, Kiryl Shutsemau wrote:
> > > From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
> > >
> > > Preparatory patch for userfaultfd read-write protection (RWP). RWP
> > > extends userfaultfd protection from plain write-protection (WP) to
> > > full read-write protection: accesses to an RWP-protected range --
> > > reads as well as writes -- trap through userfaultfd.
> > >
> > > Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and
> > > userfaultfd_protected() helpers, and wire up the smaps "ur" entry and
> > > the trace-flag table the rest of the series will use. The flag is
> > > gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the
> > > UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and
> > > every downstream check folds to dead code.
> > >
> > > Nothing sets or queries the flag yet.
> > >
> > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > > Assisted-by: Claude:claude-opus-4-6
> >
> > Hm, if you've just used claude to bounce ideas off, I'm really not sure if
> > it's necessary to disclose, though I respect your thoroughness for doing so
> > :)
>
> I've elaborated on how I used Claude in reply to Andrew:
>
> https://lore.kernel.org/all/af5eALk9yO8pPcHv@thinkstation
>
> It is more than bouncing ideas.
Ah interesting! Fair enough then.
>
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 71b11945e4fc..6499cfb61dc4 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -362,6 +362,7 @@ enum {
> > > #endif
> > > DECLARE_VMA_BIT(UFFD_MINOR, 41),
> > > DECLARE_VMA_BIT(SEALED, 42),
> > > + DECLARE_VMA_BIT(UFFD_RWP, 43),
> >
> > I'm guessing CONFIG_USERFAULTFD_RWP is predicated on CONFIG_64BIT?
>
> Yes:
> depends on 64BIT && ARCH_HAS_PTE_PROTNONE && HAVE_ARCH_USERFAULTFD_WP
> >
> > It's a silly situation and once my VMA flags stuff is done it'll be
> > eliminated but for now... :)
>
> Yeah. I actually would appreciate your take on 04/18. It is related.
I'll reply there.
>
> > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > index f4cf5763f92c..0aef628514df 100644
> > > --- a/include/linux/userfaultfd_k.h
> > > +++ b/include/linux/userfaultfd_k.h
> > > @@ -21,10 +21,11 @@
> > > #include <linux/hugetlb_inline.h>
> > >
> > > /* The set of all possible UFFD-related VM flags. */
> > > -#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
> > > +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \
> > > + VM_UFFD_WP | VM_UFFD_RWP)
> > >
> > > #define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT, \
> > > - VMA_UFFD_MINOR_BIT)
> > > + VMA_UFFD_MINOR_BIT, VMA_UFFD_RWP_BIT)
> > >
> > > /*
> > > * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> > > @@ -178,7 +179,7 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > > */
> > > static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> > > {
> > > - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> > > + return vma->vm_flags & (VM_UFFD_MINOR | VM_UFFD_WP | VM_UFFD_RWP);
> >
> > While we're here we might as well switch to using the new API?
> >
> > Can do:
> >
> > return vma_test_any_mask(vma, __VMA_UFFD_FLAGS);
> >
> > One unfortunate thing is using bit values means we can't do the VM_NONE
> > trick, but if !CONFIG_USERFAULTFD_RWP then VMA_UFFD_RWP_BIT wouldn't be set
> > anyway, same for minor so this should be fine?
>
> I think we need to decide first if the 04/18 direction is right.
> We can define VMA_UFFD_RWP_BIT to VMA_NO_BIT if !CONFIG_USERFAULTFD_RWP.
Yeah I don't think that approach is workable unfortunately (see my reply
there).
But I suggest some workarounds there with VMA_UFFD_RWP.
>
> > > }
> > >
> > > /*
> > > @@ -208,6 +209,16 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> > > return vma->vm_flags & VM_UFFD_MINOR;
> > > }
> > >
> > > +static inline bool userfaultfd_rwp(struct vm_area_struct *vma)
> > > +{
> > > + return vma->vm_flags & VM_UFFD_RWP;
> > > +}
> >
> > Can be:
> >
> > return vma_test(vma, VMA_UFFD_RWP_BIT);
>
> Yep.
With the approach suggested in 04/18 we could do this as:
return vma_test_any_mask(vma, VMA_UFFD_RWP);
Which is a bit fugly but will work.
>
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Cheers, Lorenzo
* Re: [PATCH v5 04/18] mm: skip out-of-range bits in mk_vma_flags()
2026-05-29 14:00 ` Lorenzo Stoakes
@ 2026-05-29 16:09 ` Kiryl Shutsemau
0 siblings, 0 replies; 25+ messages in thread
From: Kiryl Shutsemau @ 2026-05-29 16:09 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, rppt, peterx, david, surenb, vbabka, Liam.Howlett, ziy,
corbet, skhan, seanjc, pbonzini, jthoughton, aarcange, sj,
usama.arif, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kvm, kernel-team, stable
On Fri, May 29, 2026 at 03:00:14PM +0100, Lorenzo Stoakes wrote:
> > Add VMA_NO_BIT and have DECLARE_VMA_BIT() resolve any bitnum out
> > of range to it. vma_flags_set_flag() drops negative bit values.
> > The ternary collapses at compile time, the runtime check folds
> > away when the bit is in range, and the common path is unchanged.
>
> Hmm are you sure it does?
You were right - I measured it (gcc 15.2, clang 21.1.8, -O2). The
DECLARE_VMA_BIT() ternary is fine, but the "if (bit < 0)" guard does not
reliably fold: with it, clang stops folding __VMA_UFFD_FLAGS to a constant
and gcc keeps a rolled loop; without it, both fold.
So I've dropped VMA_NO_BIT and gone with your config-gated-mask approach
instead: mk_vma_flags_from_masks() plus VMA_UFFD_{MISSING,WP,MINOR,RWP}
masks that collapse to EMPTY_VMA_FLAGS when unavailable, so no out-of-range
bit ever reaches mk_vma_flags(). __VMA_UFFD_FLAGS now folds to a single
constant on both compilers, 32- and 64-bit. Added your Suggested-by.
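For anyone following along, here is a minimal user-space sketch of the config-gated-mask idea described above. It is not the kernel code: the demo_* names, the DEMO_64BIT switch and the bit numbers 9 and 12 are assumptions made for illustration, and only bits 41 and 43 come from the thread. The point it tries to show is that a feature which is unavailable contributes an empty mask, so its out-of-range bit number is never turned into a shift or a store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model the build: one 32-bit word by default, 64-bit if DEMO_64BIT is set. */
#ifdef DEMO_64BIT
typedef uint64_t demo_word_t;
#else
typedef uint32_t demo_word_t;
#endif

#define DEMO_NUM_BITS ((unsigned int)(sizeof(demo_word_t) * 8))

typedef struct {
	demo_word_t word;
} demo_flags_t;

/* Illustrative bit numbers; only 41 and 43 are taken from the thread. */
enum {
	DEMO_MISSING_BIT = 9,
	DEMO_WP_BIT      = 12,
	DEMO_MINOR_BIT   = 41,
	DEMO_RWP_BIT     = 43,
};

static inline demo_flags_t demo_empty(void)
{
	demo_flags_t f = { 0 };
	return f;
}

/* Single-bit mask. Callers may only pass bits that fit in the bitmap. */
static inline demo_flags_t demo_bit_mask(unsigned int bit)
{
	demo_flags_t f = { 0 };

	assert(bit < DEMO_NUM_BITS);
	f.word = (demo_word_t)1 << bit;
	return f;
}

/* Unavailable features collapse to an empty mask instead of a bit number. */
#ifdef DEMO_64BIT
#define DEMO_MINOR_MASK demo_bit_mask(DEMO_MINOR_BIT)
#define DEMO_RWP_MASK   demo_bit_mask(DEMO_RWP_BIT)
#else
#define DEMO_MINOR_MASK demo_empty()
#define DEMO_RWP_MASK   demo_empty()
#endif

/* OR a list of masks together. */
static inline demo_flags_t demo_from_masks(size_t count,
					   const demo_flags_t *masks)
{
	demo_flags_t out = demo_empty();
	size_t i;

	for (i = 0; i < count; i++)
		out.word |= masks[i].word;
	return out;
}

/* The combined "all UFFD flags" mask for the modelled configuration. */
static inline demo_flags_t demo_uffd_flags(void)
{
	const demo_flags_t masks[] = {
		demo_bit_mask(DEMO_MISSING_BIT),
		demo_bit_mask(DEMO_WP_BIT),
		DEMO_MINOR_MASK,
		DEMO_RWP_MASK,
	};

	return demo_from_masks(sizeof(masks) / sizeof(masks[0]), masks);
}

/* True if any bit of mask is set in flags; an empty mask never matches. */
static inline int demo_test_any_mask(demo_flags_t flags, demo_flags_t mask)
{
	return (flags.word & mask.word) != 0;
}
```

In the default (32-bit) model the combined mask holds only bits 9 and 12, and a test against DEMO_RWP_MASK is always false without any bounds check at the call site. Whether the real helpers fold to a constant is a property of the kernel code and compilers, which this sketch does not demonstrate.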
I also took your "use the new API" hint and added a prep patch converting
the existing userfaultfd_*() helpers to vma_test_any_mask() (Suggested-by
you as well). One deviation: vma_test(vma, VMA_UFFD_RWP_BIT) is itself an
out-of-bounds *read* on 32-bit (test_bit(43, &one_long)), so the helpers
use vma_test_any_mask() with the masks rather than the bit.
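To make the 32-bit arithmetic concrete, here is a small user-space sketch of how a bit number maps onto a one-long bitmap. The demo_* names and the two constants are assumptions that model a 32-bit build; nothing here touches real kernel helpers, and it only computes indices rather than performing the out-of-bounds access.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_BITS_PER_LONG 32u	/* models a 32-bit kernel */
#define DEMO_BITMAP_WORDS  1u	/* the flags bitmap is one long there */

/* Which word of the bitmap a bit number would land in. */
static inline unsigned int demo_word_index(unsigned int bit)
{
	return bit / DEMO_BITS_PER_LONG;
}

/* Position of the bit inside that word. */
static inline unsigned int demo_bit_in_word(unsigned int bit)
{
	return bit % DEMO_BITS_PER_LONG;
}

/* A bit is usable only if its word exists in the bitmap. */
static inline bool demo_bit_in_bounds(unsigned int bit)
{
	return demo_word_index(bit) < DEMO_BITMAP_WORDS;
}

/* The bit numbers discussed in the thread, checked against the model. */
static inline void demo_check_thread_values(void)
{
	/* Bit 41 targets word 1, which does not exist; modulo 32 it is bit 9. */
	assert(!demo_bit_in_bounds(41) && demo_bit_in_word(41) == 9);
	/* Bit 43 also targets word 1; modulo 32 it is bit 11. */
	assert(!demo_bit_in_bounds(43) && demo_bit_in_word(43) == 11);
	/* Low bits stay inside the single word. */
	assert(demo_bit_in_bounds(9));
}
```

Both 41 and 43 select word index 1 of a bitmap that only has word 0, which is why neither setting nor testing those bits directly is safe in this model, while a mask that is empty on such builds sidesteps the question.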
> Either way, I think we should break out any fix like this from the series.
Agreed - the OOB fix and the other pre-existing fixes will go as a separate
series with the RWP work rebased on top.
--
Kiryl Shutsemau / Kirill A. Shutemov