* [PATCH v3 00/16] userfaultfd: working set tracking for VM guest memory
@ 2026-05-22 13:38 Kiryl Shutsemau
From: Kiryl Shutsemau @ 2026-05-22 13:38 UTC
  To: akpm, rppt, peterx, david
  Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
	pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team,
	linux-man, alx, Kiryl Shutsemau (Meta)

From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>

This series adds userfaultfd support for tracking the working set of
VM guest memory, so a VMM can identify cold pages and evict them to
tiered or remote storage.

v1: https://lore.kernel.org/all/20260427114607.4068647-1-kas@kernel.org/
v2: https://lore.kernel.org/all/cover.1778254670.git.kas@kernel.org/

== Changes since v2 ==

Review feedback from Mike Rapoport and SeongJae Park; tags folded in.

  - 03/16: rename uffd_wp local in copy_hugetlb_page_range() (SJ).
  - 04/16: group mode and protection bits in __VM_UFFD_FLAGS et al;
    move CONFIG_USERFAULTFD_RWP to 08/16 where the UAPI lands.
  - 05/16: fold uffd_wp/uffd_rwp bool pairs into uffd_prot in
    change_huge_pmd(), change_softleaf_pte(), change_present_ptes();
    nit fixes in comments.
  - 08/16: pre-scan rewritten as bool found; CONFIG_USERFAULTFD_RWP
    moved here from 04/16.
  - 09/16: line reflow.
  - 13/16: rewrite uffd_rwp_gup_test() with vmsplice() -- write()
    went through copy_from_user(), not gup_can_follow_protnone();
    plus selftest cleanups.
  - 14/16: documentation wording fixes.

Patches 15-16 are the matching userfaultfd(2) and ioctl_userfaultfd(2)
man-page updates against the linux-man tree. Apply with
"git am --directory=" or by hand in the man-pages repo.

== Problem ==

A VMM managing guest memory needs to:

  1. detect which pages are still being touched (working-set
     tracking);
  2. safely evict cold pages to slower tiered or remote storage;
  3. fetch them back on demand when accessed again.

== Approach ==

UFFDIO_REGISTER_MODE_RWP is a new userfaultfd registration mode, in
parallel with the existing MODE_MISSING / MODE_WP / MODE_MINOR. It
uses the same mechanism on every backing -- anon, shmem, hugetlbfs:

  - PAGE_NONE on the PTE (the same primitive NUMA balancing uses)
    makes the page inaccessible while keeping it resident;
  - the uffd PTE bit (the one MODE_WP already owns) marks the entry
    as "userfaultfd-tracked" so the protnone fault path can tell an
    RWP fault apart from an mprotect(PROT_NONE) or NUMA hinting
    fault.

VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA, so the
same PTE bit safely carries both meanings depending on the
registered VMA flag.
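
The disambiguation described above can be sketched as a small
userspace model. This is only an illustration, not kernel code: the
types, the enum values and classify_protnone_fault() are hypothetical
names invented for this sketch, and the order of the checks is an
assumption rather than something taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel state; not the real types. */
enum vma_uffd_mode { VMA_UFFD_NONE, VMA_UFFD_WP, VMA_UFFD_RWP };

struct vma_model {
	enum vma_uffd_mode mode;	/* WP and RWP are mutually exclusive */
	bool accessible;		/* false after mprotect(PROT_NONE) */
};

struct pte_model {
	bool protnone;			/* PAGE_NONE, page still resident */
	bool uffd;			/* the shared uffd PTE bit */
};

enum fault_kind {
	FAULT_OTHER,			/* not a protnone fault at all */
	FAULT_PROT_NONE,		/* VMA itself is inaccessible */
	FAULT_UFFD_RWP,			/* userfaultfd RWP tracking fault */
	FAULT_NUMA_HINT,		/* NUMA balancing hinting fault */
};

/* Classify a fault on a PTE; the check order is an assumption. */
enum fault_kind classify_protnone_fault(const struct vma_model *vma,
					const struct pte_model *pte)
{
	assert(vma && pte);

	if (!pte->protnone)
		return FAULT_OTHER;
	if (!vma->accessible)
		return FAULT_PROT_NONE;
	/* The uffd bit means "RWP" only when the VMA is registered for RWP. */
	if (vma->mode == VMA_UFFD_RWP && pte->uffd)
		return FAULT_UFFD_RWP;
	return FAULT_NUMA_HINT;
}
```

In this model a protnone PTE with the uffd bit set inside a VM_UFFD_WP
VMA is still classified as a NUMA hinting fault, which is the point of
keying the meaning of the bit off the VMA flag.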

In sync mode, the kernel delivers a message flagged
UFFD_PAGEFAULT_FLAG_RWP to the registered handler, and the handler
resolves the fault by calling UFFDIO_RWPROTECT with MODE_RWP cleared.
In async mode
(UFFD_FEATURE_RWP_ASYNC), the fault is auto-resolved in-place: the
kernel restores the original PTE permissions and the faulting thread
continues without a userfaultfd message ever being delivered.
Userspace then learns which pages were touched by reading
PAGE_IS_ACCESSED out of PAGEMAP_SCAN -- pages whose uffd bit is
still set were not re-accessed since the last RWP cycle.
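
The async detection cycle can be illustrated with a toy model that
runs entirely in userspace. It does not call the new UAPI: struct
rwp_model and the model_*() helpers are hypothetical names, and a
single flag per page stands in for "PAGE_NONE plus the uffd bit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NPAGES 8

/* Toy model: one marker per page, set while the page is RWP-protected. */
struct rwp_model {
	bool rwp_armed[MODEL_NPAGES];
};

/* Stand-in for UFFDIO_RWPROTECT(range, RWP): arm every page in range. */
void model_rwprotect(struct rwp_model *m, size_t first, size_t count)
{
	assert(first + count <= MODEL_NPAGES);
	for (size_t i = first; i < first + count; i++)
		m->rwp_armed[i] = true;
}

/* Stand-in for an access in async mode: the fault resolves in place. */
void model_touch(struct rwp_model *m, size_t page)
{
	assert(page < MODEL_NPAGES);
	m->rwp_armed[page] = false;
}

/* Stand-in for PAGEMAP_SCAN(!PAGE_IS_ACCESSED): list still-armed pages. */
size_t model_scan_cold(const struct rwp_model *m, size_t *cold, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_NPAGES && n < max; i++)
		if (m->rwp_armed[i])
			cold[n++] = i;
	return n;
}

/* One cycle: arm everything, replay the accesses, return the cold count. */
size_t model_run_cycle(const size_t *touched, size_t ntouched,
		       size_t *cold, size_t max)
{
	struct rwp_model m = { { false } };

	model_rwprotect(&m, 0, MODEL_NPAGES);
	for (size_t i = 0; i < ntouched; i++)
		model_touch(&m, touched[i]);
	return model_scan_cold(&m, cold, max);
}
```

With eight pages armed and pages 2 and 5 touched during the interval,
the scan in this model reports the other six pages as cold.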

UFFDIO_RWPROTECT is the protect/unprotect ioctl, mirroring
UFFDIO_WRITEPROTECT.

UFFDIO_SET_MODE flips RWP_ASYNC <-> sync at runtime under
mmap_write_lock(), so a VMM can run in async mode for detection and
switch to sync for race-free eviction without re-registering the
userfaultfd.

== Typical VMM workflow ==

  /* arm */
  UFFDIO_API(features = RWP | RWP_ASYNC)
  UFFDIO_REGISTER(MODE_RWP)

  /* detection cycle */
  UFFDIO_RWPROTECT(range, RWP)
  sleep(interval)
  PAGEMAP_SCAN(!PAGE_IS_ACCESSED) -> cold pages

  /* eviction */
  UFFDIO_SET_MODE(disable = RWP_ASYNC)                  /* sync */
  pwrite(cold) + fallocate(FALLOC_FL_PUNCH_HOLE, cold)  /* races trapped */
  UFFDIO_SET_MODE(enable  = RWP_ASYNC)                  /* resume */
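
The copy-out and punch-hole step of the eviction phase can be
sketched with existing syscalls only. This is a minimal sketch under
assumptions: guest memory is taken to be a memfd-backed MAP_SHARED
mapping, a second memfd stands in for the cold store, and
evict_range() and evict_demo() are hypothetical helpers. The
UFFDIO_SET_MODE calls that would bracket this step are not shown, so
nothing here traps racing accesses.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* memfd_create(), fallocate() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy [off, off + len) of the mapping to store_fd, then drop it from mem_fd. */
int evict_range(int mem_fd, int store_fd, const char *base,
		off_t off, size_t len)
{
	size_t done = 0;

	assert(len > 0);
	while (done < len) {
		ssize_t n = pwrite(store_fd, base + off + done, len - done,
				   off + (off_t)done);
		if (n <= 0)
			return -1;
		done += (size_t)n;
	}
	return fallocate(mem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, (off_t)len);
}

/* Evict the second of four pages; return 0 if the result looks right. */
int evict_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t size = 4 * (size_t)page;
	int mem_fd = memfd_create("guest-mem", 0);
	int store_fd = memfd_create("cold-store", 0);
	unsigned char saved = 0;
	char *mem;
	int ok;

	if (mem_fd < 0 || store_fd < 0 || ftruncate(mem_fd, (off_t)size) != 0)
		return -1;
	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, mem_fd, 0);
	if (mem == MAP_FAILED)
		return -1;
	memset(mem, 0xAB, size);

	if (evict_range(mem_fd, store_fd, mem, page, (size_t)page) != 0)
		return -1;

	/* Data is in the store, the hole reads back as zeroes, and the
	 * neighbouring page is untouched. */
	ok = pread(store_fd, &saved, 1, page) == 1 && saved == 0xAB &&
	     mem[page] == 0 && (unsigned char)mem[0] == 0xAB;
	munmap(mem, size);
	close(mem_fd);
	close(store_fd);
	return ok ? 0 : -1;
}
```

Calling evict_demo() is expected to return 0 on a Linux system with
memfd support. In the workflow above the same two calls would run
between the switch to sync mode and the switch back to async mode.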

== Series layout ==

Patches 1 to 3 are preparatory:

  1: decouple protnone helpers from CONFIG_NUMA_BALANCING.
  2-3: rename _PAGE_BIT_UFFD_WP, pte_uffd_wp() and friends to drop
       the _WP suffix, since the bit now carries WP or RWP meaning
       depending on the VMA flag. The SCAN_PTE_UFFD enum's ftrace
       output string is intentionally kept as "pte_uffd_wp" so
       trace-based tooling does not silently break.

Patches 4 to 7 add the in-kernel mechanism:

  4: VM_UFFD_RWP VMA flag (aliased to VM_NONE until 8/16 introduces
     CONFIG_USERFAULTFD_RWP together with the UAPI).
  5: MM_CP_UFFD_RWP change_protection() primitive (PAGE_NONE +
     uffd bit, plus a RESOLVE counterpart).
  6: marker preservation across swap, device-exclusive, migration,
     fork, mremap, UFFDIO_MOVE, hugetlb copy, and mprotect().
  7: handle VM_UFFD_RWP in khugepaged, rmap, and GUP.

Patches 8 to 12 wire the userspace surface:

   8: UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
      (introduces CONFIG_USERFAULTFD_RWP).
   9: RWP fault delivery and exposure of UFFDIO_REGISTER_MODE_RWP.
  10: PAGE_IS_ACCESSED in PAGEMAP_SCAN.
  11: UFFD_FEATURE_RWP_ASYNC for async fault resolution.
  12: UFFDIO_SET_MODE for runtime sync/async toggle.

Patches 13 and 14 add selftests and Documentation/ text. Patches 15 and
16 update userfaultfd(2) and ioctl_userfaultfd(2) in the linux-man
tree.

Kiryl Shutsemau (Meta) (16):
  mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
  mm: rename uffd-wp PTE bit macros to uffd
  mm: rename uffd-wp PTE accessors to uffd
  mm: add VM_UFFD_RWP VMA flag
  mm: add MM_CP_UFFD_RWP change_protection() flag
  mm: preserve RWP marker across PTE rewrites
  mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
  userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT
    plumbing
  mm/userfaultfd: add RWP fault delivery and expose
    UFFDIO_REGISTER_MODE_RWP
  mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
  userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
  userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
  selftests/mm: add userfaultfd RWP tests
  Documentation/userfaultfd: document RWP working set tracking
  userfaultfd.2: Add read-write protect mode
  ioctl_userfaultfd.2: Add read-write protect mode docs

 Documentation/admin-guide/mm/pagemap.rst     |  13 +-
 Documentation/admin-guide/mm/userfaultfd.rst | 236 +++++-
 Documentation/filesystems/proc.rst           |   1 +
 arch/arm64/Kconfig                           |   1 +
 arch/arm64/include/asm/pgtable-prot.h        |   8 +-
 arch/arm64/include/asm/pgtable.h             |  47 +-
 arch/loongarch/Kconfig                       |   1 +
 arch/loongarch/include/asm/pgtable.h         |   4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |   8 +-
 arch/powerpc/platforms/Kconfig.cputype       |   1 +
 arch/riscv/Kconfig                           |   1 +
 arch/riscv/include/asm/pgtable-bits.h        |  12 +-
 arch/riscv/include/asm/pgtable.h             |  59 +-
 arch/s390/Kconfig                            |   1 +
 arch/s390/include/asm/hugetlb.h              |  12 +-
 arch/s390/include/asm/pgtable.h              |   4 +-
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  56 +-
 arch/x86/include/asm/pgtable_types.h         |  16 +-
 fs/proc/task_mmu.c                           | 108 ++-
 fs/userfaultfd.c                             | 263 ++++++-
 include/asm-generic/hugetlb.h                |  18 +-
 include/asm-generic/pgtable_uffd.h           |  32 +-
 include/linux/huge_mm.h                      |   7 +
 include/linux/leafops.h                      |   4 +-
 include/linux/mm.h                           |  46 +-
 include/linux/mm_inline.h                    |   4 +-
 include/linux/pgtable.h                      |  32 +-
 include/linux/swapops.h                      |   4 +-
 include/linux/userfaultfd_k.h                |  78 +-
 include/trace/events/huge_memory.h           |   2 +-
 include/trace/events/mmflags.h               |   7 +
 include/uapi/linux/fs.h                      |   1 +
 include/uapi/linux/userfaultfd.h             |  54 +-
 init/Kconfig                                 |   8 +
 mm/Kconfig                                   |   9 +
 mm/debug_vm_pgtable.c                        |   4 +-
 mm/huge_memory.c                             | 155 +++-
 mm/hugetlb.c                                 | 146 +++-
 mm/internal.h                                |   4 +-
 mm/khugepaged.c                              |  38 +-
 mm/memory.c                                  | 123 ++-
 mm/migrate.c                                 |  20 +-
 mm/migrate_device.c                          |   8 +-
 mm/mprotect.c                                |  68 +-
 mm/mremap.c                                  |  17 +-
 mm/page_table_check.c                        |   8 +-
 mm/rmap.c                                    |  18 +-
 mm/swapfile.c                                |   9 +-
 mm/userfaultfd.c                             | 112 ++-
 tools/include/uapi/linux/fs.h                |   1 +
 tools/testing/selftests/mm/uffd-unit-tests.c | 766 +++++++++++++++++++
 man2/ioctl_userfaultfd.2                     | 209 ++++++++++++++++++++++++++++++++++++++-
 man2/userfaultfd.2                           | 147 ++++++++++++++++++++++++++-
 54 files changed, 2586 insertions(+), 426 deletions(-)

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.51.2

