* [PATCH v7 00/15] userfaultfd: working set tracking for VM guest memory
@ 2026-06-29 12:07 Kiryl Shutsemau
From: Kiryl Shutsemau @ 2026-06-29 12:07 UTC
  To: akpm, rppt, peterx, david
  Cc: ljs, surenb, vbabka, Liam.Howlett, ziy, corbet, skhan, seanjc,
	pbonzini, jthoughton, aarcange, sj, usama.arif, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, kvm, kernel-team, kas

From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>

This series adds userfaultfd support for tracking the working set of
VM guest memory, so a VMM can identify hot pages and reclaim cold ones
to tiered or remote storage.

v1: https://lore.kernel.org/all/20260427114607.4068647-1-kas@kernel.org/
v2: https://lore.kernel.org/all/cover.1778254670.git.kas@kernel.org/
v3: https://lore.kernel.org/all/20260522133857.552279-1-kirill@shutemov.name/
v4: https://lore.kernel.org/all/20260525113737.1942478-1-kas@kernel.org/
v5: https://lore.kernel.org/all/20260526130509.2748441-1-kirill@shutemov.name/
v6: https://lore.kernel.org/all/20260529172716.357179-1-kas@kernel.org/

== Changes since v6 ==

  - Rebased onto v7.2-rc1. v6 was stacked on the separate
    "userfaultfd/pagemap: pre-existing fixes" series; those fixes have
    since landed, so v7 applies directly on v7.2-rc1 with no out-of-tree
    dependency. The only rebase adaptation is that the accessor-rename
    patch now also covers two call sites in remove_migration_pmd() that
    appeared in v7.2-rc1.
  - Addressed Lorenzo Stoakes' review: the two VM_UFFD_RWP single-flag
    checks -- userfaultfd_rwp() and gup_can_follow_protnone() -- use
    vma_test_single_mask() instead of vma_test_any_mask().
  - Reworked the working-set documentation and the PAGEMAP_SCAN selftest
    around hot-page detection (PAGE_IS_ACCESSED, non-inverted) rather
    than a cold scan. A cold scan (inverted PAGE_IS_ACCESSED) cannot see
    file pages that are present in the page cache but not mapped, so it
    misreports never-faulted regions of a pre-populated file as cold.
    Tracking the hot set and reclaiming everything else from the backing
    file is the correct model and is now what the docs and tests show.
  - Documented a related file-THP limitation: RWP state is PTE-granular,
    but a file mapping faulted in as a PMD-level THP loses its RWP marks
    when the PMD is split -- split_huge_pmd() clears the file PMD via
    pmdp_huge_clear_flush() without redistributing the uffd marker to the
    PTEs (there is no PMD-level RWP marker), so the range silently reverts
    to untracked. Cold-page tracking over a transparently-huge file
    mapping is therefore unreliable; the selftest opts out of THP
    (MADV_NOHUGEPAGE) on the non-hugetlb backings, and the documented
    hot-set-plus-file-reclaim model sidesteps it. hugetlb is unaffected
    (its mappings are not split by split_huge_pmd()).
  - Collected review tags picked up during v6.

All 113 cases in tools/testing/selftests/mm/uffd-unit-tests pass on
v7.2-rc1 (46 RWP cases plus the existing UFFD groups, no regressions).

== Problem ==

A VMM managing guest memory needs to:

  1. detect which pages are still being touched (working-set
     tracking);
  2. safely reclaim cold pages to slower tiered or remote storage;
  3. fetch them back on demand when accessed again.

== Approach ==

UFFDIO_REGISTER_MODE_RWP is a new userfaultfd registration mode, in
parallel with the existing MODE_MISSING / MODE_WP / MODE_MINOR. It
uses the same mechanism on every backing -- anon, shmem, hugetlbfs:

  - PAGE_NONE on the PTE (the same primitive NUMA balancing uses)
    makes the page inaccessible while keeping it resident;
  - the uffd PTE bit (the one MODE_WP already owns) marks the entry
    as "userfaultfd-tracked" so the protnone fault path can tell an
    RWP fault apart from an mprotect(PROT_NONE) or NUMA hinting
    fault.

VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA, so the
same PTE bit safely carries both meanings depending on the
registered VMA flag.
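
The decision described above can be pictured with a small toy model.
This is only an illustrative sketch, not kernel code: every type and
function name below is hypothetical, and the real fault path has many
more cases. It only shows how one PTE bit can be interpreted through
the VMA's registered mode, and how a protnone fault with the bit set
in an RWP VMA is told apart from other protnone faults.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-VMA registration mode. */
enum vma_uffd_mode { VMA_UFFD_NONE, VMA_UFFD_WP, VMA_UFFD_RWP };

/* What the shared uffd PTE bit means for a given VMA (toy model). */
enum uffd_bit_meaning { BIT_MEANS_NOTHING, BIT_MEANS_WP, BIT_MEANS_RWP };

/* How a fault on a protnone (PAGE_NONE) entry is classified (toy model). */
enum protnone_fault { FAULT_PROTNONE_OTHER, FAULT_UFFD_RWP };

/*
 * WP and RWP are mutually exclusive per VMA, so the VMA mode alone
 * selects the meaning of the bit.
 */
enum uffd_bit_meaning uffd_bit_meaning(enum vma_uffd_mode mode, bool uffd_bit)
{
	if (!uffd_bit)
		return BIT_MEANS_NOTHING;
	if (mode == VMA_UFFD_WP)
		return BIT_MEANS_WP;
	if (mode == VMA_UFFD_RWP)
		return BIT_MEANS_RWP;
	return BIT_MEANS_NOTHING;
}

/*
 * A protnone entry is an RWP fault only when the bit is set and the
 * VMA is registered for RWP. Anything else stands for
 * mprotect(PROT_NONE) or NUMA hinting in this model.
 */
enum protnone_fault classify_protnone_fault(enum vma_uffd_mode mode,
					    bool uffd_bit)
{
	if (uffd_bit_meaning(mode, uffd_bit) == BIT_MEANS_RWP)
		return FAULT_UFFD_RWP;
	return FAULT_PROTNONE_OTHER;
}
```

In this model classify_protnone_fault(VMA_UFFD_RWP, true) is the only
combination that reports an RWP fault.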

In sync mode, the kernel delivers a UFFD_PAGEFAULT_FLAG_RWP message
to the registered handler, and the handler resolves the fault by
calling UFFDIO_RWPROTECT with MODE_RWP cleared. In async mode
(UFFD_FEATURE_RWP_ASYNC), the fault is auto-resolved in-place: the
kernel restores the original PTE permissions and the faulting thread
continues without a userfaultfd message ever being delivered.
Userspace then learns which pages were touched during the cycle by
reading PAGE_IS_ACCESSED out of PAGEMAP_SCAN -- that set is the
working set; everything else is a reclaim candidate.
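
As a rough userspace-side sketch of "everything else is a reclaim
candidate": given the hot ranges reported for a tracked region, the
cold ranges are the gaps between them. The helper below is an
assumption-laden illustration, not part of this series. struct range
and cold_ranges() are hypothetical names, and the code assumes the
hot ranges arrive sorted by address and non-overlapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). Hypothetical local type. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Write the parts of [region_start, region_end) not covered by any
 * hot range into out[]. hot[] must be sorted by start and must not
 * overlap. Returns the number of cold ranges written, or -1 if
 * out[] is too small.
 */
long cold_ranges(uint64_t region_start, uint64_t region_end,
		 const struct range *hot, size_t nr_hot,
		 struct range *out, size_t out_cap)
{
	uint64_t cursor = region_start;	/* first address not yet classified */
	size_t n = 0;

	for (size_t i = 0; i < nr_hot; i++) {
		/* Clamp the hot range to the region. */
		uint64_t hs = hot[i].start < region_start ? region_start : hot[i].start;
		uint64_t he = hot[i].end > region_end ? region_end : hot[i].end;

		if (hs >= he)
			continue;	/* entirely outside the region */

		if (hs > cursor) {
			if (n == out_cap)
				return -1;
			out[n++] = (struct range){ cursor, hs };
		}
		if (he > cursor)
			cursor = he;
	}

	/* Tail of the region after the last hot range. */
	if (cursor < region_end) {
		if (n == out_cap)
			return -1;
		out[n++] = (struct range){ cursor, region_end };
	}
	return (long)n;
}
```

Each resulting range could then be handed to whatever reclaim step the
VMM uses; how the hot ranges are obtained from PAGEMAP_SCAN is left
out of this sketch.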

UFFDIO_RWPROTECT is the protect/unprotect ioctl, mirroring
UFFDIO_WRITEPROTECT.

UFFDIO_SET_MODE flips RWP_ASYNC <-> sync at runtime under
mmap_write_lock() + vma_start_write(), so a VMM can run in async
mode for detection and switch to sync for race-free reclaim without
re-registering the userfaultfd.

== Typical VMM workflow ==

  /* arm */
  UFFDIO_API(features = RWP | RWP_ASYNC)
  UFFDIO_REGISTER(MODE_RWP)

  /* detection cycle (async) */
  UFFDIO_RWPROTECT(range, RWP)
  sleep(interval)

  /* freeze the snapshot before scanning */
  UFFDIO_SET_MODE(disable = RWP_ASYNC)                  /* sync */
  PAGEMAP_SCAN(PAGE_IS_ACCESSED) -> hot pages (working set)

  /* reclaim everything not in the hot set from the backing file */
  fallocate(FALLOC_FL_PUNCH_HOLE, non-hot) /* or pwrite to remote */
  UFFDIO_SET_MODE(enable  = RWP_ASYNC)                  /* resume */

== Series layout ==

Patches 1 to 3 are preparatory:

  1: decouple protnone helpers from CONFIG_NUMA_BALANCING.
  2-3: rename _PAGE_BIT_UFFD_WP, pte_uffd_wp() and friends to drop
       the _WP suffix, since the bit now carries either the WP or the
       RWP meaning depending on the VMA flag. The SCAN_PTE_UFFD enum's ftrace
       output string is intentionally kept as "pte_uffd_wp" so
       trace-based tooling does not silently break.

Patch 4 switches the uffd VMA-flag helpers to the vma_flags_t
accessors (vma_test_*_mask), so the VMA_UFFD_* masks are the single
place that knows which modes the build offers.

Patches 5 to 8 add the in-kernel mechanism:

  5: VM_UFFD_RWP VMA flag (aliased to VM_NONE until patch 9
     introduces CONFIG_USERFAULTFD_RWP together with the UAPI).
  6: MM_CP_UFFD_RWP change_protection() primitive (PAGE_NONE +
     uffd bit, plus a RESOLVE counterpart).
  7: marker preservation across swap, device-exclusive, migration,
     fork, mremap, UFFDIO_MOVE, hugetlb copy, and mprotect().
  8: handle VM_UFFD_RWP in khugepaged, rmap, and GUP.

Patches 9 to 13 wire the userspace surface:

  9:  UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
      (introduces CONFIG_USERFAULTFD_RWP).
  10: RWP fault delivery and exposure of UFFDIO_REGISTER_MODE_RWP.
  11: PAGE_IS_ACCESSED in PAGEMAP_SCAN.
  12: UFFD_FEATURE_RWP_ASYNC for async fault resolution.
  13: UFFDIO_SET_MODE for runtime sync/async toggle.

Patches 14 and 15 are kernel tests and Documentation/. The matching
man-pages series is already upstream.

Kiryl Shutsemau (Meta) (15):
  mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
  mm: rename uffd-wp PTE bit macros to uffd
  mm: rename uffd-wp PTE accessors to uffd
  userfaultfd: test uffd VMA flags through the vma_flags_t API
  mm: add VM_UFFD_RWP VMA flag
  mm: add MM_CP_UFFD_RWP change_protection() flag
  mm: preserve RWP marker across PTE rewrites
  mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
  userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT
    plumbing
  mm/userfaultfd: add RWP fault delivery and expose
    UFFDIO_REGISTER_MODE_RWP
  mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
  userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
  userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
  selftests/mm: add userfaultfd RWP tests
  Documentation/userfaultfd: document RWP working set tracking

 Documentation/admin-guide/mm/pagemap.rst     |  13 +-
 Documentation/admin-guide/mm/userfaultfd.rst | 269 ++++++-
 Documentation/filesystems/proc.rst           |   1 +
 arch/arm64/Kconfig                           |   1 +
 arch/arm64/include/asm/pgtable-prot.h        |   8 +-
 arch/arm64/include/asm/pgtable.h             |  47 +-
 arch/loongarch/Kconfig                       |   1 +
 arch/loongarch/include/asm/pgtable.h         |   4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |   8 +-
 arch/powerpc/platforms/Kconfig.cputype       |   1 +
 arch/riscv/Kconfig                           |   1 +
 arch/riscv/include/asm/pgtable-bits.h        |  12 +-
 arch/riscv/include/asm/pgtable.h             |  59 +-
 arch/s390/Kconfig                            |   1 +
 arch/s390/include/asm/hugetlb.h              |  12 +-
 arch/s390/include/asm/pgtable.h              |   4 +-
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  56 +-
 arch/x86/include/asm/pgtable_types.h         |  16 +-
 fs/proc/task_mmu.c                           |  98 ++-
 include/asm-generic/hugetlb.h                |  18 +-
 include/asm-generic/pgtable_uffd.h           |  32 +-
 include/linux/huge_mm.h                      |   7 +
 include/linux/leafops.h                      |   4 +-
 include/linux/mm.h                           |  65 +-
 include/linux/mm_inline.h                    |   4 +-
 include/linux/pgtable.h                      |  32 +-
 include/linux/swapops.h                      |   4 +-
 include/linux/userfaultfd_k.h                |  89 ++-
 include/trace/events/huge_memory.h           |   2 +-
 include/trace/events/mmflags.h               |   7 +
 include/uapi/linux/fs.h                      |   1 +
 include/uapi/linux/userfaultfd.h             |  54 +-
 init/Kconfig                                 |   8 +
 mm/Kconfig                                   |   9 +
 mm/debug_vm_pgtable.c                        |   4 +-
 mm/huge_memory.c                             | 161 ++--
 mm/hugetlb.c                                 | 158 +++-
 mm/internal.h                                |   4 +-
 mm/khugepaged.c                              |  40 +-
 mm/memory.c                                  | 135 +++-
 mm/migrate.c                                 |  20 +-
 mm/migrate_device.c                          |   8 +-
 mm/mprotect.c                                |  70 +-
 mm/mremap.c                                  |  17 +-
 mm/page_table_check.c                        |   8 +-
 mm/rmap.c                                    |  18 +-
 mm/swapfile.c                                |   9 +-
 mm/userfaultfd.c                             | 387 ++++++++-
 tools/include/uapi/linux/fs.h                |   1 +
 tools/testing/selftests/mm/uffd-unit-tests.c | 781 +++++++++++++++++++
 51 files changed, 2333 insertions(+), 437 deletions(-)


base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
-- 
2.54.0

