The Linux Kernel Mailing List
* [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs
@ 2026-07-03 17:38 Usama Arif
  2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
                   ` (10 more replies)
  0 siblings, 11 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

This is the PMD swap entry core series. The preparatory PMD softleaf
cleanup series [1] has been merged into akpm/mm-new, so this series is
now based directly on akpm/mm-new.

When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
split into HPAGE_PMD_NR PTE-level swap entries via TTU_SPLIT_HUGE_PMD
before unmap.  This series introduces a PMD-level swap entry so the
huge mapping can survive the swap round-trip and do_huge_pmd_swap_page()
can restore the PMD mapping directly on swap-in, without waiting for
khugepaged to collapse the range later.
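
As a rough illustration of the split that happens today, the userspace
sketch below models a swap entry as a (type, offset) pair and expands one
PMD-sized range into per-page entries. The struct, the function name, and
the fixed shift values (4 KiB pages and 2 MiB PMDs, as on x86-64) are
assumptions made for illustration only; the kernel's real swap entry
encoding is architecture-specific and is not shown here.

```c
#include <stddef.h>

/* Assumed geometry: 4 KiB base pages and 2 MiB PMDs (e.g. x86-64). */
#define MODEL_PAGE_SHIFT   12
#define MODEL_PMD_SHIFT    21
#define MODEL_HPAGE_PMD_NR (1UL << (MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT)) /* 512 */

/* Hypothetical stand-in for a swap entry; not the kernel's swp_entry_t. */
struct model_swp_entry {
	unsigned int type;	/* which swap device */
	unsigned long offset;	/* slot index on that device */
};

/*
 * Expand one PMD-level entry, which names the first of MODEL_HPAGE_PMD_NR
 * consecutive slots, into per-page entries. Returns the number of entries
 * written, or 0 if the caller's buffer is too small.
 */
static inline size_t model_split_pmd_entry(struct model_swp_entry first,
					    struct model_swp_entry *out,
					    size_t out_len)
{
	size_t i;

	if (out_len < MODEL_HPAGE_PMD_NR)
		return 0;

	for (i = 0; i < MODEL_HPAGE_PMD_NR; i++) {
		out[i].type = first.type;
		out[i].offset = first.offset + i;
	}
	return MODEL_HPAGE_PMD_NR;
}
```

With this geometry a single PMD-level entry stands in for 512 per-page
entries, which is the page-table state this series tries to avoid
creating at swap-out time.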

The PMD swap entry is a compact page-table encoding for HPAGE_PMD_NR
consecutive swap slots.  swap_map accounting remains per-slot and is
unchanged.  Importantly, a PMD swap entry does not promise that the swap
cache always contains one PMD-sized folio.  While the cache is empty or
contains one PMD-sized folio, PMD-level handling can proceed.  Once the
cache has split/per-slot state, users either inspect the individual
slots directly (mincore) or split the PMD swap entry and retry through
the PTE path (fault, swapoff, MADV_WILLNEED, UFFDIO_MOVE).  Likewise,
if any slot is still backed by zswap's per-page store, PMD-order
swap-in consumers split and let the PTE path load the range page by page;
an all-on-disk range can still be read back as one PMD-sized folio.
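
The rule above can be pictured as a small decision function. This is a
userspace model under stated assumptions, not code from the series: the
enum names, classify_range(), and choose_action() are hypothetical, and
the mapping from state to action is one reading of the paragraph above
for the swap-in style consumers (fault, swapoff, MADV_WILLNEED).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slot swap cache state. */
enum slot_cache { SLOT_EMPTY, SLOT_IN_PMD_FOLIO, SLOT_IN_SMALL_FOLIO };

/* Hypothetical state of the whole range covered by a PMD swap entry. */
enum range_state { RANGE_EMPTY, RANGE_PMD_FOLIO, RANGE_SPLIT };

enum action { ACT_USE_CACHED_FOLIO, ACT_READ_PMD_ORDER, ACT_SPLIT_AND_RETRY };

/* Classify all slots of the range; anything mixed counts as split state. */
static inline enum range_state classify_range(const enum slot_cache *slots,
					       size_t nr)
{
	size_t i, empty = 0, in_pmd = 0;

	for (i = 0; i < nr; i++) {
		if (slots[i] == SLOT_EMPTY)
			empty++;
		else if (slots[i] == SLOT_IN_PMD_FOLIO)
			in_pmd++;
	}
	if (empty == nr)
		return RANGE_EMPTY;
	if (in_pmd == nr)
		return RANGE_PMD_FOLIO;
	return RANGE_SPLIT;
}

/*
 * A cached PMD-sized folio can be used as is. An empty cache can be
 * filled at PMD order only if no slot still has per-page zswap state.
 * Everything else splits the PMD swap entry and retries via PTEs.
 */
static inline enum action choose_action(enum range_state st, bool any_zswap)
{
	switch (st) {
	case RANGE_PMD_FOLIO:
		return ACT_USE_CACHED_FOLIO;
	case RANGE_EMPTY:
		return any_zswap ? ACT_SPLIT_AND_RETRY : ACT_READ_PMD_ORDER;
	default:
		return ACT_SPLIT_AND_RETRY;
	}
}
```

Treating any mixed range as split keeps the model conservative: the
PTE path is always a valid fallback, so a misclassification in that
direction costs performance but not correctness.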

The series is ordered so every consumer can handle PMD swap entries
before the swap-out producer starts installing them.  The swap-out patch
is the last functional change.

Patch breakdown:

  1. mm: add PMD swap entry detection support
     Add pmd_is_swap_entry(), teach the softleaf layer that PMD swap
     entries are valid PMD softleaf entries, and add per-arch
     pmd_swp_*exclusive helpers.

  2. mm: add PMD swap entry splitting support
     Teach __split_huge_pmd_locked() to split a PMD swap entry into
     HPAGE_PMD_NR PTE swap entries.  This is the common fallback path
     whenever PMD-level handling is unsafe or unavailable.

  3. mm: handle PMD swap entries in fork path
     Copy a PMD swap entry in fork via one
     swap_dup_entries_direct(HPAGE_PMD_NR) operation, with the same
     swapoff/mmlist preparation rules used by PTE entries.

  4. mm: zswap: add range lookup for large-folio swapin
     Teach zswap_load() to let all-on-disk large folio reads proceed to
     the backing swap device, and add zswap_range_has_entry() so PMD
     swap-entry consumers can split only when a specific range has
     per-page zswap state.

  5. mm: swap in PMD swap entries as whole THPs during swapoff
     Add swap_pmd_cache_lookup() and use it from swapoff.  Empty cache
     and PMD-sized cache state can be handled at PMD order; split cache
     state, zswap-backed slots, allocation/read failure, or non-uptodate
     folios split the PMD and fall back to unuse_pte_range().

  6. mm: handle PMD swap entries in non-present PMD walkers
     Teach zap, mprotect, soft-dirty, uffd-wp, smaps, mincore,
     mempolicy, khugepaged, HMM, and madvise walkers about PMD swap
     entries.  mincore reports PMD-sized cache state directly and checks
     per-page slots after the cache has split.

  7. mm: handle PMD swap entries in MADV_WILLNEED
     Let MADV_WILLNEED prefetch a PMD swap entry at PMD order when safe,
     treat an already cached PMD-sized folio as complete, and split/retry
     through PTEs for split cache state, zswap-backed slots, or races
     with per-slot cache population.

  8. mm: handle PMD swap entries in UFFDIO_MOVE
     Move PMD swap entries whole when the covered range is empty or
     backed by one PMD-sized folio.  Split/per-slot cache state returns
     -EAGAIN after splitting so retry can use the PTE move path and
     update per-page rmap metadata.

  9. mm: handle PMD swap entry faults on swap-in
     Add do_huge_pmd_swap_page().  It maps a PMD-sized cached folio
     directly, or allocates/reads at PMD order when the cache is empty
     and the range has no zswap entries.  Split cache state, zswap-backed
     slots, and PMD-order resource failures fall back to the PTE path.

 10. mm: install PMD swap entries on swap-out
     Stop forcing TTU_SPLIT_HUGE_PMD for PMD-mappable swapcache folios
     and install one PMD swap entry instead.  Zswap still stores the THP
     as per-page compressed entries; PMD-order swap-in consumers preserve
     a PMD-sized cached folio or read an all-on-disk range as a whole THP,
     and split before reading per-page zswap state.

 11. selftests/mm: add PMD swap entry tests
     Add pmd_swap selftests covering swap-out/in, fork, fork+COW,
     repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
     mincore, MADV_FREE, MADV_WILLNEED, UFFDIO_MOVE, and swapoff.
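
To make the per-slot accounting in patch 3 concrete, here is a
simplified userspace model of duplicating a whole PMD-sized range in one
operation. It is a sketch under assumptions: model_swap_dup_range(), the
byte-per-slot count array, and the 255 ceiling are hypothetical and
ignore details of the kernel's real swap count handling. It is not the
implementation of swap_dup_entries_direct().

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_COUNT_MAX 255	/* assumed ceiling for this model only */

/*
 * Take one extra reference on each of nr consecutive slots starting at
 * first. The operation is all-or-nothing: if any slot is out of bounds,
 * unused, or already at the ceiling, no count is changed.
 */
static inline bool model_swap_dup_range(unsigned char *swap_map,
					 size_t map_len, size_t first,
					 size_t nr)
{
	size_t i;

	if (first > map_len || nr > map_len - first)
		return false;

	/* Validate every slot before modifying any of them. */
	for (i = first; i < first + nr; i++) {
		if (swap_map[i] == 0 || swap_map[i] == MODEL_COUNT_MAX)
			return false;
	}

	for (i = first; i < first + nr; i++)
		swap_map[i]++;

	return true;
}
```

In this picture, fork with a PMD swap entry is one call with nr set to
HPAGE_PMD_NR, and every slot ends up with the same count it would have
had if 512 PTE entries had been copied one by one.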

Notes on zswap:

  Native PMD-order zswap load/store is intentionally left for a follow-up.
  Alexandre Ghiti is currently working on this.
  This series can still preserve PMD swap entries while zswap is enabled:
  zswap stores the THP as order-0 entries, and PMD-order swap-in
  consumers split any range that has zswap entries before reading it.  If
  zswap has written the whole range back to disk, or the swap cache still
  contains one PMD-sized folio, PMD-level handling can proceed.
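
The per-range check can be modeled in userspace as a scan over a
per-slot flag array. This is only a sketch: the flag array and
model_zswap_range_has_entry() are hypothetical stand-ins, and the real
zswap_range_has_entry() added in patch 4 operates on zswap's own data
structures, which are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if any of the nr slots starting at first still has a
 * per-page zswap entry. in_zswap[i] is true while slot i is held in
 * zswap and becomes false once that entry is written back to disk.
 * An out-of-bounds range is reported as "has entry" so that callers
 * take the conservative split path.
 */
static inline bool model_zswap_range_has_entry(const bool *in_zswap,
						size_t map_len, size_t first,
						size_t nr)
{
	size_t i;

	if (first > map_len || nr > map_len - first)
		return true;

	for (i = first; i < first + nr; i++) {
		if (in_zswap[i])
			return true;
	}
	return false;
}
```

A single remaining zswap entry is enough to force the split in this
model, which matches the notes above: PMD-order reads are attempted
only for ranges that are entirely on disk.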

v2 -> v3: https://lore.kernel.org/all/20260602142537.198755-1-usama.arif@linux.dev/
- Clarified the PMD swap entry rule: it is a compact encoding for
  HPAGE_PMD_NR swap slots, not a guarantee that swap cache always has
  one PMD-sized folio. (Lance Yang)
- Swapoff, fault, MADV_WILLNEED, and UFFDIO_MOVE now classify the
  whole PMD swap-cache range and split/retry through the PTE path for
  split/per-slot cache state. (Lance Yang)
- mincore handles PMD swap entries without assuming one lookup covers
  a split swap-cache range. (Lance Yang)
- UFFDIO_MOVE rechecks all HPAGE_PMD_NR slots before moving an empty
  PMD swap-cache range, avoiding stale rmap metadata for per-slot
  cached folios.
- Added a standalone zswap prerequisite patch from Alexandre that
  distinguishes all-on-disk large-folio ranges from ranges with
  per-page zswap entries.
- Replaced the global zswap-ever-enabled policy with per-range zswap
  checks: PMD swap entries can still be installed while zswap is
  enabled, and PMD-order swap-in consumers split when the range has
  per-page zswap state.
- Added a mincore selftest and updated MADV_WILLNEED coverage so the
  test checks that the PMD swap entry remains in place until first
  touch.  Total pmd_swap coverage is now 14 tests.


v1 -> v2: https://lore.kernel.org/all/20260427100553.2754667-1-usama.arif@linux.dev/
- Patch 1: convert two additional softleaf_to_pmd() callers that
  landed in mm-unstable since v1 (mm/debug_vm_pgtable.c,
  mm/migrate_device.c) (Dev)
- Patch 2: rename helper ensure_on_mmlist() to
  mm_prepare_for_swap_entries() to better describe its purpose
  (David)
- Patch 3: drop VM_WARN_ON_ONCE(!pmd_is_migration_entry) as
  Dev posted it as a separate patch.
- Patch 5 (new): move softleaf_to_folio() inside the device-private
  branch in migrate_vma_collect_pmd(); same class of fix as patch 4
  but for the migrate-device PMD walker.
- Patch 6 (new): rename CONFIG_ARCH_ENABLE_THP_MIGRATION to
  CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF so the gate that now drives
  swap-entry support too is named for what it actually controls
  (PMD softleaf entries), not just migration. (Dev)
- Patch 7: add the missing pmd_swp_exclusive / mkexclusive /
  clear_exclusive helpers for powerpc.
- Patches 10 and 14: use upstream swapin_sync() (bundles
  swap_cache_alloc_folio + swap_read_folio + the -EEXIST race
  retry) instead of the bespoke swapin_alloc_pmd_folio() helper
  from v1; do_swap_page and shmem_swapin_folio use the same
  helper (Kairui)
- Patch 10: construct a stack vm_fault for the swapoff swap-in so
  the allocator can resolve a mempolicy, mirroring how the PTE
  swapoff path (unuse_pte_range) already does it.
- Patch 11: extend coverage to check_pmd_state() in khugepaged so a
  swapped-out PMD-mapped THP is treated as SCAN_PMD_MAPPED (matches
  the existing migration-entry handling). Route the
  pmd_trans_huge_lock() branch of mincore_pte_range() through
  mincore_swap() so a swapped-out PMD-mapped THP isn't reported as
  resident.
- Patch 12 (new): handle PMD swap entries in MADV_WILLNEED via
  swapin_sync(BIT(HPAGE_PMD_ORDER)); a naive order-0 read-ahead
  would force the subsequent fault to split.
- Patch 13: refuse UFFDIO_MOVE with -EBUSY if the swap-cache folio
  was split between swap-out and the move, matching
  move_pages_pte()'s rejection of large folios; otherwise only one
  of the 512 anon-rmaps would be re-anchored to dst_vma.
- Patch 16: alloc_fill_swap_thp() now uses the existing
  mmap_pmd_aligned() helper so tests don't flake/skip based on VA
  placement; new MADV_WILLNEED test that watches the PMD-order
  mTHP swpin counter; swapoff test restructured to use the
  kselftest_harness ASSERT cleanup blocks (no double swapoff, no
  verify-after-munmap).
- Collected Acks and Reviews.

[1] https://lore.kernel.org/all/20260630164143.1595669-1-usama.arif@linux.dev/ 
 
Alexandre Ghiti (1):
  mm: zswap: add range lookup for large-folio swapin

Usama Arif (10):
  mm: add PMD swap entry detection support
  mm: add PMD swap entry splitting support
  mm: handle PMD swap entries in fork path
  mm: swap in PMD swap entries as whole THPs during swapoff
  mm: handle PMD swap entries in non-present PMD walkers
  mm: handle PMD swap entries in MADV_WILLNEED
  mm: handle PMD swap entries in UFFDIO_MOVE
  mm: handle PMD swap entry faults on swap-in
  mm: install PMD swap entries on swap-out
  selftests/mm: add PMD swap entry tests

 arch/arm64/include/asm/pgtable.h             |   4 +
 arch/loongarch/include/asm/pgtable.h         |  17 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  15 +
 arch/riscv/include/asm/pgtable.h             |  15 +
 arch/s390/include/asm/pgtable.h              |  15 +
 arch/x86/include/asm/pgtable.h               |  15 +
 fs/proc/task_mmu.c                           |  43 +-
 include/linux/huge_mm.h                      |  11 +
 include/linux/leafops.h                      |  24 +-
 include/linux/pgtable.h                      |  17 +
 include/linux/swap.h                         |   4 +-
 include/linux/vm_event_item.h                |   1 +
 include/linux/zswap.h                        |   7 +
 mm/hmm.c                                     |   3 +-
 mm/huge_memory.c                             | 565 ++++++++++++++-
 mm/internal.h                                |  36 +
 mm/khugepaged.c                              |   6 +
 mm/madvise.c                                 |  89 ++-
 mm/memory.c                                  |  42 +-
 mm/mempolicy.c                               |   2 +
 mm/mincore.c                                 |  45 +-
 mm/rmap.c                                    |  20 +
 mm/swap.h                                    |  17 +
 mm/swap_state.c                              |  44 ++
 mm/swapfile.c                                | 161 ++++-
 mm/vmscan.c                                  |   9 +-
 mm/vmstat.c                                  |   1 +
 mm/zswap.c                                   |  46 +-
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/pmd_swap.c        | 702 +++++++++++++++++++
 30 files changed, 1878 insertions(+), 99 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

-- 
2.53.0-Meta


Thread overview: 14+ messages
2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
2026-07-03 17:38 ` [PATCH v3 02/11] mm: add PMD swap entry splitting support Usama Arif
2026-07-03 17:38 ` [PATCH v3 03/11] mm: handle PMD swap entries in fork path Usama Arif
2026-07-03 17:38 ` [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin Usama Arif
2026-07-03 17:38 ` [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-07-03 17:38 ` [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-07-03 17:38 ` [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
2026-07-03 17:38 ` [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-07-03 17:38 ` [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-07-03 17:38 ` [PATCH v3 10/11] mm: install PMD swap entries on swap-out Usama Arif
2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
2026-07-04  6:27   ` kernel test robot
2026-07-04  8:30   ` kernel test robot
