[PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs

* [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs
@ 2026-07-03 17:38 Usama Arif
  2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
                   ` (10 more replies)
  0 siblings, 11 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

This is the PMD swap entry core series. The preparatory PMD softleaf [1]
cleanup series has been merged in akpm/mm-new, so this series is now
based directly on akpm/mm-new.

When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
split into HPAGE_PMD_NR PTE-level swap entries via TTU_SPLIT_HUGE_PMD
before unmap.  This series introduces a PMD-level swap entry so the
huge mapping can survive the swap round-trip and do_huge_pmd_swap_page()
can restore the PMD mapping directly on swap-in, without waiting for
khugepaged to collapse the range later.

The PMD swap entry is a compact page-table encoding for HPAGE_PMD_NR
consecutive swap slots.  swap_map accounting remains per-slot and is
unchanged.  Importantly, a PMD swap entry does not promise that the swap
cache always contains one PMD-sized folio.  While the cache is empty or
contains one PMD-sized folio, PMD-level handling can proceed.  Once the
cache has split/per-slot state, users either inspect the individual
slots directly (mincore) or split the PMD swap entry and retry through
the PTE path (fault, swapoff, MADV_WILLNEED, UFFDIO_MOVE).  Likewise,
if any slot is still backed by zswap's per-page store, PMD-order
swap-in consumers split and let the PTE path load the range page by page;
an all-on-disk range can still be read back as one PMD-sized folio.
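
The rule above can be sketched as a small decision function.  This is a
user-space model for illustration only, not code from the series: the
enums and classify_pmd_swap_range() are hypothetical names, and the
sketch covers the swap-in style consumers (fault, swapoff,
MADV_WILLNEED), not mincore, which inspects slots directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the swap cache state under one PMD swap entry. */
enum cache_state {
	CACHE_EMPTY,		/* no swap cache folio covers any slot */
	CACHE_PMD_FOLIO,	/* one PMD-sized folio covers all slots */
	CACHE_SPLIT,		/* split or per-slot folios are present */
};

/* Hypothetical outcomes for a PMD-order swap-in consumer. */
enum pmd_swap_action {
	ACT_MAP_CACHED_FOLIO,	/* reuse the PMD-sized folio in the cache */
	ACT_READ_PMD_ORDER,	/* allocate and read one PMD-sized folio */
	ACT_SPLIT_AND_RETRY,	/* split to PTE swap entries, retry per page */
};

static enum pmd_swap_action
classify_pmd_swap_range(enum cache_state cache, bool range_has_zswap_entry)
{
	assert(cache == CACHE_EMPTY || cache == CACHE_PMD_FOLIO ||
	       cache == CACHE_SPLIT);

	/* A PMD-sized cached folio can be used as is. */
	if (cache == CACHE_PMD_FOLIO)
		return ACT_MAP_CACHED_FOLIO;

	/* Per-slot cache state or per-page zswap state forces a split. */
	if (cache == CACHE_SPLIT || range_has_zswap_entry)
		return ACT_SPLIT_AND_RETRY;

	/* Empty cache and an all-on-disk range: read the whole THP. */
	return ACT_READ_PMD_ORDER;
}
```

PMD-order allocation or read failures are not modelled here; per the
patch breakdown below they also end up on the PTE path.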

The series is ordered so every consumer can handle PMD swap entries
before the swap-out producer starts installing them.  The swap-out patch
is the last functional change.

Patch breakdown:

  1. mm: add PMD swap entry detection support
     Add pmd_is_swap_entry(), teach the softleaf layer that PMD swap
     entries are valid PMD softleaf entries, and add per-arch
     pmd_swp_*exclusive helpers.

  2. mm: add PMD swap entry splitting support
     Teach __split_huge_pmd_locked() to split a PMD swap entry into
     HPAGE_PMD_NR PTE swap entries.  This is the common fallback path
     whenever PMD-level handling is unsafe or unavailable.

  3. mm: handle PMD swap entries in fork path
     Copy a PMD swap entry in fork via one
     swap_dup_entries_direct(HPAGE_PMD_NR) operation, with the same
     swapoff/mmlist preparation rules used by PTE entries.

  4. mm: zswap: add range lookup for large-folio swapin
     Teach zswap_load() to let all-on-disk large folio reads proceed to
     the backing swap device, and add zswap_range_has_entry() so PMD
     swap-entry consumers can split only when a specific range has
     per-page zswap state.

  5. mm: swap in PMD swap entries as whole THPs during swapoff
     Add swap_pmd_cache_lookup() and use it from swapoff.  Empty cache
     and PMD-sized cache state can be handled at PMD order; split cache
     state, zswap-backed slots, allocation/read failure, or non-uptodate
     folios split the PMD and fall back to unuse_pte_range().

  6. mm: handle PMD swap entries in non-present PMD walkers
     Teach zap, mprotect, soft-dirty, uffd-wp, smaps, mincore,
     mempolicy, khugepaged, HMM, and madvise walkers about PMD swap
     entries.  mincore reports PMD-sized cache state directly and checks
     per-page slots after the cache has split.

  7. mm: handle PMD swap entries in MADV_WILLNEED
     Let MADV_WILLNEED prefetch a PMD swap entry at PMD order when safe,
     treat an already cached PMD-sized folio as complete, and split/retry
     through PTEs for split cache state, zswap-backed slots, or races
     with per-slot cache population.

  8. mm: handle PMD swap entries in UFFDIO_MOVE
     Move PMD swap entries whole when the covered range is empty or
     backed by one PMD-sized folio.  Split/per-slot cache state returns
     -EAGAIN after splitting so retry can use the PTE move path and
     update per-page rmap metadata.

  9. mm: handle PMD swap entry faults on swap-in
     Add do_huge_pmd_swap_page().  It maps a PMD-sized cached folio
     directly, or allocates/reads at PMD order when the cache is empty
     and the range has no zswap entries.  Split cache state, zswap-backed
     slots, and PMD-order resource failures fall back to the PTE path.

 10. mm: install PMD swap entries on swap-out
     Stop forcing TTU_SPLIT_HUGE_PMD for PMD-mappable swapcache folios
     and install one PMD swap entry instead.  Zswap still stores the THP
     as per-page compressed entries; PMD-order swap-in consumers preserve
     a PMD-sized cached folio or read an all-on-disk range as a whole THP,
     and split before reading per-page zswap state.

 11. selftests/mm: add PMD swap entry tests
     Add pmd_swap selftests covering swap-out/in, fork, fork+COW,
     repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
     mincore, MADV_FREE, MADV_WILLNEED, UFFDIO_MOVE, and swapoff.

Notes on zswap:

  Native PMD-order zswap load/store is intentionally left for a follow-up.
  Alexandre Ghiti is currently working on this.
  This series can still preserve PMD swap entries while zswap is enabled:
  zswap stores the THP as order-0 entries, and PMD-order swap-in
  consumers split any range that has zswap entries before reading it.  If
  zswap has written the whole range back to disk, or the swap cache still
  contains one PMD-sized folio, PMD-level handling can proceed.

v2 -> v3: https://lore.kernel.org/all/20260602142537.198755-1-usama.arif@linux.dev/
- Clarified the PMD swap entry rule: it is a compact encoding for
  HPAGE_PMD_NR swap slots, not a guarantee that swap cache always has
  one PMD-sized folio. (Lance Yang)
- Swapoff, fault, MADV_WILLNEED, and UFFDIO_MOVE now classify the
  whole PMD swap-cache range and split/retry through the PTE path for
  split/per-slot cache state. (Lance Yang)
- mincore handles PMD swap entries without assuming one lookup covers
  a split swap-cache range. (Lance Yang)
- UFFDIO_MOVE rechecks all HPAGE_PMD_NR slots before moving an empty
  PMD swap-cache range, avoiding stale rmap metadata for per-slot
  cached folios.
- Added a standalone zswap prerequisite patch from Alexandre that
  distinguishes all-on-disk large-folio ranges from ranges with
  per-page zswap entries.
- Replaced the global zswap-ever-enabled policy with per-range zswap
  checks: PMD swap entries can still be installed while zswap is
  enabled, and PMD-order swap-in consumers split when the range has
  per-page zswap state.
- Added a mincore selftest and updated MADV_WILLNEED coverage so the
  test checks that the PMD swap entry remains in place until first
  touch.  Total pmd_swap coverage is now 14 tests.


v1 -> v2: https://lore.kernel.org/all/20260427100553.2754667-1-usama.arif@linux.dev/
- Patch 1: convert two additional softleaf_to_pmd() callers that
  landed in mm-unstable since v1 (mm/debug_vm_pgtable.c,
  mm/migrate_device.c) (Dev)
- Patch 2: rename helper ensure_on_mmlist() to
  mm_prepare_for_swap_entries() to better describe its purpose
  (David)
- Patch 3: drop VM_WARN_ON_ONCE(!pmd_is_migration_entry) as
  Dev posted it as a separate patch.
- Patch 5 (new): move softleaf_to_folio() inside the device-private
  branch in migrate_vma_collect_pmd(); same class of fix as patch 4
  but for the migrate-device PMD walker.
- Patch 6 (new): rename CONFIG_ARCH_ENABLE_THP_MIGRATION to
  CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF so the gate that now drives
  swap-entry support too is named for what it actually controls
  (PMD softleaf entries), not just migration. (Dev)
- Patch 7: add the missing pmd_swp_exclusive / mkexclusive /
  clear_exclusive helpers for powerpc.
- Patches 10 and 14: use upstream swapin_sync() (bundles
  swap_cache_alloc_folio + swap_read_folio + the -EEXIST race
  retry) instead of the bespoke swapin_alloc_pmd_folio() helper
  from v1; do_swap_page and shmem_swapin_folio use the same
  helper (Kairui)
- Patch 10: construct a stack vm_fault for the swapoff swap-in so
  the allocator can resolve a mempolicy, mirroring how the PTE
  swapoff path (unuse_pte_range) already does it.
- Patch 11: extend coverage to check_pmd_state() in khugepaged so a
  swapped-out PMD-mapped THP is treated as SCAN_PMD_MAPPED (matches
  the existing migration-entry handling). Route the
  pmd_trans_huge_lock() branch of mincore_pte_range() through
  mincore_swap() so a swapped-out PMD-mapped THP isn't reported as
  resident.
- Patch 12 (new): handle PMD swap entries in MADV_WILLNEED via
  swapin_sync(BIT(HPAGE_PMD_ORDER)); a naive order-0 read-ahead
  would force the subsequent fault to split.
- Patch 13: refuse UFFDIO_MOVE with -EBUSY if the swap-cache folio
  was split between swap-out and the move, matching
  move_pages_pte()'s rejection of large folios; otherwise only one
  of the 512 anon-rmaps would be re-anchored to dst_vma.
- Patch 16: alloc_fill_swap_thp() now uses the existing
  mmap_pmd_aligned() helper so tests don't flake/skip based on VA
  placement; new MADV_WILLNEED test that watches the PMD-order
  mTHP swpin counter; swapoff test restructured to use the
  kselftest_harness ASSERT cleanup blocks (no double swapoff, no
  verify-after-munmap).
- Collected Acks and Reviews.

[1] https://lore.kernel.org/all/20260630164143.1595669-1-usama.arif@linux.dev/

Alexandre Ghiti (1):
  mm: zswap: add range lookup for large-folio swapin

Usama Arif (10):
  mm: add PMD swap entry detection support
  mm: add PMD swap entry splitting support
  mm: handle PMD swap entries in fork path
  mm: swap in PMD swap entries as whole THPs during swapoff
  mm: handle PMD swap entries in non-present PMD walkers
  mm: handle PMD swap entries in MADV_WILLNEED
  mm: handle PMD swap entries in UFFDIO_MOVE
  mm: handle PMD swap entry faults on swap-in
  mm: install PMD swap entries on swap-out
  selftests/mm: add PMD swap entry tests

 arch/arm64/include/asm/pgtable.h             |   4 +
 arch/loongarch/include/asm/pgtable.h         |  17 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  15 +
 arch/riscv/include/asm/pgtable.h             |  15 +
 arch/s390/include/asm/pgtable.h              |  15 +
 arch/x86/include/asm/pgtable.h               |  15 +
 fs/proc/task_mmu.c                           |  43 +-
 include/linux/huge_mm.h                      |  11 +
 include/linux/leafops.h                      |  24 +-
 include/linux/pgtable.h                      |  17 +
 include/linux/swap.h                         |   4 +-
 include/linux/vm_event_item.h                |   1 +
 include/linux/zswap.h                        |   7 +
 mm/hmm.c                                     |   3 +-
 mm/huge_memory.c                             | 565 ++++++++++++++-
 mm/internal.h                                |  36 +
 mm/khugepaged.c                              |   6 +
 mm/madvise.c                                 |  89 ++-
 mm/memory.c                                  |  42 +-
 mm/mempolicy.c                               |   2 +
 mm/mincore.c                                 |  45 +-
 mm/rmap.c                                    |  20 +
 mm/swap.h                                    |  17 +
 mm/swap_state.c                              |  44 ++
 mm/swapfile.c                                | 161 ++++-
 mm/vmscan.c                                  |   9 +-
 mm/vmstat.c                                  |   1 +
 mm/zswap.c                                   |  46 +-
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/pmd_swap.c        | 702 +++++++++++++++++++
 30 files changed, 1878 insertions(+), 99 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

-- 
2.53.0-Meta


* [PATCH v3 01/11] mm: add PMD swap entry detection support
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 02/11] mm: add PMD swap entry splitting support Usama Arif
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Currently when a PMD-mapped THP is swapped out, the PMD is always split
into 512 PTE-level swap entries.  To preserve huge page information
across swap cycles, later patches will install a single PMD-level swap
entry instead.  This patch adds the infrastructure to detect those
entries.

Teach the softleaf layer to recognise PMD swap entries:
pmd_is_swap_entry() detects them and softleaf_is_valid_pmd_entry()
accepts them as a valid non-present type.  Clear the exclusive overlay
bit in softleaf_from_pmd() before decoding, matching how soft_dirty and
uffd_wp bits are already stripped.

Add pmd_swp_mkexclusive(), pmd_swp_exclusive(), and
pmd_swp_clear_exclusive() helpers to each architecture that supports PMD
softleaf entries (x86, arm64, s390, riscv, loongarch, powerpc),
mirroring the existing PTE swap exclusive helpers in each arch's
pgtable.h.  Provide generic no-op PMD swap exclusive fallbacks for
architectures without PMD softleaf support, matching the generic PMD
swap soft-dirty fallbacks.
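
As a rough illustration of the overlay-bit pattern described above, here
is a user-space model.  Everything in it is an assumption made for the
sketch: the model_* names, the model_pmd_t type, and the bit positions
are hypothetical and do not match any architecture's real layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a non-present PMD value. */
typedef struct { uint64_t val; } model_pmd_t;

/* Hypothetical overlay bits; real positions are per-architecture. */
#define MODEL_SWP_SOFT_DIRTY	(UINT64_C(1) << 1)
#define MODEL_SWP_EXCLUSIVE	(UINT64_C(1) << 2)
#define MODEL_SWP_OVERLAY_MASK	(MODEL_SWP_SOFT_DIRTY | MODEL_SWP_EXCLUSIVE)

static inline model_pmd_t model_pmd_swp_mkexclusive(model_pmd_t pmd)
{
	pmd.val |= MODEL_SWP_EXCLUSIVE;
	return pmd;
}

static inline bool model_pmd_swp_exclusive(model_pmd_t pmd)
{
	return (pmd.val & MODEL_SWP_EXCLUSIVE) != 0;
}

static inline model_pmd_t model_pmd_swp_clear_exclusive(model_pmd_t pmd)
{
	pmd.val &= ~MODEL_SWP_EXCLUSIVE;
	return pmd;
}

/*
 * Strip the overlay bits before decoding, so that the remaining payload
 * (swap type and offset in the real encoding) is not polluted by flags.
 */
static inline uint64_t model_softleaf_from_pmd(model_pmd_t pmd)
{
	uint64_t payload = pmd.val & ~MODEL_SWP_OVERLAY_MASK;

	assert((payload & MODEL_SWP_OVERLAY_MASK) == 0);
	return payload;
}
```

The point of the model is only that the exclusive flag rides on top of
the encoded entry, so the decode step must clear it first in the same
way it already clears soft_dirty and uffd_wp.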

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/arm64/include/asm/pgtable.h             |  4 ++++
 arch/loongarch/include/asm/pgtable.h         | 17 ++++++++++++++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 15 ++++++++++++
 arch/riscv/include/asm/pgtable.h             | 15 ++++++++++++
 arch/s390/include/asm/pgtable.h              | 15 ++++++++++++
 arch/x86/include/asm/pgtable.h               | 15 ++++++++++++
 include/linux/leafops.h                      | 24 ++++++++++++++++----
 include/linux/pgtable.h                      | 17 ++++++++++++++
 8 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 984badfa9a74..0de4a2917fe2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -601,6 +601,10 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_swp_clear_uffd_wp(pmd) \
 				pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+#define pmd_swp_exclusive(pmd)	pte_swp_exclusive(pmd_pte(pmd))
+#define pmd_swp_mkexclusive(pmd)	pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)))
+#define pmd_swp_clear_exclusive(pmd) \
+				pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)))
 
 #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
 
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 223528c04d73..a63567eb4e3b 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -357,6 +357,23 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return pte;
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	pmd_val(pmd) |= _PAGE_SWP_EXCLUSIVE;
+	return pmd;
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	pmd_val(pmd) &= ~_PAGE_SWP_EXCLUSIVE;
+	return pmd;
+}
+
 #define pte_none(pte)		(!(pte_val(pte) & ~_PAGE_GLOBAL))
 #define pte_present(pte)	(pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_no_exec(pte)	(pte_val(pte) & _PAGE_NO_EXEC)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6f30aa8a6490..e8467ea4f4de 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -699,6 +699,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE));
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return __pmd_raw(pmd_raw(pmd) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return __pmd_raw(pmd_raw(pmd) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE));
+}
+
 static inline bool check_pte_access(unsigned long access, unsigned long ptev)
 {
 	/*
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 2aa529e882d3..9918b3f51efd 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -930,6 +930,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return pte_swp_exclusive(pmd_pte(pmd));
+}
+
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
+}
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline bool pmd_soft_dirty(pmd_t pmd)
 {
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 6faccfa63b09..e9d29439d817 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -870,6 +870,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return set_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return clear_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return pte_val(pte) & _PAGE_SOFT_DIRTY;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e0fd318d4004..2dd7a5c590a9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1529,6 +1529,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pmd_swp_exclusive(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 88888daeb018..988e59c6fa8a 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -102,6 +102,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
 		pmd = pmd_swp_clear_soft_dirty(pmd);
 	if (pmd_swp_uffd_wp(pmd))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
+	if (pmd_swp_exclusive(pmd))
+		pmd = pmd_swp_clear_exclusive(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 
 	/* Temporary until swp_entry_t eliminated. */
@@ -634,18 +636,30 @@ static inline bool pmd_is_migration_entry(pmd_t pmd)
  */
 static inline bool softleaf_is_valid_pmd_entry(softleaf_t entry)
 {
-	/* Only device private, migration entries valid for PMD. */
+	/* Device private, migration, and swap entries valid for PMD. */
 	return softleaf_is_device_private(entry) ||
-		softleaf_is_migration(entry);
+		softleaf_is_migration(entry) ||
+		softleaf_is_swap(entry);
+}
+
+/**
+ * pmd_is_swap_entry() - Does this PMD entry encode an actual swap entry?
+ * @pmd: PMD entry.
+ *
+ * Returns: true if the PMD encodes a swap entry, otherwise false.
+ */
+static inline bool pmd_is_swap_entry(pmd_t pmd)
+{
+	return softleaf_is_swap(softleaf_from_pmd(pmd));
 }
 
 /**
  * pmd_is_valid_softleaf() - Is this PMD entry a valid softleaf entry?
  * @pmd: PMD entry.
  *
- * PMD leaf entries are valid only if they are device private or migration
- * entries. This function asserts that a PMD leaf entry is valid in this
- * respect.
+ * PMD leaf entries are valid only if they are device private, migration,
+ * or swap entries. This function asserts that a PMD leaf entry is valid
+ * in this respect.
  *
  * Returns: true if the PMD entry is a valid leaf entry, otherwise false.
  */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e38f069c1c91..0b985ce6673e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1917,6 +1917,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 }
 #endif
 
+#ifndef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return false;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef __HAVE_PFNMAP_TRACKING
 /*
  * Interfaces that can be used by architecture code to keep track of
-- 
2.53.0-Meta


* [PATCH v3 02/11] mm: add PMD swap entry splitting support
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
  2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 03/11] mm: handle PMD swap entries in fork path Usama Arif
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Add a swap branch in __split_huge_pmd_locked() that splits a PMD swap
entry into 512 PTE swap entries.  Unlike migration splits, no folio
reference is needed because swap entries point to swap slots, not
pages.  Each PTE inherits the correct sub-slot offset and preserves
soft_dirty, uffd_wp, and exclusive flags.
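
A minimal user-space sketch of that per-slot derivation follows.  It is
a model under stated assumptions, not the kernel code: struct
model_swap_leaf and model_split_pmd_swap_entry() are hypothetical names,
and MODEL_HPAGE_PMD_NR assumes 4 KiB base pages with 2 MiB PMDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HPAGE_PMD_NR 512	/* assumption: 4 KiB pages, 2 MiB PMD */

/* Hypothetical decoded view of one swap entry plus its overlay flags. */
struct model_swap_leaf {
	unsigned int type;	/* swap device index */
	unsigned long offset;	/* slot offset within that device */
	bool soft_dirty;
	bool uffd_wp;
	bool exclusive;
};

/*
 * Fill ptes[0..nr-1] from one PMD-level entry: slot i gets the same
 * type and flags, with the offset advanced by i.
 */
static void model_split_pmd_swap_entry(struct model_swap_leaf pmd_entry,
					struct model_swap_leaf *ptes,
					size_t nr)
{
	assert(ptes != NULL);

	for (size_t i = 0; i < nr; i++) {
		ptes[i] = pmd_entry;	/* inherit type and all flags */
		ptes[i].offset = pmd_entry.offset + i;
	}
}
```

No folio or page is touched anywhere in the loop, which is the reason
the swap branch needs no folio reference.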

This branch is reached from the explicit __split_huge_pmd() callers
that hit a non-present PMD: partial-range mprotect / munmap, the
wp_huge_pmd() PMD-COW fallback, and the swap-in / swapoff fallbacks
added in later patches when the cached folio is no longer PMD-sized.
page_vma_mapped_walk() does not iterate PMD swap entries, so
try_to_unmap_one() and try_to_migrate_one() do not reach this branch
and freeze=true cannot occur in this branch today.  page and folio
are therefore left uninitialized in the swap branch; a
VM_WARN_ON_ONCE(freeze) catches any future caller that breaks this
invariant before the freeze path dereferences page_to_pfn(page + i)
or put_page(page).

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bdd8635922f9..201193ce0373 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3143,6 +3143,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
 						 vma, haddr, rmap_flags);
 		}
+	} else if (pmd_is_swap_entry(*pmd)) {
+		VM_WARN_ON_ONCE(freeze);
+		old_pmd = *pmd;
+		soft_dirty = pmd_swp_soft_dirty(old_pmd);
+		uffd_wp = pmd_swp_uffd_wp(old_pmd);
+		anon_exclusive = pmd_swp_exclusive(old_pmd);
 	} else {
 		/*
 		 * Up to this point the pmd is present and huge and userland has
@@ -3279,6 +3285,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
+	} else if (pmd_is_swap_entry(old_pmd)) {
+		softleaf_t sl_entry = softleaf_from_pmd(old_pmd);
+		pte_t swp_pte;
+		swp_entry_t sub_entry;
+
+		for (i = 0, addr = haddr; i < HPAGE_PMD_NR;
+		     i++, addr += PAGE_SIZE) {
+			sub_entry = swp_entry(swp_type(sl_entry),
+					      swp_offset(sl_entry) + i);
+			swp_pte = swp_entry_to_pte(sub_entry);
+			if (soft_dirty)
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (uffd_wp)
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			if (anon_exclusive)
+				swp_pte = pte_swp_mkexclusive(swp_pte);
+			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+			set_pte_at(mm, addr, pte + i, swp_pte);
+		}
 	} else {
 		pte_t entry;
 
@@ -3302,7 +3327,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	pte_unmap(pte);
 
-	if (!pmd_is_migration_entry(*pmd))
+	if (!pmd_is_migration_entry(*pmd) && !pmd_is_swap_entry(*pmd))
 		folio_remove_rmap_pmd(folio, page, vma);
 	if (freeze)
 		put_page(page);
-- 
2.53.0-Meta


* [PATCH v3 03/11] mm: handle PMD swap entries in fork path
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
  2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
  2026-07-03 17:38 ` [PATCH v3 02/11] mm: add PMD swap entry splitting support Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin Usama Arif
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Teach copy_huge_pmd()/copy_huge_non_present_pmd() about swap entries,
mirroring copy_nonpresent_pte().

swap_dup_entry_direct() gains a nr parameter (and is renamed to
swap_dup_entries_direct()) so it can duplicate a contiguous range of
swap slots in one call, matching the existing
swap_put_entries_direct(entry, nr) API.  Existing callers pass 1.

copy_huge_non_present_pmd() "copies" PMD swap entries during fork
instead of splitting, preserving the THP.  This mirrors
copy_nonpresent_pte() which duplicates the swap slot refcount,
clears the exclusive bit on the source, and adds the destination
mm to mmlist.  If swap_dup_entries_direct() fails (GFP_ATOMIC table
alloc), copy_huge_pmd() retries after swap_retry_table_alloc() with
GFP_KERNEL, matching the PTE retry in copy_pte_range().  The PMD is
stable across the retry because dup_mmap() holds write mmap_lock on
both mm_structs.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/swap.h |  4 ++--
 mm/huge_memory.c     | 53 ++++++++++++++++++++++++++++++++++++++------
 mm/memory.c          |  2 +-
 mm/swapfile.c        |  7 +++---
 4 files changed, 53 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8d19be675baf..0b1db19e6ae3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -451,7 +451,7 @@ sector_t swap_folio_sector(struct folio *folio);
  * All entries must be allocated by folio_alloc_swap(). And they must have
  * a swap count > 1. See comments of folio_*_swap helpers for more info.
  */
-int swap_dup_entry_direct(swp_entry_t entry);
+int swap_dup_entries_direct(swp_entry_t entry, int nr);
 void swap_put_entries_direct(swp_entry_t entry, int nr);
 
 /*
@@ -495,7 +495,7 @@ static inline void free_swap_cache(struct folio *folio)
 {
 }
 
-static inline int swap_dup_entry_direct(swp_entry_t ent)
+static inline int swap_dup_entries_direct(swp_entry_t ent, int nr)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 201193ce0373..69e4e09ac1f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1805,7 +1805,7 @@ bool touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return false;
 }
 
-static void copy_huge_non_present_pmd(
+static int copy_huge_non_present_pmd(
 		struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
@@ -1851,14 +1851,35 @@ static void copy_huge_non_present_pmd(
 		 */
 		folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
 					    dst_vma, src_vma);
+	} else if (softleaf_is_swap(entry)) {
+		int err;
+
+		/*
+		 * PMD swap entry: duplicate swap references and clear
+		 * exclusive on source, matching copy_nonpresent_pte().
+		 */
+		err = swap_dup_entries_direct(entry, HPAGE_PMD_NR);
+		if (err < 0)
+			return err;
+
+		mm_prepare_for_swap_entries(dst_mm);
+
+		if (pmd_swp_exclusive(pmd)) {
+			pmd = pmd_swp_clear_exclusive(pmd);
+			set_pmd_at(src_mm, addr, src_pmd, pmd);
+		}
 	}
 
-	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	if (softleaf_is_swap(entry))
+		add_mm_counter(dst_mm, MM_SWAPENTS, HPAGE_PMD_NR);
+	else
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	return 0;
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -1899,6 +1920,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(!pgtable))
 		goto out;
 
+retry:
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -1906,11 +1928,28 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
-	if (unlikely(thp_migration_supported() &&
-		     pmd_is_valid_softleaf(pmd))) {
-		copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
-					  dst_vma, src_vma, pmd, pgtable);
-		ret = 0;
+	if (unlikely(pmd_is_valid_softleaf(pmd))) {
+		ret = copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+						addr, dst_vma, src_vma, pmd,
+						pgtable);
+		if (ret) {
+			spin_unlock(src_ptl);
+			spin_unlock(dst_ptl);
+			/*
+			 * For PMD swap entries -ENOMEM means the per-cluster
+			 * swap-extend table couldn't be GFP_ATOMIC-allocated.
+			 * Try the GFP_KERNEL fallback once before giving up.
+			 */
+			if (ret == -ENOMEM) {
+				softleaf_t entry = softleaf_from_pmd(pmd);
+
+				if (softleaf_is_swap(entry) &&
+				    !swap_retry_table_alloc(entry, GFP_KERNEL))
+					goto retry;
+			}
+			pte_free(dst_mm, pgtable);
+			goto out;
+		}
 		goto out_unlock;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index 6637c5b13c9b..e0819a562187 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -950,7 +950,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	struct page *page;
 
 	if (likely(softleaf_is_swap(entry))) {
-		if (swap_dup_entry_direct(entry) < 0)
+		if (swap_dup_entries_direct(entry, 1) < 0)
 			return -EIO;
 
 		mm_prepare_for_swap_entries(dst_mm);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5a69716b2052..0695dbd1a8b1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3899,8 +3899,9 @@ void si_swapinfo(struct sysinfo *val)
 }
 
 /*
- * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
+ * swap_dup_entries_direct() - Increase reference count of swap entries by one.
  * @entry: first swap entry from which we want to increase the refcount.
+ * @nr: number of contiguous swap entries to duplicate.
  *
  * Returns 0 for success, or -ENOMEM if the extend table is required
  * but could not be atomically allocated.  Returns -EINVAL if the swap
@@ -3912,7 +3913,7 @@ void si_swapinfo(struct sysinfo *val)
  * Also the swap entry must have a count >= 1. Otherwise folio_dup_swap should
  * be used.
  */
-int swap_dup_entry_direct(swp_entry_t entry)
+int swap_dup_entries_direct(swp_entry_t entry, int nr)
 {
 	struct swap_info_struct *si;
 
@@ -3929,7 +3930,7 @@ int swap_dup_entry_direct(swp_entry_t entry)
 	 */
 	VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry));
 
-	return swap_dup_entries_cluster(si, swp_offset(entry), 1);
+	return swap_dup_entries_cluster(si, swp_offset(entry), nr);
 }
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-- 
2.53.0-Meta


* [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (2 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 03/11] mm: handle PMD swap entries in fork path Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Alexandre Ghiti,
	Usama Arif

From: Alexandre Ghiti <alexghiti@fb.com>

A large folio reaches zswap_load() only when the caller expects the
whole range to be on disk. Zswap still stores large folios as
independent order-0 entries, so reconstructing a large folio from
zswap entries would risk returning partially initialized data.

Teach zswap_load() to scan the covered range. If no slot is in zswap,
return -ENOENT so swap_read_folio() reads the backing device. If any
slot is still in zswap, fail the large-folio read so the caller can
fall back to per-page swapin.

Add zswap_range_has_entry() so PMD swap-entry consumers can make the
same range decision before attempting PMD-order swapin.
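
As an illustration only, the range decision above can be modelled in
userspace roughly as follows. The in_zswap[] array stands in for the
zswap tree, and model_range_has_entry() / model_large_load() are
hypothetical names for this sketch, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the zswap tree: in_zswap[i] is true when
 * slot i currently has a zswap entry.
 */
static bool model_range_has_entry(const bool *in_zswap, size_t first,
				  size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (in_zswap[first + i])
			return true;
	return false;
}

/*
 * Model of the large-folio branch only:
 *   -ENOENT: no slot is in zswap, the caller reads the backing device
 *   -EIO:    some slot is still in zswap, the large-folio read fails
 */
static int model_large_load(const bool *in_zswap, size_t first, size_t nr)
{
	assert(nr > 1);	/* order-0 loads take a different path */

	if (model_range_has_entry(in_zswap, first, nr))
		return -EIO;
	return -ENOENT;
}
```

A caller that sees -EIO here is expected to fall back to per-page
swapin rather than retry the large read.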

Signed-off-by: Alexandre Ghiti <alexghiti@fb.com>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/zswap.h |  7 +++++++
 mm/zswap.c            | 46 +++++++++++++++++++++++++++++++++----------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..de10aa528597 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+bool zswap_range_has_entry(swp_entry_t entry, unsigned int nr);
 #else
 
 struct zswap_lruvec_state {};
@@ -69,6 +70,12 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline bool zswap_range_has_entry(swp_entry_t entry,
+					 unsigned int nr)
+{
+	return false;
+}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index b5a17ea20237..89dd88a5223f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1559,6 +1559,27 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
 
+/**
+ * zswap_range_has_entry() - is any slot in [entry, entry + nr) in zswap?
+ * @entry: base swap entry of the range
+ * @nr: number of contiguous slots to check
+ */
+bool zswap_range_has_entry(swp_entry_t entry, unsigned int nr)
+{
+	pgoff_t offset = swp_offset(entry);
+	XA_STATE(xas, swap_zswap_tree(entry), offset);
+	bool found;
+
+	if (!nr || zswap_never_enabled())
+		return false;
+
+	rcu_read_lock();
+	found = !!xas_find(&xas, offset + nr - 1);
+	rcu_read_unlock();
+
+	return found;
+}
+
 /**
  * zswap_load() - load a folio from zswap
  * @folio: folio to load
@@ -1571,10 +1592,9 @@ bool zswap_store(struct folio *folio)
  *  NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
  *  will SIGBUS).
  *
- *  -EINVAL: if the swapped out content was in zswap, but the page belongs
- *  to a large folio, which is not supported by zswap. The folio is unlocked,
- *  but NOT marked up-to-date, so that an IO error is emitted (e.g.
- *  do_swap_page() will SIGBUS).
+ *  -EIO: if a slot in a large-folio range is unexpectedly still in zswap.
+ *  The folio is unlocked, but NOT marked up-to-date, so that an IO
+ *  error is emitted (e.g. do_swap_page() will SIGBUS).
  *
  *  -ENOENT: if the swapped out content was not in zswap. The folio remains
  *  locked on return.
@@ -1593,13 +1613,19 @@ int zswap_load(struct folio *folio)
 		return -ENOENT;
 
 	/*
-	 * Large folios should not be swapped in while zswap is being used, as
-	 * they are not properly handled. Zswap does not properly load large
-	 * folios, and a large folio may only be partially in zswap.
+	 * A large folio reaches zswap_load() only when its whole range is
+	 * expected to be on disk: PMD swap-entry consumers split before
+	 * calling into PMD-order swapin whenever any slot is still in zswap.
+	 * Confirm the range is entirely absent from zswap and return -ENOENT
+	 * so the caller reads it from disk; if a slot is unexpectedly still in
+	 * zswap, fail the read rather than return partially-initialized data.
 	 */
-	if (WARN_ON_ONCE(folio_test_large(folio))) {
-		folio_unlock(folio);
-		return -EINVAL;
+	if (folio_test_large(folio)) {
+		if (zswap_range_has_entry(swp, folio_nr_pages(folio))) {
+			folio_unlock(folio);
+			return -EIO;
+		}
+		return -ENOENT;
 	}
 
 	entry = xa_load(tree, offset);
-- 
2.53.0-Meta


* [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (3 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Add swap_pmd_cache_lookup() to classify the swap cache behind a PMD
swap entry as empty, backed by one PMD-sized folio, or requiring
per-page handling because at least one covered slot has a smaller folio
in the swap cache.  PMD swap entries are handled at PMD granularity only
while the covered cache range is empty or backed by a PMD-sized folio;
a split cache forces the entry to be split and retried through the PTE
path.
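
The three-way classification can be sketched in plain C as below. This
is a simplified userspace model, not the kernel code: cached_nr[] and
the model_* names are assumptions made for the example, with
cached_nr[i] holding the page count of the folio cached at slot i (0
if none).

```c
#include <assert.h>
#include <stddef.h>

enum model_pmd_cache {
	MODEL_CACHE_EMPTY,
	MODEL_CACHE_HUGE,
	MODEL_CACHE_SPLIT,
};

/*
 * cached_nr[i] is the page count of the folio cached at slot i, or 0 if
 * the slot has no swap-cache folio. pmd_nr plays the role of HPAGE_PMD_NR.
 */
static enum model_pmd_cache model_cache_lookup(const unsigned int *cached_nr,
					       size_t pmd_nr)
{
	assert(pmd_nr > 0);

	/* The first slot decides between "huge" and "already split". */
	if (cached_nr[0] == pmd_nr)
		return MODEL_CACHE_HUGE;
	if (cached_nr[0])
		return MODEL_CACHE_SPLIT;

	/* First slot empty: any cached folio further in means split state. */
	for (size_t i = 1; i < pmd_nr; i++)
		if (cached_nr[i])
			return MODEL_CACHE_SPLIT;

	return MODEL_CACHE_EMPTY;
}
```

Only the EMPTY and HUGE results allow PMD-level handling; SPLIT sends
the caller to the PTE path.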

Add unuse_pmd() and its wrapper unuse_pmd_entry(), and call the latter
from unuse_pmd_range() to swap in PMD-level swap entries as whole THPs
during swapoff.  This mirrors the existing unuse_pte_range() but
operates at PMD granularity.

The PMD swap entry is split into PTE-level entries via
__split_huge_pmd() if the PMD-order folio cannot be allocated, if any
covered slot is still in zswap, if the swap cache already contains
per-page folios in the covered range (e.g. split in the swap cache by
deferred_split_scan() or memory_failure() while the PMD swap entry was
installed), or if the folio is not uptodate.  A non-zero error is then
returned so unuse_pmd_range() falls through to unuse_pte_range(), which
handles the individual entries at order-0.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/swap.h       |  17 ++++++
 mm/swap_state.c |  44 ++++++++++++++
 mm/swapfile.c   | 154 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 215 insertions(+)

diff --git a/mm/swap.h b/mm/swap.h
index 44ab8e1e595b..17c2c57e0da4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -303,6 +303,23 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 bool swap_cache_has_folio(swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
+enum swap_pmd_cache {
+	SWAP_PMD_CACHE_EMPTY,
+	SWAP_PMD_CACHE_HUGE,
+	SWAP_PMD_CACHE_SPLIT,
+};
+
+#ifdef CONFIG_THP_SWAP
+enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+					  struct folio **foliop);
+#else
+static inline enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+							struct folio **foliop)
+{
+	*foliop = NULL;
+	return SWAP_PMD_CACHE_EMPTY;
+}
+#endif
 void swap_cache_del_folio(struct folio *folio);
 struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
 				     unsigned long orders, struct vm_fault *vmf,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6fd6e3415b71..9b9ca82ace4b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -118,6 +118,50 @@ bool swap_cache_has_folio(swp_entry_t entry)
 	return swp_tb_is_folio(swp_tb);
 }
 
+#ifdef CONFIG_THP_SWAP
+/**
+ * swap_pmd_cache_lookup - classify the swap cache behind a PMD swap entry
+ * @entry: first swap slot encoded by the PMD swap entry
+ * @foliop: returned PMD-sized folio, with a reference, if present
+ *
+ * A PMD swap entry is a compact page-table encoding for HPAGE_PMD_NR
+ * consecutive swap slots. The swap cache behind those slots can be empty,
+ * one PMD-sized folio, or per-slot folios after the original folio was split.
+ *
+ * Context: Caller must keep @entry valid using the usual swap cache rules.
+ * Return: SWAP_PMD_CACHE_EMPTY if no slot in the PMD range has a cached folio,
+ * SWAP_PMD_CACHE_HUGE if one PMD-sized folio covers the range, or
+ * SWAP_PMD_CACHE_SPLIT if the range needs per-page handling.
+ */
+enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+					  struct folio **foliop)
+{
+	unsigned int type = swp_type(entry);
+	pgoff_t offset = swp_offset(entry);
+	struct folio *folio;
+	int i;
+
+	*foliop = NULL;
+
+	folio = swap_cache_get_folio(entry);
+	if (folio) {
+		if (folio_nr_pages(folio) == HPAGE_PMD_NR) {
+			*foliop = folio;
+			return SWAP_PMD_CACHE_HUGE;
+		}
+		folio_put(folio);
+		return SWAP_PMD_CACHE_SPLIT;
+	}
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		if (swap_cache_has_folio(swp_entry(type, offset + i)))
+			return SWAP_PMD_CACHE_SPLIT;
+	}
+
+	return SWAP_PMD_CACHE_EMPTY;
+}
+#endif
+
 /**
  * swap_cache_get_shadow - Looks up a shadow in the swap cache.
  * @entry: swap entry used for the lookup.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0695dbd1a8b1..664956da60c8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
 #include <linux/suspend.h>
 #include <linux/zswap.h>
 #include <linux/plist.h>
+#include <linux/huge_mm.h>
 
 #include <asm/tlbflush.h>
 #include <linux/leafops.h>
@@ -2641,6 +2642,147 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return 0;
 }
 
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio.  Returns 0 on success
+ * (PMD was mapped).  Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse will
+ * rescan).  Returns -EIO if the folio is not uptodate; in that case the
+ * PMD is split so unuse_pte_range() can handle individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, softleaf_t entry,
+		     struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pmd_t new_pmd, old_pmd;
+	spinlock_t *ptl;
+	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive;
+
+	if (unlikely(!folio_matches_swap_entry(folio, entry)))
+		return -EAGAIN;
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		__split_huge_pmd(vma, pmd, addr, false);
+		return -EIO;
+	}
+
+	page = folio_page(folio, 0);
+
+	ptl = pmd_lock(mm, pmd);
+	old_pmd = pmdp_get(pmd);
+
+	if (!pmd_is_swap_entry(old_pmd) ||
+	    softleaf_from_pmd(old_pmd).val != entry.val) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	exclusive = pmd_swp_exclusive(old_pmd);
+
+	/*
+	 * Some architectures may have to restore extra metadata to the folio
+	 * when reading from swap. This metadata may be indexed by swap entry
+	 * so this must be called before folio_put_swap().
+	 */
+	arch_swap_restore(folio_swap(entry, folio), folio);
+
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	new_pmd = pmd_mkold(new_pmd);
+	if (pmd_swp_soft_dirty(old_pmd))
+		new_pmd = pmd_mksoft_dirty(new_pmd);
+	if (pmd_swp_uffd_wp(old_pmd))
+		new_pmd = pmd_mkuffd_wp(new_pmd);
+
+	if (exclusive)
+		rmap_flags |= RMAP_EXCLUSIVE;
+
+	folio_get(folio);
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+	set_pmd_at(mm, addr, pmd, new_pmd);
+	folio_put_swap(folio, NULL);
+
+	spin_unlock(ptl);
+
+	folio_free_swap(folio);
+	return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * If the swap cache no longer has one PMD-sized folio, zswap may require
+ * per-page loading, or a PMD-order allocation/read fails, split the PMD so
+ * the caller can fall back to unuse_pte_range(). Otherwise propagates the
+ * error from unuse_pmd().
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+			   unsigned long addr, softleaf_t entry)
+{
+	struct folio *folio;
+	enum swap_pmd_cache cache_state;
+	int ret;
+
+	cache_state = swap_pmd_cache_lookup(entry, &folio);
+	if (cache_state == SWAP_PMD_CACHE_SPLIT) {
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	if (!folio) {
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
+		if (zswap_range_has_entry(entry, HPAGE_PMD_NR)) {
+			ret = -EAGAIN;
+			goto split_fallback;
+		}
+
+		folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+				    BIT(HPAGE_PMD_ORDER), &vmf, NULL, 0);
+		if (IS_ERR_OR_NULL(folio)) {
+			ret = folio ? PTR_ERR(folio) : -ENOMEM;
+			goto split_fallback;
+		}
+	}
+
+	folio_lock(folio);
+	folio_wait_writeback(folio);
+	/*
+	 * If the cached folio is no longer PMD-sized (e.g. split in the
+	 * swap cache by deferred_split_scan() or memory_failure() while
+	 * the PMD swap entry was installed), the PMD swap entry no longer
+	 * maps a single contiguous folio.  Split the PMD swap entry so
+	 * unuse_pte_range() can swap the per-slot folios in individually.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	ret = unuse_pmd(vma, pmd, addr, entry, folio);
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, pmd, addr, false);
+	return ret;
+}
+
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				unsigned int type)
@@ -2653,6 +2795,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+
+		pmd_t pmdval = pmdp_get(pmd);
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t sl = softleaf_from_pmd(pmdval);
+
+			if (swp_type(sl) == type) {
+				if (!unuse_pmd_entry(vma, pmd, addr, sl))
+					continue;
+			}
+		}
+
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
 			return ret;
-- 
2.53.0-Meta


* [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (4 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Teach the remaining non-present PMD walkers about swap entries,
mirroring the PTE-level equivalents.

smaps_pmd_entry() accounts swap and swap_pss via a new shared
smaps_account_swap() helper used by both PTE and PMD paths.
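
The shared accounting reduces to the arithmetic below. This is a
userspace sketch under stated assumptions: MODEL_PSS_SHIFT is assumed
to match PSS_SHIFT in fs/proc/task_mmu.c, and struct model_stats and
model_account_swap() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PSS_SHIFT 12	/* assumption: mirrors the kernel's PSS_SHIFT */

struct model_stats {
	uint64_t swap;		/* bytes swapped out */
	uint64_t swap_pss;	/* proportional share, scaled by the shift */
};

/*
 * Account one swap entry covering "size" bytes. When the entry is shared
 * by two or more mappings, the proportional share is divided by the
 * number of mappings; otherwise the full size is charged.
 */
static void model_account_swap(struct model_stats *mss, int mapcount,
			       uint64_t size)
{
	uint64_t pss_delta = size << MODEL_PSS_SHIFT;

	assert(mss != NULL);

	mss->swap += size;
	if (mapcount >= 2)
		pss_delta /= (uint64_t)mapcount;
	mss->swap_pss += pss_delta;
}
```

The PTE path would pass PAGE_SIZE and the PMD path HPAGE_PMD_SIZE, so
a PMD swap entry is charged in one call instead of one call per page.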

move_soft_dirty_pmd(), clear_soft_dirty_pmd(), make_uffd_wp_pmd(),
pagemap_pmd_range_thp(), and change_huge_pmd() handle swap entries
alongside migration entries.

hmm_vma_handle_absent_pmd() faults in PMD swap entries via
hmm_vma_fault() instead of returning -EFAULT. The first per-page
handle_mm_fault() call triggers do_huge_pmd_swap_page(), which maps
the entire folio; subsequent calls reduce to a harmless
huge_pmd_set_accessed(), and the walker retries with a present PMD.

madvise_free_huge_pmd() handles PMD swap entries directly: for a
full-range MADV_FREE it clears the PMD, frees the deposited page
table, and releases the swap slots; for a partial range it splits to
PTE swap entries. Without this, MADV_FREE silently becomes a no-op
on swapped-out THPs, leaking swap slots.
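
The full-range versus partial-range choice can be pictured with the
small model below. It is a sketch, not the kernel logic: the enum and
model_madv_free_pmd_swap() are hypothetical, and pmd_size stands in
for HPAGE_PMD_SIZE.

```c
#include <assert.h>

enum model_free_action {
	MODEL_FREE_WHOLE_PMD,		/* clear PMD, release all slots */
	MODEL_FREE_SPLIT_TO_PTES,	/* split, let the PTE path continue */
};

/*
 * [addr, next) is the part of the MADV_FREE range that falls inside one
 * PMD. Only a range covering the whole PMD may drop the entry directly.
 */
static enum model_free_action
model_madv_free_pmd_swap(unsigned long addr, unsigned long next,
			 unsigned long pmd_size)
{
	assert(next > addr && next - addr <= pmd_size);

	if (next - addr == pmd_size)
		return MODEL_FREE_WHOLE_PMD;
	return MODEL_FREE_SPLIT_TO_PTES;
}
```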

zap_huge_pmd() frees swap slots via swap_put_entries_direct(),
matching zap_nonpresent_ptes().

change_non_present_huge_pmd() skips write-permission changes for swap
entries and only updates uffd_wp, matching change_softleaf_pte().

madvise_cold_or_pageout_pte_range() skips PMD swap entries early.
MADV_COLD and MADV_PAGEOUT operate on resident folios, so a swapped-out
THP has nothing to deactivate or reclaim; skipping also prevents the
walker from descending into or splitting the PMD swap entry. The locked
THP path also treats a racing PMD swap entry as handled before checking
for other non-present PMD types.

mincore_pte_range() routes PMD swap entries found under
pmd_trans_huge_lock() through a new mincore_pmd_swap() helper, matching
how the PTE path already calls mincore_swap() for non-present PTEs.
Without this a swapped-out PMD-mapped THP would be reported as
resident, because pmd_is_huge() (and therefore pmd_trans_huge_lock())
accepts any non-present non-none PMD and the old branch unconditionally
did memset(vec, 1, nr). Other non-present PMDs (migration /
device-private entries) keep the prior memset(vec, 1, nr) behavior,
while swap entries are checked for swap-cache residency.
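
How the residency vector is filled for a PMD swap entry can be
sketched as follows. This is a simplified userspace model: the model_*
names and the slot_resident[] array are assumptions for the example,
and the per-slot check is reduced to a boolean where the kernel
consults the swap cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum model_pmd_cache {
	MODEL_CACHE_EMPTY,
	MODEL_CACHE_HUGE,
	MODEL_CACHE_SPLIT,
};

/*
 * Fill vec[0..nr) for the pages starting "start" slots into the PMD.
 * slot_resident[] is only consulted in the split case.
 */
static void model_mincore_pmd_swap(enum model_pmd_cache state,
				   bool huge_uptodate,
				   const bool *slot_resident,
				   size_t start, size_t nr,
				   unsigned char *vec)
{
	switch (state) {
	case MODEL_CACHE_HUGE:
		/* One PMD-sized folio: every page shares its state. */
		memset(vec, huge_uptodate ? 1 : 0, nr);
		break;
	case MODEL_CACHE_EMPTY:
		/* Nothing cached: the whole range is non-resident. */
		memset(vec, 0, nr);
		break;
	case MODEL_CACHE_SPLIT:
		/* Per-slot state: report each covered slot individually. */
		assert(slot_resident != NULL);
		for (size_t i = 0; i < nr; i++)
			vec[i] = slot_resident[start + i] ? 1 : 0;
		break;
	}
}
```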

queue_folios_pmd() in mempolicy silently skips swap entries, matching
the PTE walker which only counts migration entries as failures.
Without this, mbind(MPOL_MF_STRICT) would spuriously return -EIO on
a swapped-out THP.

check_pmd_state() in khugepaged returns SCAN_PMD_MAPPED for PMD swap
entries, treating a swapped-out THP as still being a THP from
khugepaged's perspective and matching the existing migration-entry
handling.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 fs/proc/task_mmu.c | 43 +++++++++++++++++++++-------------
 mm/hmm.c           |  3 ++-
 mm/huge_memory.c   | 58 +++++++++++++++++++++++++++++++++++-----------
 mm/khugepaged.c    |  6 +++++
 mm/madvise.c       | 14 ++++++++++-
 mm/mempolicy.c     |  2 ++
 mm/mincore.c       | 45 ++++++++++++++++++++++++++++++++++-
 7 files changed, 139 insertions(+), 32 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1fb5acd88ad0..f85899eec80f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1046,6 +1046,23 @@ static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
 #endif
 }
 
+static void smaps_account_swap(struct mem_size_stats *mss,
+		softleaf_t entry, unsigned long size)
+{
+	int mapcount;
+
+	mss->swap += size;
+	mapcount = swp_swapcount(entry);
+	if (mapcount >= 2) {
+		u64 pss_delta = (u64)size << PSS_SHIFT;
+
+		do_div(pss_delta, mapcount);
+		mss->swap_pss += pss_delta;
+	} else {
+		mss->swap_pss += (u64)size << PSS_SHIFT;
+	}
+}
+
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		struct mm_walk *walk)
 {
@@ -1067,18 +1084,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		const softleaf_t entry = softleaf_from_pte(ptent);
 
 		if (softleaf_is_swap(entry)) {
-			int mapcount;
-
-			mss->swap += PAGE_SIZE;
-			mapcount = swp_swapcount(entry);
-			if (mapcount >= 2) {
-				u64 pss_delta = (u64)PAGE_SIZE << PSS_SHIFT;
-
-				do_div(pss_delta, mapcount);
-				mss->swap_pss += pss_delta;
-			} else {
-				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
-			}
+			smaps_account_swap(mss, entry, PAGE_SIZE);
 		} else if (softleaf_has_pfn(entry)) {
 			if (softleaf_is_device_private(entry))
 				present = true;
@@ -1108,9 +1114,13 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (pmd_present(*pmd)) {
 		page = vm_normal_page_pmd(vma, addr, *pmd);
 		present = true;
-	} else if (unlikely(thp_migration_supported())) {
+	} else {
 		const softleaf_t entry = softleaf_from_pmd(*pmd);
 
+		if (softleaf_is_swap(entry)) {
+			smaps_account_swap(mss, entry, HPAGE_PMD_SIZE);
+			return;
+		}
 		if (softleaf_has_pfn(entry))
 			page = softleaf_to_page(entry);
 	}
@@ -1752,7 +1762,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		pmd = pmd_clear_soft_dirty(pmd);
 
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-	} else if (pmd_is_migration_entry(pmd)) {
+	} else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
 		pmd = pmd_swp_clear_soft_dirty(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
@@ -2112,7 +2122,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
 			flags |= PM_UFFD_WP;
 		if (pm->show_pfn)
 			frame = pmd_pfn(pmd) + idx;
-	} else if (thp_migration_supported()) {
+	} else if (pmd_is_swap_entry(pmd) ||
+		   (thp_migration_supported() && pmd_is_migration_entry(pmd))) {
 		const softleaf_t entry = softleaf_from_pmd(pmd);
 		unsigned long offset;
 
@@ -2550,7 +2561,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
 		old = pmdp_invalidate_ad(vma, addr, pmdp);
 		pmd = pmd_mkuffd_wp(old);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-	} else if (pmd_is_migration_entry(pmd)) {
+	} else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
 		pmd = pmd_swp_mkuffd_wp(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
diff --git a/mm/hmm.c b/mm/hmm.c
index 4f3f627d2b47..c5356910c580 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -370,7 +370,8 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
 	required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
 					      npages, 0);
 	if (required_fault) {
-		if (softleaf_is_device_private(entry))
+		if (softleaf_is_device_private(entry) ||
+		    softleaf_is_swap(entry))
 			return hmm_vma_fault(addr, end, required_fault, walk);
 		else
 			return -EFAULT;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 69e4e09ac1f6..4cbd6123bf18 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2312,6 +2312,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+	pgtable_t pgtable;
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pte_free(mm, pgtable);
+	mm_dec_nr_ptes(mm);
+}
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -2336,8 +2344,23 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		goto out;
 
 	if (unlikely(!pmd_present(orig_pmd))) {
+		if (pmd_is_swap_entry(orig_pmd)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				__split_huge_pmd(vma, pmd, addr, false);
+				goto out_unlocked;
+			}
+			softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+			pmdp_huge_get_and_clear(mm, addr, pmd);
+			zap_deposited_table(mm, pmd);
+			spin_unlock(ptl);
+			swap_put_entries_direct(sl, HPAGE_PMD_NR);
+			add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+			return true;
+		}
 		VM_BUG_ON(thp_migration_supported() &&
-				  !pmd_is_migration_entry(orig_pmd));
+			  !pmd_is_migration_entry(orig_pmd));
 		goto out;
 	}
 
@@ -2386,15 +2409,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
-	pgtable_t pgtable;
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pte_free(mm, pgtable);
-	mm_dec_nr_ptes(mm);
-}
-
 static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
 		pmd_t pmdval, struct folio *folio, bool is_present)
 {
@@ -2487,6 +2501,16 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	arch_check_zapped_pmd(vma, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 
+	if (pmd_is_swap_entry(orig_pmd)) {
+		softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+		zap_deposited_table(mm, pmd);
+		spin_unlock(ptl);
+		swap_put_entries_direct(sl, HPAGE_PMD_NR);
+		add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+		return true;
+	}
+
 	is_present = pmd_present(orig_pmd);
 	folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
 	has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
@@ -2519,7 +2543,8 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 static pmd_t move_soft_dirty_pmd(pmd_t pmd)
 {
 	if (pgtable_supports_soft_dirty()) {
-		if (unlikely(pmd_is_migration_entry(pmd)))
+		if (unlikely(pmd_is_migration_entry(pmd) ||
+			     pmd_is_swap_entry(pmd)))
 			pmd = pmd_swp_mksoft_dirty(pmd);
 		else if (pmd_present(pmd))
 			pmd = pmd_mksoft_dirty(pmd);
@@ -2599,7 +2624,14 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
 	pmd_t newpmd;
 
 	VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
-	if (softleaf_is_migration_write(entry)) {
+
+	/*
+	 * PMD swap entries don't encode write permission in the entry type,
+	 * so only uffd_wp flag changes apply. No folio lookup needed.
+	 */
+	if (softleaf_is_swap(entry)) {
+		newpmd = *pmd;
+	} else if (softleaf_is_migration_write(entry)) {
 		const struct folio *folio = softleaf_to_folio(entry);
 
 		/*
@@ -2658,7 +2690,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (!ptl)
 		return 0;
 
-	if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) {
+	if (pmd_is_valid_softleaf(*pmd)) {
 		change_non_present_huge_pmd(mm, addr, pmd, uffd_wp,
 					    uffd_wp_resolve);
 		goto unlock;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49..8c10e7e6fc0d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1101,6 +1101,12 @@ static inline enum scan_result check_pmd_state(pmd_t *pmd)
 	 */
 	if (pmd_is_migration_entry(pmde))
 		return SCAN_PMD_MAPPED;
+	/*
+	 * A PMD-mapped THP that has been swapped out is still a THP from
+	 * khugepaged's perspective; treat it like a present huge PMD.
+	 */
+	if (pmd_is_swap_entry(pmde))
+		return SCAN_PMD_MAPPED;
 	if (!pmd_present(pmde))
 		return SCAN_NO_PTE_TABLE;
 	if (pmd_trans_huge(pmde))
diff --git a/mm/madvise.c b/mm/madvise.c
index 9292f60b19aa..0d6aa0608f70 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -374,6 +374,15 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 					!can_do_file_pageout(vma);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * Swapped-out THPs have no resident folio to deactivate or reclaim.
+	 * Avoid descending into or splitting a PMD swap entry.
+	 */
+	if (pmd_is_swap_entry(*pmd)) {
+		walk->action = ACTION_CONTINUE;
+		return 0;
+	}
+
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
 		unsigned long next = pmd_addr_end(addr, end);
@@ -384,6 +393,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			return 0;
 
 		orig_pmd = *pmd;
+		if (pmd_is_swap_entry(orig_pmd))
+			goto huge_unlock;
+
 		if (is_huge_zero_pmd(orig_pmd))
 			goto huge_unlock;
 
@@ -665,7 +677,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr, max_nr;
 
 	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd))
+	if (pmd_trans_huge(*pmd) || pmd_is_swap_entry(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			return 0;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bba65898aee1..584ce81d4781 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -658,6 +658,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 		qp->nr_failed++;
 		return;
 	}
+	if (unlikely(pmd_is_swap_entry(*pmd)))
+		return;
 	folio = pmd_folio(*pmd);
 	if (is_huge_zero_folio(folio)) {
 		walk->action = ACTION_CONTINUE;
diff --git a/mm/mincore.c b/mm/mincore.c
index 53b982803771..ddf7c96964b0 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -99,6 +99,41 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
 	return present;
 }
 
+#ifdef CONFIG_THP_SWAP
+static void mincore_pmd_swap(swp_entry_t entry, unsigned long addr,
+			     unsigned long end, unsigned char *vec)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	unsigned long start = (addr - haddr) >> PAGE_SHIFT;
+	unsigned long nr = (end - addr) >> PAGE_SHIFT;
+	struct folio *folio;
+	enum swap_pmd_cache state;
+	int i;
+
+	state = swap_pmd_cache_lookup(entry, &folio);
+	if (state == SWAP_PMD_CACHE_HUGE) {
+		memset(vec, folio_test_uptodate(folio), nr);
+		folio_put(folio);
+		return;
+	}
+
+	if (state == SWAP_PMD_CACHE_EMPTY) {
+		memset(vec, 0, nr);
+		return;
+	}
+
+	/*
+	 * The PMD swap entry is only a compact encoding for consecutive swap
+	 * slots. If the PMD-sized swapcache folio was split, report residency
+	 * from the individual slots covered by this mincore() range.
+	 */
+	for (i = 0; i < nr; i++)
+		vec[i] = mincore_swap(swp_entry(swp_type(entry),
+						swp_offset(entry) + start + i),
+				      false);
+}
+#endif
+
 /*
  * Later we can get more picky about what "in core" means precisely.
  * For now, simply check to see if the page is in the page cache,
@@ -172,7 +207,15 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		memset(vec, 1, nr);
+		if (pmd_is_swap_entry(*pmd)) {
+#ifdef CONFIG_THP_SWAP
+			mincore_pmd_swap(softleaf_from_pmd(*pmd), addr, end, vec);
+#else
+			memset(vec, 0, nr);
+#endif
+		} else {
+			memset(vec, 1, nr);
+		}
 		spin_unlock(ptl);
 		goto out;
 	}
-- 
2.53.0-Meta


* [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (5 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

swapin_walk_pmd_entry() walks PTEs and skips non-present PMDs, so
MADV_WILLNEED is a no-op on a PMD swap entry.

Handle PMD swap entries under pmd_trans_huge_lock(). If the covered
swap-cache range already has a PMD-sized folio, there is nothing left
to prefetch. If the range has split cache state, or any covered slot
currently has a zswap entry, split the PMD swap entry and ask the
walker to retry so the PTE path can handle the individual slots.

Otherwise pin the swap device and read the folio in at PMD order via
swapin_sync(BIT(HPAGE_PMD_ORDER)). This keeps the subsequent fault on
the do_huge_pmd_swap_page() path and avoids order-0 readahead
needlessly splitting the PMD swap entry. If PMD-order swapin races
with per-slot swap-cache population after dropping the PMD lock, split
and retry through the PTE path instead.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
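Not part of the commit: a minimal userspace sketch of the decision order
described above, assuming the three cache states named in this series. The
enum values, struct and function names are hypothetical stand-ins for
illustration, not kernel API; the real code operates on folios and the PMD
lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define PMD_NR 512UL /* assumption: HPAGE_PMD_NR with 4 KiB pages and 2 MiB PMDs */

/* Hypothetical model of the swap-cache classification for one PMD range. */
enum pmd_cache_state { CACHE_EMPTY, CACHE_HUGE, CACHE_SPLIT };
enum willneed_action { ACT_NOTHING, ACT_SPLIT_RETRY, ACT_SWAPIN_PMD };

/* Decision taken under the PMD lock, in the order the patch checks it. */
static enum willneed_action willneed_pmd_action(enum pmd_cache_state state,
                                                bool zswap_has_entry)
{
    if (state == CACHE_HUGE)
        return ACT_NOTHING;        /* PMD-sized folio already cached */
    if (state == CACHE_SPLIT || zswap_has_entry)
        return ACT_SPLIT_RETRY;    /* let the PTE path handle each slot */
    return ACT_SWAPIN_PMD;         /* empty cache, nothing in zswap */
}

/* Hypothetical summary of what the PMD-order swap-in returned. */
struct swapin_result {
    int err;                /* 0, or a negative errno */
    unsigned long nr_pages; /* 0 if no folio was returned */
    bool locked;
    bool uptodate;
};

/* After dropping the PMD lock: did PMD-order swap-in lose a race? */
static bool split_after_swapin(const struct swapin_result *r,
                               bool zswap_has_entry)
{
    if (r->err)
        return r->err == -EBUSY;   /* only -EBUSY triggers a split */
    if (r->nr_pages == 0)
        return false;              /* no folio, nothing to do */
    if (r->nr_pages != PMD_NR)
        return true;               /* per-slot folio won the race */
    return !r->locked && !r->uptodate && zswap_has_entry;
}
```

In this model every ACT_SPLIT_RETRY or "true" result corresponds to
__split_huge_pmd() followed by walk->action = ACTION_AGAIN in the patch.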
 mm/madvise.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 0d6aa0608f70..78a08039e173 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -32,6 +32,7 @@
 #include <linux/leafops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/zswap.h>
 
 #include <asm/tlb.h>
 
@@ -193,6 +194,79 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	spinlock_t *ptl;
 	unsigned long addr;
 
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		pmd_t pmdval = *pmd;
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t entry = softleaf_from_pmd(pmdval);
+			struct vm_fault vmf = {
+				.vma = vma,
+				.address = start,
+				.real_address = start,
+				.pmd = pmd,
+			};
+			struct swap_info_struct *si;
+			struct folio *folio;
+			enum swap_pmd_cache cache_state;
+			bool split = false;
+
+			cache_state = swap_pmd_cache_lookup(entry, &folio);
+			if (cache_state == SWAP_PMD_CACHE_HUGE) {
+				folio_put(folio);
+				spin_unlock(ptl);
+				goto ret;
+			}
+			if (cache_state == SWAP_PMD_CACHE_SPLIT ||
+			    zswap_range_has_entry(entry, HPAGE_PMD_NR)) {
+				spin_unlock(ptl);
+				__split_huge_pmd(vma, pmd, start, false);
+				walk->action = ACTION_AGAIN;
+				goto ret;
+			}
+
+			/*
+			 * Pin the swap device under the PMD lock so the
+			 * PMD-swap-entry observation keeps the entry valid for
+			 * swapin_sync().
+			 */
+			si = get_swap_device(entry);
+			spin_unlock(ptl);
+			if (!si)
+				goto ret;
+
+			folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+					    BIT(HPAGE_PMD_ORDER), &vmf,
+					    NULL, 0);
+			/*
+			 * The empty-cache observation was made under the PMD
+			 * lock, but swap cache can change after dropping it. If
+			 * PMD-order swapin lost a race to per-slot cache state,
+			 * retry through the PTE path.
+			 */
+			if (IS_ERR(folio)) {
+				if (PTR_ERR(folio) == -EBUSY)
+					split = true;
+			} else if (folio) {
+				if (folio_nr_pages(folio) != HPAGE_PMD_NR)
+					split = true;
+				else if (!folio_test_locked(folio) &&
+					 !folio_test_uptodate(folio) &&
+					 zswap_range_has_entry(entry,
+							       HPAGE_PMD_NR))
+					split = true;
+				folio_put(folio);
+			}
+			put_swap_device(si);
+			if (split) {
+				__split_huge_pmd(vma, pmd, start, false);
+				walk->action = ACTION_AGAIN;
+			}
+			goto ret;
+		}
+		spin_unlock(ptl);
+	}
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		pte_t pte;
 		softleaf_t entry;
@@ -221,6 +295,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (ptep)
 		pte_unmap_unlock(ptep, ptl);
 	swap_read_unplug(splug);
+ret:
 	cond_resched();
 
 	return 0;
-- 
2.53.0-Meta


* [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (6 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in Usama Arif
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

move_pages_huge_pmd() returns -ENOENT for any PMD that is neither
trans-huge nor a migration entry, so an aligned UFFDIO_MOVE fails on a
swapped-out THP even though the PMD swap entry is a valid mapping that
should move whole. Splitting and falling back to move_pages_ptes() is
not a substitute either: __split_huge_pmd_locked() splits a PMD swap
entry into HPAGE_PMD_NR PTE swap entries pointing at the same
swap-cache folio, but move_swap_pte() refuses any swap-cache folio that
is still large and returns -EBUSY.

Add move_swap_pmd(), modeled on move_swap_pte(), which moves the swap
entry as a whole PMD and re-anchors a PMD-sized swap-cache folio's anon
rmap to the destination VMA. It marks the moved entry soft-dirty where
the page tables support it and carries the deposited page table across
with the entry. The caller rejects !pmd_swp_exclusive() entries with
-EBUSY to preserve UFFDIO_MOVE's single-owner semantics.

The dispatcher in move_pages_huge_pmd() still waits on a PMD migration
entry (matching the PTE path) and now routes PMD swap entries through
move_swap_pmd() after pinning the swap device and arming an
mmu_notifier range so secondary MMUs see the move.

Before moving, classify the whole PMD swap-cache range with
swap_pmd_cache_lookup(). A PMD swap entry can be moved whole only if
the covered range is empty or backed by one PMD-sized folio. If the
range already has per-slot cache state, split the PMD swap entry and
return -EAGAIN so the caller retries through the PTE path.

If a PMD-sized folio is cached, lock and revalidate that it still
matches the PMD swap entry. If no folio is cached, recheck all
HPAGE_PMD_NR slots under both PMD locks before moving the entry; any
per-slot folio that appears needs the PTE move path to update its rmap
metadata. This avoids moving the PMD while cached folios still point at
the old anon_vma/index.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
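Not part of the commit: a small userspace model of the classification and
dispatch rules described above. It assumes a range counts as "split" as soon
as it is neither fully empty nor fully covered by one PMD-sized folio, which
is how the cover letter describes it; the slot_cache enum, classify_range()
and uffd_move_pmd_swap() are hypothetical names, not the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define PMD_NR 512 /* assumption: HPAGE_PMD_NR with 4 KiB pages and 2 MiB PMDs */

/* Hypothetical per-slot view of the swap cache. */
enum slot_cache { SLOT_EMPTY, SLOT_SMALL_FOLIO, SLOT_PMD_FOLIO };
enum pmd_cache_state { CACHE_EMPTY, CACHE_HUGE, CACHE_SPLIT };

/* Classify the PMD_NR slots covered by one PMD swap entry. */
static enum pmd_cache_state classify_range(const enum slot_cache slots[PMD_NR])
{
    size_t huge = 0, small = 0;

    for (size_t i = 0; i < PMD_NR; i++) {
        if (slots[i] == SLOT_PMD_FOLIO)
            huge++;
        else if (slots[i] == SLOT_SMALL_FOLIO)
            small++;
    }
    if (huge == PMD_NR)
        return CACHE_HUGE;   /* one PMD-sized folio covers everything */
    if (huge == 0 && small == 0)
        return CACHE_EMPTY;  /* nothing cached for this range */
    return CACHE_SPLIT;      /* per-slot state: needs the PTE path */
}

/* Outcome of the dispatcher for a PMD swap entry, before taking both locks. */
static int uffd_move_pmd_swap(bool exclusive, enum pmd_cache_state state)
{
    if (!exclusive)
        return -EBUSY;   /* shared entry: single-owner semantics */
    if (state == CACHE_SPLIT)
        return -EAGAIN;  /* PMD swap entry is split, caller retries */
    return 0;            /* entry may be moved whole */
}
```

A return of 0 here only means the whole-PMD move may be attempted; in the
patch, move_swap_pmd() can still return -EAGAIN when its rechecks under both
PMD locks fail.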
 mm/huge_memory.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 132 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4cbd6123bf18..fdc1a503c609 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2810,6 +2810,72 @@ int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 #endif
 
 #ifdef CONFIG_USERFAULTFD
+/*
+ * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are
+ * acquired here; src_folio (if present) must already be locked. The deposited
+ * page table backing the source THP is moved across with the entry.
+ */
+static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 pmd_t *dst_pmd, pmd_t *src_pmd,
+			 pmd_t orig_dst_pmd, pmd_t orig_src_pmd,
+			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
+			 struct folio *src_folio, swp_entry_t entry)
+{
+	pgtable_t src_pgtable;
+	pmd_t moved_pmd;
+
+	/*
+	 * The folio may have been freed and reused for a different swap entry
+	 * while it was unlocked. Re-verify the association.
+	 */
+	if (src_folio && unlikely(!folio_matches_swap_entry(src_folio, entry) ||
+				  folio_nr_pages(src_folio) != HPAGE_PMD_NR))
+		return -EAGAIN;
+
+	double_pt_lock(dst_ptl, src_ptl);
+
+	if (!pmd_same(*src_pmd, orig_src_pmd) ||
+	    !pmd_same(*dst_pmd, orig_dst_pmd)) {
+		double_pt_unlock(dst_ptl, src_ptl);
+		return -EAGAIN;
+	}
+
+	/*
+	 * If the folio is in the swap cache, re-anchor its anon rmap to the
+	 * destination VMA so a future swap-in fault at dst_addr finds it.
+	 * Otherwise, re-check the whole PMD swap range: a PMD swap entry is
+	 * only a compact encoding for 512 swap slots, and any per-slot cached
+	 * folio would need the PTE move path to update its rmap metadata.
+	 */
+	if (src_folio) {
+		folio_move_anon_rmap(src_folio, dst_vma);
+		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		unsigned int type = swp_type(entry);
+		pgoff_t offset = swp_offset(entry);
+		int i;
+
+		for (i = 0; i < HPAGE_PMD_NR; i++) {
+			if (swap_cache_has_folio(swp_entry(type, offset + i))) {
+				double_pt_unlock(dst_ptl, src_ptl);
+				return -EAGAIN;
+			}
+		}
+	}
+
+	moved_pmd = pmdp_huge_get_and_clear(mm, src_addr, src_pmd);
+	if (pgtable_supports_soft_dirty())
+		moved_pmd = pmd_swp_mksoft_dirty(moved_pmd);
+	set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd);
+
+	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+
+	double_pt_unlock(dst_ptl, src_ptl);
+	return 0;
+}
+
 /*
  * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
  * the caller, but it must return after releasing the page_table_lock.
@@ -2844,11 +2910,76 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 
 	if (!pmd_trans_huge(src_pmdval)) {
-		spin_unlock(src_ptl);
 		if (pmd_is_migration_entry(src_pmdval)) {
+			spin_unlock(src_ptl);
 			pmd_migration_entry_wait(mm, &src_pmdval);
 			return -EAGAIN;
 		}
+		if (pmd_is_swap_entry(src_pmdval)) {
+			swp_entry_t entry;
+			struct swap_info_struct *si;
+			enum swap_pmd_cache cache_state;
+
+			/*
+			 * UFFDIO_MOVE on anon mappings requires single-owner
+			 * semantics; refuse to move a shared swap entry.
+			 */
+			if (!pmd_swp_exclusive(src_pmdval)) {
+				spin_unlock(src_ptl);
+				return -EBUSY;
+			}
+
+			entry = softleaf_from_pmd(src_pmdval);
+			spin_unlock(src_ptl);
+
+			/* Pin the swap device against a racing swapoff. */
+			si = get_swap_device(entry);
+			if (unlikely(!si))
+				return -EAGAIN;
+
+			src_folio = NULL;
+			cache_state = swap_pmd_cache_lookup(entry, &src_folio);
+			if (cache_state == SWAP_PMD_CACHE_SPLIT) {
+				put_swap_device(si);
+				__split_huge_pmd(src_vma, src_pmd, src_addr, false);
+				return -EAGAIN;
+			}
+
+			mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+						mm, src_addr,
+						src_addr + HPAGE_PMD_SIZE);
+			mmu_notifier_invalidate_range_start(&range);
+
+			if (src_folio) {
+				folio_lock(src_folio);
+				if (!folio_matches_swap_entry(src_folio, entry) ||
+				    folio_nr_pages(src_folio) != HPAGE_PMD_NR) {
+					err = -EAGAIN;
+					folio_unlock(src_folio);
+					folio_put(src_folio);
+					mmu_notifier_invalidate_range_end(&range);
+					put_swap_device(si);
+					__split_huge_pmd(src_vma, src_pmd,
+							 src_addr, false);
+					return err;
+				}
+			}
+
+			dst_ptl = pmd_lockptr(mm, dst_pmd);
+			err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
+					    dst_pmd, src_pmd, dst_pmdval,
+					    src_pmdval, dst_ptl, src_ptl,
+					    src_folio, entry);
+
+			mmu_notifier_invalidate_range_end(&range);
+			if (src_folio) {
+				folio_unlock(src_folio);
+				folio_put(src_folio);
+			}
+			put_swap_device(si);
+			return err;
+		}
+		spin_unlock(src_ptl);
 		return -ENOENT;
 	}
 
-- 
2.53.0-Meta


* [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (7 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 10/11] mm: install PMD swap entries on swap-out Usama Arif
  2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault()
when vmf->orig_pmd encodes a swap entry.  The handler resolves the
entire 2 MB mapping in one shot, mirroring do_swap_page() (PTE path)
at PMD granularity:

  - Look up the folio in the swap cache; on a miss, split if any slot
    is still in zswap, otherwise read a PMD-order folio in from swap
    via swapin_sync(BIT(HPAGE_PMD_ORDER)).

  - After locking, re-validate that the folio still corresponds to our
    entry and is still PMD-sized.  Between the unlocked cache lookup
    and the lock, a racing swap-in on the same entry may have removed
    it from the cache via folio_free_swap(), or reclaim / memory_failure
    / deferred-split may have split the folio into smaller folios.

  - Restore soft_dirty and uffd_wp from the swap PMD.  Map writable
    only when the entry was exclusive (or the folio has no other
    references), the VMA permits writes, and neither uffd-wp nor
    soft-dirty tracking requires write protection.  Drop the exclusive
    marker when the cached folio is under writeback to an
    SWP_STABLE_WRITES backend (zram, encrypted) so the PMD is mapped
    read-only; a later write COWs into a fresh folio rather than
    corrupting the in-flight writeback.  Mirrors do_swap_page().

  - When the resulting PMD is read-only but the fault was a write,
    update vmf->orig_pmd and call wp_huge_pmd() in the same handler
    to COW immediately rather than forcing a second fault.  Mask
    VM_FAULT_FALLBACK from its return value: a PMD COW that splits to
    PTE level is normal, but the bit is part of VM_FAULT_ERROR, and
    arch fault handlers BUG() when it is set without
    SIGBUS/HWPOISON/SIGSEGV.  This requires exposing wp_huge_pmd() via
    mm/internal.h.

  - Free the swap slot via should_try_to_free_swap() (hoisted from
    mm/memory.c into mm/internal.h so PTE- and PMD-level swap-in
    share the heuristic).

When PMD-order resources are unavailable (folio allocation fails,
the cached folio was split, memcg charge fails, or swapin_sync()
races), split the PMD swap entry into HPAGE_PMD_NR PTE swap entries via
__split_huge_pmd() and return 0.  The fault retries and do_swap_page()
takes over per-PTE.  This avoids returning VM_FAULT_OOM for transient
PMD-order allocation failures.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
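Not part of the commit: a userspace sketch of two rules from the list above,
the writable-mapping decision and the VM_FAULT_FALLBACK masking. The struct,
the function names and the F_* bit values are made up for illustration; the
kernel's VM_FAULT_* values differ, only the bit logic is meant to match.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the inputs the fault handler looks at. */
struct swapin_pmd_state {
    bool entry_exclusive; /* pmd_swp_exclusive() on the swap PMD */
    bool under_writeback; /* folio still under writeback */
    bool stable_writes;   /* backend has SWP_STABLE_WRITES */
    bool sole_ref;        /* folio has no other references */
    bool vma_write;       /* VMA has VM_WRITE */
    bool uffd_wp;         /* uffd-wp armed for this PMD */
    bool soft_dirty_wp;   /* soft-dirty tracking needs write protection */
};

/* Should the restored PMD be mapped writable? */
static bool map_writable(const struct swapin_pmd_state *s)
{
    bool exclusive = s->entry_exclusive;

    /* Writeback to a stable-writes backend must not see new stores. */
    if (exclusive && s->under_writeback && s->stable_writes)
        exclusive = false;
    if (!(exclusive || s->sole_ref))
        return false;
    return s->vma_write && !s->uffd_wp && !s->soft_dirty_wp;
}

/* Illustrative bit values only; not the kernel's VM_FAULT_* encoding. */
#define F_MAJOR    0x1u
#define F_FALLBACK 0x2u
#define F_OOM      0x4u
#define F_SIGBUS   0x8u
#define F_ERROR    (F_OOM | F_SIGBUS | F_FALLBACK)

/* Merge the wp_huge_pmd() result the way the handler does. */
static unsigned int merge_wp_result(unsigned int ret, unsigned int wp_ret)
{
    wp_ret &= ~F_FALLBACK;  /* a split during COW is not an error */
    ret |= wp_ret;
    if (ret & F_ERROR)
        ret &= F_ERROR;     /* on error, report only the error bits */
    return ret;
}
```

With these values, a major fault whose follow-up COW falls back to PTE level
still returns just the major-fault bit, while a real error from the COW
replaces it.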
 include/linux/huge_mm.h |   9 ++
 mm/huge_memory.c        | 216 ++++++++++++++++++++++++++++++++++++++++
 mm/internal.h           |  36 +++++++
 mm/memory.c             |  40 +-------
 4 files changed, 265 insertions(+), 36 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1487bf4af1a7..9ec475ccfc91 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -531,6 +531,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
+#ifdef CONFIG_THP_SWAP
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+#else
+static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+	return 0;
+}
+#endif
+
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fdc1a503c609..5fa60324a2f0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -41,6 +41,7 @@
 #include <linux/pgalloc.h>
 #include <linux/pgalloc_tag.h>
 #include <linux/pagewalk.h>
+#include <linux/zswap.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -2312,6 +2313,221 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_THP_SWAP
+/**
+ * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry.
+ * @vmf: Fault context. vmf->orig_pmd contains the swap PMD.
+ *
+ * A PMD swap entry is a compact encoding for HPAGE_PMD_NR consecutive swap
+ * slots. If the swap cache still has one PMD-sized folio covering the range,
+ * map it directly at PMD level. If the range has been split into per-page
+ * cache state, or zswap may have per-page state for it, split the PMD swap
+ * entry and retry at PTE granularity.
+ *
+ * Return: VM_FAULT_* flags.
+ */
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct folio *folio;
+	struct page *page;
+	struct swap_info_struct *si;
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	softleaf_t entry;
+	swp_entry_t swp_entry;
+	pmd_t pmd;
+	vm_fault_t ret = 0;
+	bool exclusive;
+	rmap_t rmap_flags = RMAP_NONE;
+	enum swap_pmd_cache cache_state;
+
+	entry = softleaf_from_pmd(vmf->orig_pmd);
+	if (unlikely(!softleaf_is_swap(entry)))
+		return 0;
+
+	swp_entry = entry;
+
+	/* Prevent swapoff from happening to us. */
+	si = get_swap_device(swp_entry);
+	if (unlikely(!si))
+		return 0;
+
+	cache_state = swap_pmd_cache_lookup(swp_entry, &folio);
+	if (cache_state == SWAP_PMD_CACHE_SPLIT)
+		goto split_fallback;
+	if (!folio) {
+		/*
+		 * PMD swap entries encode ordinary per-page swap slots. If any
+		 * slot is in zswap, split and let the PTE swap path load the
+		 * range per page. Otherwise the range is all on disk and can be
+		 * read back as one PMD-sized folio.
+		 */
+		if (zswap_range_has_entry(swp_entry, HPAGE_PMD_NR))
+			goto split_fallback;
+
+		folio = swapin_sync(swp_entry, GFP_HIGHUSER_MOVABLE,
+				    BIT(HPAGE_PMD_ORDER), vmf, NULL, 0);
+		if (IS_ERR_OR_NULL(folio))
+			goto split_fallback;
+
+		/* Had to read from swap area: Major fault */
+		ret = VM_FAULT_MAJOR;
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(mm, PGMAJFAULT);
+	}
+
+	ret |= folio_lock_or_retry(folio, vmf);
+	if (ret & VM_FAULT_RETRY)
+		goto out_release;
+
+	/* Verify the folio is still in swap cache and matches our entry */
+	if (unlikely(!folio_matches_swap_entry(folio, swp_entry)))
+		goto out_page;
+
+	/*
+	 * Folio should be PMD-sized; if not (e.g. split in swap cache),
+	 * split the PMD swap entry and retry at PTE level.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		goto split_fallback;
+	}
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		if (zswap_range_has_entry(swp_entry, HPAGE_PMD_NR)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			goto split_fallback;
+		}
+		ret = VM_FAULT_SIGBUS;
+		goto out_page;
+	}
+
+	page = folio_page(folio, 0);
+	arch_swap_restore(folio_swap(swp_entry, folio), folio);
+
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio))
+		lru_add_drain();
+
+	folio_throttle_swaprate(folio, GFP_KERNEL);
+
+	/* Lock the PMD and verify it hasn't changed */
+	vmf->ptl = pmd_lock(mm, vmf->pmd);
+	if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) {
+		spin_unlock(vmf->ptl);
+		goto out_page;
+	}
+
+	exclusive = pmd_swp_exclusive(vmf->orig_pmd);
+
+	/*
+	 * Some swap backends (e.g. zram) don't support concurrent page
+	 * modifications while under writeback. If we map exclusive on such
+	 * a backend while the folio is still under writeback, the writeback
+	 * may see partial modifications and corrupt the swap slot. Drop the
+	 * exclusive marker and only map R/O for that case; further GUP
+	 * references can't appear once the page is fully unmapped, so this
+	 * is safe.
+	 */
+	if (exclusive && folio_test_writeback(folio) &&
+	    data_race(si->flags & SWP_STABLE_WRITES))
+		exclusive = false;
+
+	/*
+	 * Set up the PMD mapping. Similar to do_swap_page() but at PMD level.
+	 */
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	pmd = pmd_mkyoung(pmd);
+
+	if (pmd_swp_soft_dirty(vmf->orig_pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+	if (pmd_swp_uffd_wp(vmf->orig_pmd))
+		pmd = pmd_mkuffd_wp(pmd);
+
+	/*
+	 * Check exclusivity to determine if we can map writable.
+	 */
+	if (exclusive || folio_ref_count(folio) == 1) {
+		if ((vma->vm_flags & VM_WRITE) &&
+		    !userfaultfd_huge_pmd_wp(vma, pmd) &&
+		    !pmd_needs_soft_dirty_wp(vma, pmd)) {
+			pmd = pmd_mkwrite(pmd, vma);
+			if (vmf->flags & FAULT_FLAG_WRITE) {
+				pmd = pmd_mkdirty(pmd);
+				vmf->flags &= ~FAULT_FLAG_WRITE;
+			}
+		}
+		rmap_flags |= RMAP_EXCLUSIVE;
+	}
+
+	flush_icache_pages(vma, page, HPAGE_PMD_NR);
+
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags);
+
+	folio_put_swap(folio, NULL);
+
+	set_pmd_at(mm, haddr, vmf->pmd, pmd);
+	update_mmu_cache_pmd(vma, haddr, vmf->pmd);
+
+	/* Update orig_pmd for any follow-up wp_huge_pmd() below. */
+	vmf->orig_pmd = pmd;
+
+	/*
+	 * Conditionally try to free up the swap cache. Do it after mapping,
+	 * so raced page faults will likely see the folio in swap cache and
+	 * wait on the folio lock.
+	 */
+	if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+		folio_free_swap(folio);
+
+	spin_unlock(vmf->ptl);
+
+	folio_unlock(folio);
+	put_swap_device(si);
+
+	/*
+	 * If the write fault wasn't satisfied above (folio is shared without
+	 * exclusivity), fall through to wp_huge_pmd to handle COW or
+	 * userfaultfd-wp without forcing a second fault.
+	 *
+	 * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the
+	 * PMD; that's a normal outcome — the natural PTE-level refault will
+	 * complete the COW. Mask it so callers (and the arch fault handler)
+	 * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR.
+	 */
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		vm_fault_t wp_ret = wp_huge_pmd(vmf);
+
+		wp_ret &= ~VM_FAULT_FALLBACK;
+		ret |= wp_ret;
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
+	}
+
+	return ret;
+
+out_page:
+	folio_unlock(folio);
+out_release:
+	folio_put(folio);
+	put_swap_device(si);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, vmf->pmd, haddr, false);
+	put_swap_device(si);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t pgtable;
diff --git a/mm/internal.h b/mm/internal.h
index fa4fb69444ec..5c7f5b408ba3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -506,6 +506,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
 }
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf);
+
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if swap
+ * count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+					   struct folio *folio,
+					   struct vm_area_struct *vma,
+					   unsigned int extra_refs,
+					   unsigned int fault_flags)
+{
+	if (!folio_test_swapcache(folio))
+		return false;
+	/*
+	 * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
+	 * cache can help save some IO or memory overhead, but these devices
+	 * are fast, and meanwhile, swap cache pinning the slot deferring the
+	 * release of metadata or fragmentation is a more critical issue.
+	 */
+	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return true;
+	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
+	    folio_test_mlocked(folio))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * user. Try freeing the swapcache to get rid of the swapcache
+	 * reference only in case it's likely that we'll be the exclusive user.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
+		folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+}
+
 void folio_rotate_reclaimable(struct folio *folio);
 bool __folio_end_writeback(struct folio *folio);
 void deactivate_file_folio(struct folio *folio);
diff --git a/mm/memory.c b/mm/memory.c
index e0819a562187..478b54423713 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4497,40 +4497,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
-/*
- * Check if we should call folio_free_swap to free the swap cache.
- * folio_free_swap only frees the swap cache to release the slot if swap
- * count is zero, so we don't need to check the swap count here.
- */
-static inline bool should_try_to_free_swap(struct swap_info_struct *si,
-					   struct folio *folio,
-					   struct vm_area_struct *vma,
-					   unsigned int extra_refs,
-					   unsigned int fault_flags)
-{
-	if (!folio_test_swapcache(folio))
-		return false;
-	/*
-	 * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
-	 * cache can help save some IO or memory overhead, but these devices
-	 * are fast, and meanwhile, swap cache pinning the slot deferring the
-	 * release of metadata or fragmentation is a more critical issue.
-	 */
-	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
-		return true;
-	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
-	    folio_test_mlocked(folio))
-		return true;
-	/*
-	 * If we want to map a page that's in the swapcache writable, we
-	 * have to detect via the refcount if we're really the exclusive
-	 * user. Try freeing the swapcache to get rid of the swapcache
-	 * reference only in case it's likely that we'll be the exclusive user.
-	 */
-	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-		folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
-}
-
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 {
 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -6200,8 +6166,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
-/* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
@@ -6486,6 +6451,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 		if (pmd_is_migration_entry(vmf.orig_pmd))
 			pmd_migration_entry_wait(mm, vmf.pmd);
+		else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+			 pmd_is_swap_entry(vmf.orig_pmd))
+			return do_huge_pmd_swap_page(&vmf);
 		return 0;
 	}
 	if (pmd_trans_huge(vmf.orig_pmd)) {
-- 
2.53.0-Meta


* [PATCH v3 10/11] mm: install PMD swap entries on swap-out
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (8 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
  10 siblings, 0 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap
entries before unmap, losing the huge mapping across the swap
round-trip and forcing khugepaged to rebuild it later. The contiguous
swap range was already secured when the folio was added to the swap
cache (a non-contiguous allocation would have split the folio earlier),
so the PMD can be replaced by a single PMD-level swap entry instead.

This patch mirrors the existing PTE swap-out path at PMD granularity:
- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
  swapcache folios. zswap is handled by the PMD swap-in users: if any
  covered slot currently has a zswap entry, they split the PMD swap
  entry and fall back to the per-PTE path.
- try_to_unmap_one() now has a PMD branch that calls
  set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by
  HPAGE_PMD_NR before walk_done. TTU_SPLIT_HUGE_PMD remains the
  fallback.
- set_pmd_swap_entry() is the installer. Mirroring the PTE swap-out
  sequence at PMD granularity, it clears the present mapping (keeping
  the original for rollback), bumps the swap_map refcount for the
  folio's HPAGE_PMD_NR slots, transfers the anon-exclusive state to
  the swap entry, propagates the dirty bit to the folio so dirty data
  is not lost, and installs a swap PMD that preserves the original
  soft-dirty / uffd-wp / exclusive bits. If taking the swap references
  or sharing the anon rmap fails, the present mapping is restored.

The swap entry value matches what 512 PTE swap entries would encode, so
swap_map refcounting is unchanged: each of the 512 slots carries a
count of 1, released individually on later split or together on swap-in.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
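Not part of the commit: a userspace sketch of the bit transfer and counter
adjustment described above, assuming HPAGE_PMD_NR is 512. The structs and
function names are hypothetical; in the kernel these bits live in the PMD
value and the counters are per-mm RSS counters.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_NR 512L /* assumption: HPAGE_PMD_NR with 4 KiB pages and 2 MiB PMDs */

/* Hypothetical views of the PMD before and after swap-out. */
struct present_pmd { bool dirty; bool soft_dirty; bool uffd_wp; };
struct swap_pmd    { bool soft_dirty; bool uffd_wp; bool exclusive; };
struct mm_counters { long anon_pages; long swap_ents; };

/*
 * Build the swap PMD from the invalidated present PMD. The hardware dirty
 * bit is not kept in the swap PMD; it is pushed to the folio instead.
 */
static struct swap_pmd make_swap_pmd(struct present_pmd old,
                                     bool anon_exclusive,
                                     bool *folio_dirty)
{
    struct swap_pmd swp = {
        .soft_dirty = old.soft_dirty,
        .uffd_wp = old.uffd_wp,
        .exclusive = anon_exclusive,
    };

    if (old.dirty)
        *folio_dirty = true;
    return swp;
}

/* One PMD swap entry accounts for PMD_NR pages in both counters. */
static void account_pmd_swapout(struct mm_counters *mm)
{
    mm->anon_pages -= PMD_NR;
    mm->swap_ents += PMD_NR;
}
```

The counter update runs only after the installer succeeds, which matches the
patch: a failed set_pmd_swap_entry() aborts the walk before any accounting
changes.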
 include/linux/huge_mm.h       |  2 +
 include/linux/vm_event_item.h |  1 +
 mm/huge_memory.c              | 78 +++++++++++++++++++++++++++++++++++
 mm/rmap.c                     | 20 +++++++++
 mm/vmscan.c                   |  9 +++-
 mm/vmstat.c                   |  1 +
 6 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9ec475ccfc91..b746f8c8db69 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -533,6 +533,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
 #ifdef CONFIG_THP_SWAP
 vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio);
 #else
 static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_SWPOUT_PMD,
 #endif
 #ifdef CONFIG_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa60324a2f0..7ec81a9c4bc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5450,3 +5450,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ *        pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = folio_page(folio, 0);
+	bool anon_exclusive;
+	pmd_t pmdval;
+	swp_entry_t entry;
+	pmd_t pmdswp;
+
+	if (!(pvmw->pmd && !pvmw->pte))
+		return 0;
+
+	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+	if (unlikely(folio_test_swapbacked(folio) !=
+			folio_test_swapcache(folio))) {
+		WARN_ON_ONCE(1);
+		return -EBUSY;
+	}
+
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	if (folio_dup_swap(folio, NULL) < 0) {
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -ENOMEM;
+	}
+
+	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+	anon_exclusive = PageAnonExclusive(page);
+	if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+		folio_put_swap(folio, NULL);
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -EBUSY;
+	}
+
+	if (pmd_dirty(pmdval))
+		folio_mark_dirty(folio);
+
+	entry = folio->swap;
+	pmdswp = softleaf_to_pmd(entry);
+	if (pmd_soft_dirty(pmdval))
+		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+	if (pmd_uffd_wp(pmdval))
+		pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+	if (anon_exclusive)
+		pmdswp = pmd_swp_mkexclusive(pmdswp);
+	set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+	folio_remove_rmap_pmd(folio, page, vma);
+	folio_put(folio);
+
+	count_vm_event(THP_SWPOUT_PMD);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 0fb7a1b82cf3..ffc7aa62a29e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2079,6 +2079,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto walk_abort;
 			}
 
+#ifdef CONFIG_THP_SWAP
+			/*
+			 * If the folio is in the swap cache and we're not
+			 * asked to split, install a PMD-level swap entry.
+			 */
+			if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+			    folio_test_anon(folio) &&
+			    folio_test_swapcache(folio)) {
+				if (set_pmd_swap_entry(&pvmw, folio))
+					goto walk_abort;
+
+				mm_prepare_for_swap_entries(mm);
+				add_mm_counter(mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_SWAPENTS,
+					       HPAGE_PMD_NR);
+				goto walk_done;
+			}
+#endif
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
 				 * We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56fe5393f30f..3d7999c3f1ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1321,7 +1321,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
 
-			if (folio_test_pmd_mappable(folio))
+			/*
+			 * With THP_SWAP, PMD-mappable folios already in the
+			 * swap cache can be unmapped with a PMD-level swap
+			 * entry, avoiding the cost of splitting the PMD.
+			 */
+			if (folio_test_pmd_mappable(folio) &&
+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
+			      folio_test_swapcache(folio)))
 				flags |= TTU_SPLIT_HUGE_PMD;
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7b93fbf9af09..629055399987 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
 	[I(THP_ZERO_PAGE_ALLOC_FAILED)]		= "thp_zero_page_alloc_failed",
 	[I(THP_SWPOUT)]				= "thp_swpout",
 	[I(THP_SWPOUT_FALLBACK)]		= "thp_swpout_fallback",
+	[I(THP_SWPOUT_PMD)]			= "thp_swpout_pmd",
 #endif
 #ifdef CONFIG_BALLOON
 	[I(BALLOON_INFLATE)]			= "balloon_inflate",
-- 
2.53.0-Meta



* [PATCH v3 11/11] selftests/mm: add PMD swap entry tests
  2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
                   ` (9 preceding siblings ...)
  2026-07-03 17:38 ` [PATCH v3 10/11] mm: install PMD swap entries on swap-out Usama Arif
@ 2026-07-03 17:38 ` Usama Arif
  2026-07-04  6:27   ` kernel test robot
  2026-07-04  8:30   ` kernel test robot
  10 siblings, 2 replies; 14+ messages in thread
From: Usama Arif @ 2026-07-03 17:38 UTC (permalink / raw)
  To: Andrew Morton, david, chrisl, kasong, ljs, ziy, linux-mm
  Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
	shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
	Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif

Exercise the PMD swap entry paths. The tests allocate a PMD-mapped
THP, write a known pattern, swap it out via MADV_PAGEOUT, and then
exercise different code paths:

 - swap-out / swap-in round-trip with data verification
 - fork with read-only access from both parent and child
 - fork with writes in both processes to verify COW isolation
 - repeated swap cycles to catch reference counting issues
 - write fault on a swapped PMD to verify dirty handling and PMD
   mapping restoration
 - munmap of a swapped PMD (zap_huge_pmd swap slot cleanup)
 - mprotect on a swapped PMD (change_non_present_huge_pmd)
 - UFFDIO_MOVE on a swapped PMD (move_pages_huge_pmd swap path)
 - mremap of a swapped PMD (move_soft_dirty_pmd)
 - pagemap reading (pagemap_pmd_range_thp softleaf_has_pfn guard)
 - mincore on a swapped PMD without faulting it in
 - MADV_FREE on a swapped PMD: uses pagemap to verify that the swap slots
   are freed, then checks that the memory reads back as zero
 - MADV_WILLNEED on a swapped PMD
 - swapoff with active PMD swap entries
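
As a rough illustration of the swapped-state check that several of the
cases above rely on, the sketch below reads /proc/self/pagemap directly
and tests bit 62, the "swapped" bit that the pagemap case also checks.
The selftest itself goes through the pagemap_is_swapped() and
pagemap_get_entry() helpers from vm_util.h; the ex_* names here are
hypothetical and not part of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Bit 62 of a pagemap entry is set when the page is swapped out. */
#define EX_PM_SWAPPED	(1ULL << 62)

/* Hypothetical helper: decode the swapped bit of one pagemap entry. */
bool ex_entry_swapped(uint64_t entry)
{
	return (entry & EX_PM_SWAPPED) != 0;
}

/*
 * Hypothetical helper: fetch the 64-bit pagemap entry covering addr.
 * Entries are indexed by virtual page number. Returns 0 on success.
 */
int ex_read_entry(int pagemap_fd, const void *addr, uint64_t *entry)
{
	long psize = sysconf(_SC_PAGESIZE);
	off_t off;

	assert(psize > 0);
	off = (off_t)((uintptr_t)addr / (uintptr_t)psize * sizeof(*entry));
	if (pread(pagemap_fd, entry, sizeof(*entry), off) !=
	    (ssize_t)sizeof(*entry))
		return -1;
	return 0;
}

/* True only if every base page in [addr, addr + size) is swapped out. */
bool ex_range_swapped(int pagemap_fd, const char *addr, unsigned long size)
{
	unsigned long step = (unsigned long)sysconf(_SC_PAGESIZE);
	unsigned long off;
	uint64_t entry;

	for (off = 0; off < size; off += step) {
		if (ex_read_entry(pagemap_fd, addr + off, &entry) ||
		    !ex_entry_swapped(entry))
			return false;
	}
	return true;
}
```

With a descriptor opened on /proc/self/pagemap, a call such as
ex_range_swapped(fd, mem, pmd_size) would play roughly the role that
check_swapped() plays in the test file.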

When zswap is enabled, PMD-order consumers may split a PMD swap entry
and retry through the PTE path because zswap stores the range as
per-page entries. In that configuration, the tests still verify data
correctness and log that the PMD mapping assertion is skipped. With
zswap disabled, the tests assert that write faults, UFFDIO_MOVE,
MADV_WILLNEED, and swapoff restore a PMD-mapped THP where expected.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 tools/testing/selftests/mm/Makefile   |   1 +
 tools/testing/selftests/mm/pmd_swap.c | 702 ++++++++++++++++++++++++++
 2 files changed, 703 insertions(+)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index ed321ae709da..4561fa2ac80f 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -105,6 +105,7 @@ TEST_GEN_FILES += guard-regions
 TEST_GEN_FILES += merge
 TEST_GEN_FILES += rmap
 TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += pmd_swap
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/pmd_swap.c b/tools/testing/selftests/mm/pmd_swap.c
new file mode 100644
index 000000000000..b4a60a6b50d9
--- /dev/null
+++ b/tools/testing/selftests/mm/pmd_swap.c
@@ -0,0 +1,702 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test PMD-level swap entries.
+ *
+ * Verifies that when a PMD-mapped THP is swapped out the kernel installs
+ * a single PMD-level swap entry (instead of splitting into 512 PTE-level
+ * entries), and that operations on the swapped region behave correctly:
+ *   basic            - swap out + swap in preserves data
+ *   fork             - parent and child both see the data
+ *   fork_cow         - COW after fork keeps parent's data isolated
+ *   cycles           - repeated swap out/in does not corrupt data
+ *   write            - faulting in via a write restores a PMD-mapped THP
+ *   munmap           - munmap on a PMD swap entry frees swap slots cleanly
+ *   mprotect         - mprotect on a PMD swap entry preserves data
+ *   mremap           - mremap on a PMD swap entry preserves data
+ *   pagemap          - pagemap reports the entries as swapped
+ *   mincore          - mincore walks a PMD swap entry without faulting it in
+ *   madvise_free     - MADV_FREE on a PMD swap entry frees the slots, reads see zeroes
+ *   madvise_willneed - MADV_WILLNEED handles a PMD swap entry
+ *   uffdio_move      - UFFDIO_MOVE moves a PMD swap entry
+ *   swapoff          - swapoff handles PMD swap entries (needs PMD_SWAP_DEVICE)
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+#include <sys/random.h>
+#include <sys/swap.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <time.h>
+
+#include "kselftest_harness.h"
+#include "vm_util.h"
+
+#define ZSWAP_ENABLED_PATH "/sys/module/zswap/parameters/enabled"
+
+static bool check_swapped(int pagemap_fd, char *addr, unsigned long size)
+{
+	unsigned long off;
+
+	for (off = 0; off < size; off += getpagesize())
+		if (!pagemap_is_swapped(pagemap_fd, addr + off))
+			return false;
+	return true;
+}
+
+static bool zswap_enabled(void)
+{
+	char enabled = 0;
+	FILE *f;
+
+	f = fopen(ZSWAP_ENABLED_PATH, "r");
+	if (!f)
+		return false;
+
+	if (fscanf(f, " %c", &enabled) != 1)
+		enabled = 0;
+	fclose(f);
+
+	return enabled == 'Y' || enabled == 'y' || enabled == '1';
+}
+
+static bool swap_available(int pagemap_fd)
+{
+	char *p;
+	bool ret;
+
+	p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (p == MAP_FAILED)
+		return false;
+
+	memset(p, 0xab, getpagesize());
+	madvise(p, getpagesize(), MADV_PAGEOUT);
+	ret = pagemap_is_swapped(pagemap_fd, p);
+	munmap(p, getpagesize());
+	return ret;
+}
+
+static unsigned long read_vm_event(const char *name)
+{
+	char line[256];
+	size_t name_len = strlen(name);
+	unsigned long val = 0;
+	FILE *f;
+
+	f = fopen("/proc/vmstat", "r");
+	if (!f)
+		return 0;
+	while (fgets(line, sizeof(line), f)) {
+		if (!strncmp(line, name, name_len) && line[name_len] == ' ') {
+			val = strtoul(line + name_len + 1, NULL, 10);
+			break;
+		}
+	}
+	fclose(f);
+	return val;
+}
+
+static unsigned int random_seed(void)
+{
+	unsigned int seed;
+
+	if (getrandom(&seed, sizeof(seed), 0) != sizeof(seed))
+		seed = (unsigned int)time(NULL);
+	return seed;
+}
+
+static unsigned char pattern_byte(unsigned int seed, unsigned long off)
+{
+	return (unsigned char)(seed + off);
+}
+
+static void fill_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+	unsigned long i;
+
+	for (i = 0; i < size; i++)
+		buf[i] = (char)pattern_byte(seed, i);
+}
+
+static bool verify_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+	unsigned long i;
+
+	for (i = 0; i < size; i++)
+		if ((unsigned char)buf[i] != pattern_byte(seed, i))
+			return false;
+	return true;
+}
+
+/*
+ * mmap an anonymous PMD-aligned region of pmd_size bytes. Over-allocates
+ * by one PMD and trims the unaligned head/tail so the returned address is
+ * PMD-aligned (required for whole-PMD UFFDIO_MOVE).
+ */
+static char *mmap_pmd_aligned(unsigned long pmd_size)
+{
+	unsigned long pad = pmd_size;
+	char *raw, *aligned;
+
+	raw = mmap(NULL, pmd_size + pad, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (raw == MAP_FAILED)
+		return MAP_FAILED;
+
+	aligned = (char *)(((uintptr_t)raw + pmd_size - 1) & ~(pmd_size - 1));
+	if (aligned != raw)
+		munmap(raw, aligned - raw);
+	if (aligned + pmd_size != raw + pmd_size + pad)
+		munmap(aligned + pmd_size,
+		       (raw + pmd_size + pad) - (aligned + pmd_size));
+	return aligned;
+}
+
+/*
+ * mmap a PMD-aligned PMD-sized region, request THP, fill with a pattern,
+ * and swap it out. Verifies via the thp_swpout_pmd vmstat counter that
+ * the swap-out installed a PMD swap entry rather than splitting to PTEs.
+ */
+static char *alloc_fill_swap_thp(unsigned long pmd_size, int pagemap_fd,
+				 unsigned int seed)
+{
+	unsigned long pmd_before, pmd_after;
+	char *mem;
+
+	mem = mmap_pmd_aligned(pmd_size);
+	if (mem == MAP_FAILED)
+		return MAP_FAILED;
+
+	madvise(mem, pmd_size, MADV_HUGEPAGE);
+	fill_pattern(mem, pmd_size, seed);
+
+	pmd_before = read_vm_event("thp_swpout_pmd");
+
+	if (madvise(mem, pmd_size, MADV_PAGEOUT) ||
+	    !check_swapped(pagemap_fd, mem, pmd_size)) {
+		munmap(mem, pmd_size);
+		return MAP_FAILED;
+	}
+
+	pmd_after = read_vm_event("thp_swpout_pmd");
+	printf("# thp_swpout_pmd: %lu -> %lu\n", pmd_before, pmd_after);
+	if (pmd_after - pmd_before < 1) {
+		munmap(mem, pmd_size);
+		return MAP_FAILED;
+	}
+	return mem;
+}
+
+FIXTURE(pmd_swap)
+{
+	unsigned long pmd_size;
+	int pagemap_fd;
+	unsigned int seed;
+	bool zswap_enabled;
+};
+
+FIXTURE_SETUP(pmd_swap)
+{
+	self->pagemap_fd = -1;
+
+	self->pmd_size = read_pmd_pagesize();
+	if (!self->pmd_size)
+		SKIP(return, "Cannot determine PMD size\n");
+
+	self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (self->pagemap_fd < 0)
+		SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+	if (!swap_available(self->pagemap_fd))
+		SKIP(return, "Swap not available or not working\n");
+
+	self->seed = random_seed();
+	self->zswap_enabled = zswap_enabled();
+}
+
+FIXTURE_TEARDOWN(pmd_swap)
+{
+	if (self->pagemap_fd >= 0)
+		close(self->pagemap_fd);
+}
+
+/*
+ * Allocate a PMD-sized THP, write a pattern, swap it out, read it back,
+ * verify the pattern.
+ */
+TEST_F(pmd_swap, basic)
+{
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Allocate a THP, swap it out, fork, verify both parent and child see
+ * the correct data.
+ */
+TEST_F(pmd_swap, fork)
+{
+	char *mem;
+	pid_t pid;
+	int status;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(verify_pattern(mem, self->pmd_size, self->seed) ? 0 : 1);
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap out, fork, then have parent and child write different patterns.
+ * Exercises COW on shared PMD swap entries: writes after fork must
+ * trigger copy-on-write so the parent's data stays isolated.
+ */
+TEST_F(pmd_swap, fork_cow)
+{
+	unsigned int parent_seed = self->seed;
+	unsigned int child_seed = ~self->seed;
+	char *mem;
+	pid_t pid;
+	int status;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, parent_seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		fill_pattern(mem, self->pmd_size, child_seed);
+		_exit(verify_pattern(mem, self->pmd_size, child_seed) ? 0 : 1);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, parent_seed));
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap a THP out and in repeatedly without data corruption.
+ */
+TEST_F(pmd_swap, cycles)
+{
+	const int num_cycles = 5;
+	char *mem;
+	int cycle;
+
+	for (cycle = 0; cycle < num_cycles; cycle++) {
+		unsigned int seed = self->seed + cycle;
+
+		mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+		if (mem == MAP_FAILED)
+			SKIP(return, "Could not create swapped THP at cycle %d\n",
+			     cycle);
+
+		ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+		munmap(mem, self->pmd_size);
+	}
+}
+
+/*
+ * Swap out, fault in via a write to the first page, verify the write
+ * reinstates a THP mapping and the rest of the THP is preserved.
+ */
+TEST_F(pmd_swap, write)
+{
+	unsigned int seed = self->seed;
+	char *mem;
+	unsigned long i;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	mem[0] = 0xbb;
+	ASSERT_EQ(mem[0], (char)0xbb);
+
+	if (self->zswap_enabled) {
+		TH_LOG("zswap is enabled, so PMD mapping is not checked");
+	} else {
+		ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size));
+	}
+
+	for (i = 1; i < self->pmd_size; i++)
+		ASSERT_EQ((unsigned char)mem[i], pattern_byte(seed, i));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * munmap while the folio is swapped out. Exercises zap_huge_pmd() on a
+ * PMD swap entry — must free the swap slots without trying to look up
+ * a folio.
+ */
+TEST_F(pmd_swap, munmap)
+{
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Change protection on a swapped PMD entry, then fault back in and
+ * verify data. Exercises change_non_present_huge_pmd().
+ */
+TEST_F(pmd_swap, mprotect)
+{
+	unsigned int seed = self->seed;
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ), 0);
+	ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ | PROT_WRITE), 0);
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * UFFDIO_MOVE a PMD swap entry from src to a registered dst. Exercises
+ * move_pages_huge_pmd() handling of pmd_is_swap_entry: the whole PMD swap
+ * entry must move to dst without splitting, and the destination must
+ * read back the original pattern after a swap-in fault.
+ */
+TEST_F(pmd_swap, uffdio_move)
+{
+	unsigned int seed = self->seed;
+	struct uffdio_register reg = {};
+	struct uffdio_move move = {};
+	struct uffdio_api api = {};
+	char *src, *dst;
+	int uffd;
+
+	dst = mmap_pmd_aligned(self->pmd_size);
+	if (dst == MAP_FAILED)
+		SKIP(return, "Could not mmap aligned dst\n");
+
+	src = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (src == MAP_FAILED) {
+		munmap(dst, self->pmd_size);
+		SKIP(return, "Could not create swapped THP\n");
+	}
+	if ((uintptr_t)src & (self->pmd_size - 1)) {
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "src not PMD-aligned\n");
+	}
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "userfaultfd unavailable\n");
+	}
+
+	api.api = UFFD_API;
+	api.features = UFFD_FEATURE_MOVE;
+	if (ioctl(uffd, UFFDIO_API, &api) ||
+	    !(api.features & UFFD_FEATURE_MOVE)) {
+		close(uffd);
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "UFFD_FEATURE_MOVE unsupported\n");
+	}
+
+	reg.range.start = (unsigned long)dst;
+	reg.range.len = self->pmd_size;
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
+		close(uffd);
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "UFFDIO_REGISTER failed\n");
+	}
+
+	move.dst = (unsigned long)dst;
+	move.src = (unsigned long)src;
+	move.len = self->pmd_size;
+	if (ioctl(uffd, UFFDIO_MOVE, &move)) {
+		close(uffd);
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		ASSERT_EQ(errno, 0);
+	}
+	ASSERT_EQ(move.move, self->pmd_size);
+
+	/* dst inherits the PMD swap entry; reading it must restore the data. */
+	ASSERT_TRUE(check_swapped(self->pagemap_fd, dst, self->pmd_size));
+	ASSERT_TRUE(verify_pattern(dst, self->pmd_size, seed));
+	if (self->zswap_enabled) {
+		TH_LOG("zswap is enabled, so PMD mapping is not checked");
+	} else {
+		/* The whole-PMD path must reinstate a THP, not 512 PTE folios. */
+		ASSERT_TRUE(check_huge_anon(dst, 1, self->pmd_size));
+	}
+
+	close(uffd);
+	munmap(src, self->pmd_size);
+	munmap(dst, self->pmd_size);
+}
+
+/*
+ * Move a swapped PMD entry to a new address, fault in, verify data.
+ * Exercises move_huge_pmd() and move_soft_dirty_pmd().
+ */
+TEST_F(pmd_swap, mremap)
+{
+	unsigned int seed = self->seed;
+	char *mem, *new_mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	new_mem = mremap(mem, self->pmd_size, self->pmd_size, MREMAP_MAYMOVE);
+	if (new_mem == MAP_FAILED) {
+		munmap(mem, self->pmd_size);
+		ASSERT_NE(new_mem, MAP_FAILED);
+	}
+
+	ASSERT_TRUE(verify_pattern(new_mem, self->pmd_size, seed));
+
+	munmap(new_mem, self->pmd_size);
+}
+
+/*
+ * Read /proc/self/pagemap on a PMD swap entry. Exercises the pagemap
+ * PMD walker which must handle PMD swap entries without trying to
+ * convert them to a page via softleaf_to_page().
+ */
+TEST_F(pmd_swap, pagemap)
+{
+	char *mem;
+	uint64_t entry;
+	unsigned long off;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	for (off = 0; off < self->pmd_size; off += getpagesize()) {
+		entry = pagemap_get_entry(self->pagemap_fd, mem + off);
+		/* Bit 62 = swapped */
+		ASSERT_TRUE(entry & (1ULL << 62));
+	}
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * mincore() on a swapped-out PMD-mapped THP must handle the non-present PMD
+ * entry in place. The call must not fault the PMD back in or split the entry.
+ */
+TEST_F(pmd_swap, mincore)
+{
+	unsigned long pages = self->pmd_size / getpagesize();
+	unsigned char *vec;
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	vec = calloc(pages, sizeof(*vec));
+	ASSERT_NE(vec, NULL) {
+		munmap(mem, self->pmd_size);
+	}
+
+	ASSERT_EQ(mincore(mem, self->pmd_size, vec), 0) {
+		free(vec);
+		munmap(mem, self->pmd_size);
+	}
+	ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size)) {
+		free(vec);
+		munmap(mem, self->pmd_size);
+	}
+
+	free(vec);
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * MADV_FREE on a swapped-out PMD must free the swap slots and clear the
+ * entry. After the call, pagemap must no longer report the pages as
+ * swapped, and accessing the region must yield zero pages.
+ */
+TEST_F(pmd_swap, madvise_free)
+{
+	char *mem;
+	unsigned long i;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+	ASSERT_EQ(madvise(mem, self->pmd_size, MADV_FREE), 0);
+	ASSERT_FALSE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+
+	for (i = 0; i < self->pmd_size; i += getpagesize())
+		ASSERT_EQ(mem[i], 0);
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * MADV_WILLNEED on a swapped-out PMD-mapped THP may schedule PMD-order
+ * swapin I/O, find the PMD-sized folio already resident in the swap cache,
+ * or split to the PTE path when zswap has per-page state for the range.
+ */
+TEST_F(pmd_swap, madvise_willneed)
+{
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_EQ(madvise(mem, self->pmd_size, MADV_WILLNEED), 0);
+	ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+
+	/* First touch faults the data back in. */
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+	if (self->zswap_enabled)
+		TH_LOG("zswap is enabled, so PMD mapping is not checked");
+	else
+		ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * swapoff requires a dedicated swap device path. Use a separate fixture
+ * that picks the device up from the PMD_SWAP_DEVICE environment variable
+ * and skips when unset.
+ */
+FIXTURE(pmd_swap_swapoff)
+{
+	unsigned long pmd_size;
+	int pagemap_fd;
+	const char *swap_dev;
+	unsigned int seed;
+	bool zswap_enabled;
+};
+
+FIXTURE_SETUP(pmd_swap_swapoff)
+{
+	self->pagemap_fd = -1;
+	self->swap_dev = getenv("PMD_SWAP_DEVICE");
+	if (!self->swap_dev)
+		SKIP(return, "PMD_SWAP_DEVICE env var not set\n");
+
+	self->pmd_size = read_pmd_pagesize();
+	if (!self->pmd_size)
+		SKIP(return, "Cannot determine PMD size\n");
+
+	self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (self->pagemap_fd < 0)
+		SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+	if (!swap_available(self->pagemap_fd))
+		SKIP(return, "Swap not available or not working\n");
+
+	self->seed = random_seed();
+	self->zswap_enabled = zswap_enabled();
+}
+
+FIXTURE_TEARDOWN(pmd_swap_swapoff)
+{
+	if (self->pagemap_fd >= 0)
+		close(self->pagemap_fd);
+}
+
+/*
+ * Swap out a THP, then turn off swap. Verify data is intact. When zswap is
+ * not active, the PMD-order swapoff path should preserve the huge mapping.
+ */
+TEST_F(pmd_swap_swapoff, basic)
+{
+	unsigned int seed = self->seed;
+	char *mem;
+	int ret, err;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ret = swapoff(self->swap_dev);
+	err = errno;
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("swapoff(%s) failed: %s", self->swap_dev, strerror(err));
+		munmap(mem, self->pmd_size);
+	}
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed)) {
+		swapon(self->swap_dev, 0);
+		munmap(mem, self->pmd_size);
+	}
+
+	if (self->zswap_enabled) {
+		TH_LOG("zswap is enabled, so PMD mapping is not checked");
+	} else {
+		ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size)) {
+			swapon(self->swap_dev, 0);
+			munmap(mem, self->pmd_size);
+		}
+	}
+
+	ret = swapon(self->swap_dev, 0);
+	err = errno;
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("swapon(%s) failed: %s", self->swap_dev, strerror(err));
+		munmap(mem, self->pmd_size);
+	}
+
+	munmap(mem, self->pmd_size);
+}
+
+TEST_HARNESS_MAIN
-- 
2.53.0-Meta



* Re: [PATCH v3 11/11] selftests/mm: add PMD swap entry tests
  2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
@ 2026-07-04  6:27   ` kernel test robot
  2026-07-04  8:30   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2026-07-04  6:27 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, david, chrisl, kasong, ljs, ziy
  Cc: oe-kbuild-all, Linux Memory Management List, ying.huang,
	Baoquan He, willy, youngjun.park, hannes, riel, shakeel.butt,
	alex, kas, baohua, dev.jain, baolin.wang, npache, Liam R. Howlett,
	ryan.roberts, Vlastimil Babka, lance.yang, linux-kernel, nphamcs,
	shikemeng, kernel-team, Usama Arif

Hi Usama,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-add-PMD-swap-entry-detection-support/20260704-014151
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260703173903.3789516-12-usama.arif%40linux.dev
patch subject: [PATCH v3 11/11] selftests/mm: add PMD swap entry tests
config: riscv-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260704/202607040838.eEdmRDmU-lkp@intel.com/config)
compiler: riscv64-linux-gnu-gcc (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260704/202607040838.eEdmRDmU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202607040838.eEdmRDmU-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from ./include/linux/pgtable.h:6,
                    from ./include/linux/mm.h:31,
                    from arch/riscv/kernel/asm-offsets.c:9:
   ./arch/riscv/include/asm/pgtable.h: In function 'pmd_swp_exclusive':
>> ./arch/riscv/include/asm/pgtable.h:935:16: error: implicit declaration of function 'pte_swp_exclusive'; did you mean 'pmd_swp_exclusive'? [-Wimplicit-function-declaration]
     935 |         return pte_swp_exclusive(pmd_pte(pmd));
         |                ^~~~~~~~~~~~~~~~~
         |                pmd_swp_exclusive
   ./arch/riscv/include/asm/pgtable.h: In function 'pmd_swp_mkexclusive':
>> ./arch/riscv/include/asm/pgtable.h:940:24: error: implicit declaration of function 'pte_swp_mkexclusive'; did you mean 'pmd_swp_mkexclusive'? [-Wimplicit-function-declaration]
     940 |         return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
         |                        ^~~~~~~~~~~~~~~~~~~
         |                        pmd_swp_mkexclusive
>> ./arch/riscv/include/asm/pgtable.h:940:24: error: incompatible type for argument 1 of 'pte_pmd'
     940 |         return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |                        |
         |                        int
   ./arch/riscv/include/asm/pgtable.h:758:35: note: expected 'pte_t' but argument is of type 'int'
     758 | static inline pmd_t pte_pmd(pte_t pte)
         |                             ~~~~~~^~~
   ./arch/riscv/include/asm/pgtable.h: In function 'pmd_swp_clear_exclusive':
>> ./arch/riscv/include/asm/pgtable.h:945:24: error: implicit declaration of function 'pte_swp_clear_exclusive'; did you mean 'pmd_swp_clear_exclusive'? [-Wimplicit-function-declaration]
     945 |         return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
         |                        pmd_swp_clear_exclusive
   ./arch/riscv/include/asm/pgtable.h:945:24: error: incompatible type for argument 1 of 'pte_pmd'
     945 |         return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |                        |
         |                        int
   ./arch/riscv/include/asm/pgtable.h:758:35: note: expected 'pte_t' but argument is of type 'int'
     758 | static inline pmd_t pte_pmd(pte_t pte)
         |                             ~~~~~~^~~
   ./arch/riscv/include/asm/pgtable.h: At top level:
>> ./arch/riscv/include/asm/pgtable.h:1218:20: error: conflicting types for 'pte_swp_exclusive'; have 'bool(pte_t)' {aka '_Bool(pte_t)'}
    1218 | static inline bool pte_swp_exclusive(pte_t pte)
         |                    ^~~~~~~~~~~~~~~~~
   ./arch/riscv/include/asm/pgtable.h:935:16: note: previous implicit declaration of 'pte_swp_exclusive' with type 'int()'
     935 |         return pte_swp_exclusive(pmd_pte(pmd));
         |                ^~~~~~~~~~~~~~~~~
>> ./arch/riscv/include/asm/pgtable.h:1223:21: error: conflicting types for 'pte_swp_mkexclusive'; have 'pte_t(pte_t)'
    1223 | static inline pte_t pte_swp_mkexclusive(pte_t pte)
         |                     ^~~~~~~~~~~~~~~~~~~
   ./arch/riscv/include/asm/pgtable.h:940:24: note: previous implicit declaration of 'pte_swp_mkexclusive' with type 'int()'
     940 |         return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
         |                        ^~~~~~~~~~~~~~~~~~~
>> ./arch/riscv/include/asm/pgtable.h:1228:21: error: conflicting types for 'pte_swp_clear_exclusive'; have 'pte_t(pte_t)'
    1228 | static inline pte_t pte_swp_clear_exclusive(pte_t pte)
         |                     ^~~~~~~~~~~~~~~~~~~~~~~
   ./arch/riscv/include/asm/pgtable.h:945:24: note: previous implicit declaration of 'pte_swp_clear_exclusive' with type 'int()'
     945 |         return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
         |                        ^~~~~~~~~~~~~~~~~~~~~~~


vim +935 ./arch/riscv/include/asm/pgtable.h

   932	
   933	static inline bool pmd_swp_exclusive(pmd_t pmd)
   934	{
 > 935		return pte_swp_exclusive(pmd_pte(pmd));
   936	}
   937	
   938	static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
   939	{
 > 940		return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
   941	}
   942	
   943	static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
   944	{
 > 945		return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
   946	}
   947	
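
The errors above point to an ordering problem: the pmd_swp_*() wrappers
at lines 933-946 call pte_swp_*() helpers that are only defined further
down in the same header (line 1218 onward), so the compiler falls back
to implicit declarations. The following is a minimal, self-contained
sketch of the same pattern with the definitions in a workable order.
The ex_* types, names, and bit value are hypothetical stand-ins, not the
riscv definitions, and whether the real fix moves the wrappers or adds
forward declarations is for the author to decide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's pte_t and pmd_t. */
typedef struct { unsigned long val; } ex_pte_t;
typedef struct { unsigned long val; } ex_pmd_t;

/* Arbitrary bit chosen for illustration, not the riscv encoding. */
#define EX_SWP_EXCLUSIVE	(1UL << 1)

static inline ex_pte_t ex_pmd_pte(ex_pmd_t pmd)
{
	return (ex_pte_t){ .val = pmd.val };
}

static inline ex_pmd_t ex_pte_pmd(ex_pte_t pte)
{
	return (ex_pmd_t){ .val = pte.val };
}

/* PTE helpers are defined first so the PMD wrappers see real prototypes. */
static inline bool ex_pte_swp_exclusive(ex_pte_t pte)
{
	return (pte.val & EX_SWP_EXCLUSIVE) != 0;
}

static inline ex_pte_t ex_pte_swp_mkexclusive(ex_pte_t pte)
{
	pte.val |= EX_SWP_EXCLUSIVE;
	return pte;
}

static inline ex_pte_t ex_pte_swp_clear_exclusive(ex_pte_t pte)
{
	pte.val &= ~EX_SWP_EXCLUSIVE;
	return pte;
}

/* PMD wrappers mirror the shape of the excerpt above. */
static inline bool ex_pmd_swp_exclusive(ex_pmd_t pmd)
{
	return ex_pte_swp_exclusive(ex_pmd_pte(pmd));
}

static inline ex_pmd_t ex_pmd_swp_mkexclusive(ex_pmd_t pmd)
{
	return ex_pte_pmd(ex_pte_swp_mkexclusive(ex_pmd_pte(pmd)));
}

static inline ex_pmd_t ex_pmd_swp_clear_exclusive(ex_pmd_t pmd)
{
	return ex_pte_pmd(ex_pte_swp_clear_exclusive(ex_pmd_pte(pmd)));
}

/* Set and clear the flag through the PMD wrappers and check each step. */
void ex_roundtrip_check(void)
{
	ex_pmd_t pmd = { .val = 0 };

	pmd = ex_pmd_swp_mkexclusive(pmd);
	assert(ex_pmd_swp_exclusive(pmd));
	pmd = ex_pmd_swp_clear_exclusive(pmd);
	assert(!ex_pmd_swp_exclusive(pmd));
}
```

Swapping the two groups of helpers in this sketch should reproduce the
same class of implicit-declaration and conflicting-type diagnostics
with a recent GCC.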

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v3 11/11] selftests/mm: add PMD swap entry tests
  2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
  2026-07-04  6:27   ` kernel test robot
@ 2026-07-04  8:30   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2026-07-04  8:30 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, david, chrisl, kasong, ljs, ziy
  Cc: oe-kbuild-all, Linux Memory Management List, ying.huang,
	Baoquan He, willy, youngjun.park, hannes, riel, shakeel.butt,
	alex, kas, baohua, dev.jain, baolin.wang, npache, Liam R. Howlett,
	ryan.roberts, Vlastimil Babka, lance.yang, linux-kernel, nphamcs,
	shikemeng, kernel-team, Usama Arif

Hi Usama,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-add-PMD-swap-entry-detection-support/20260704-014151
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260703173903.3789516-12-usama.arif%40linux.dev
patch subject: [PATCH v3 11/11] selftests/mm: add PMD swap entry tests
config: i386-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260704/202607041039.WHXO0OYv-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260704/202607041039.WHXO0OYv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202607041039.WHXO0OYv-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from ./include/linux/mm.h:31,
                    from ./include/linux/memcontrol.h:21,
                    from ./include/linux/swap.h:9,
                    from ./include/linux/suspend.h:5,
                    from arch/x86/kernel/asm-offsets.c:14:
>> ./include/linux/pgtable.h:1921:21: error: redefinition of 'pmd_swp_mkexclusive'
    1921 | static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
         |                     ^~~~~~~~~~~~~~~~~~~
   In file included from ./arch/x86/include/asm/tlbflush.h:17,
                    from ./arch/x86/include/asm/uaccess.h:17,
                    from ./include/linux/uaccess.h:13,
                    from ./include/linux/sched/task.h:13,
                    from ./include/linux/sched/signal.h:9,
                    from ./include/linux/rcuwait.h:6,
                    from ./include/linux/percpu-rwsem.h:7,
                    from ./include/linux/fs/super_types.h:13,
                    from ./include/linux/fs/super.h:5,
                    from ./include/linux/fs.h:5,
                    from ./include/linux/cgroup.h:17,
                    from ./include/linux/memcontrol.h:13:
   ./arch/x86/include/asm/pgtable.h:1532:21: note: previous definition of 'pmd_swp_mkexclusive' with type 'pmd_t(pmd_t)'
    1532 | static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
         |                     ^~~~~~~~~~~~~~~~~~~
>> ./include/linux/pgtable.h:1926:20: error: conflicting types for 'pmd_swp_exclusive'; have 'bool(pmd_t)' {aka '_Bool(pmd_t)'}
    1926 | static inline bool pmd_swp_exclusive(pmd_t pmd)
         |                    ^~~~~~~~~~~~~~~~~
   ./arch/x86/include/asm/pgtable.h:1537:19: note: previous definition of 'pmd_swp_exclusive' with type 'int(pmd_t)'
    1537 | static inline int pmd_swp_exclusive(pmd_t pmd)
         |                   ^~~~~~~~~~~~~~~~~
>> ./include/linux/pgtable.h:1931:21: error: redefinition of 'pmd_swp_clear_exclusive'
    1931 | static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
         |                     ^~~~~~~~~~~~~~~~~~~~~~~
   ./arch/x86/include/asm/pgtable.h:1542:21: note: previous definition of 'pmd_swp_clear_exclusive' with type 'pmd_t(pmd_t)'
    1542 | static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
         |                     ^~~~~~~~~~~~~~~~~~~~~~~


vim +/pmd_swp_mkexclusive +1921 ./include/linux/pgtable.h

  1919	
  1920	#ifndef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
> 1921	static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
  1922	{
  1923		return pmd;
  1924	}
  1925	
> 1926	static inline bool pmd_swp_exclusive(pmd_t pmd)
  1927	{
  1928		return false;
  1929	}
  1930	
> 1931	static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
  1932	{
  1933		return pmd;
  1934	}
  1935	#endif
  1936	
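
The excerpt above shows the generic stubs guarded only by CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF, while the x86 header already defines the same three helpers on this i386 config (and declares pmd_swp_exclusive() as returning int rather than bool). One idiom that could avoid this kind of clash is a per-helper guard: the arch header defines a same-named macro next to each helper it provides, and the generic header supplies a fallback only when that macro is absent. The sketch below is a user-space illustration of that idiom under stated assumptions; the pmd_t struct and the MOCK_SWP_EXCLUSIVE bit are invented for the example, and it is not a claim about how the series will be fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock type standing in for the kernel's pmd_t (assumption). */
typedef struct { uint64_t val; } pmd_t;

/* Hypothetical bit; not the real x86 encoding. */
#define MOCK_SWP_EXCLUSIVE (UINT64_C(1) << 2)

/* ---- "arch header": provides the helpers and advertises each one ---- */
static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
{
	pmd.val |= MOCK_SWP_EXCLUSIVE;
	return pmd;
}
#define pmd_swp_mkexclusive pmd_swp_mkexclusive

/* Return type kept as bool so arch and generic prototypes agree. */
static inline bool pmd_swp_exclusive(pmd_t pmd)
{
	return (pmd.val & MOCK_SWP_EXCLUSIVE) != 0;
}
#define pmd_swp_exclusive pmd_swp_exclusive

static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
{
	pmd.val &= ~MOCK_SWP_EXCLUSIVE;
	return pmd;
}
#define pmd_swp_clear_exclusive pmd_swp_clear_exclusive

/* ---- "generic header": fallbacks only for helpers the arch lacks ---- */
#ifndef pmd_swp_mkexclusive
static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
{
	return pmd;
}
#endif

#ifndef pmd_swp_exclusive
static inline bool pmd_swp_exclusive(pmd_t pmd)
{
	return false;
}
#endif

#ifndef pmd_swp_clear_exclusive
static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
{
	return pmd;
}
#endif
```

Because every fallback is skipped here, the arch versions are the ones that get called. Removing the three #define lines brings back redefinition errors like the ones listed above.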

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Thread overview: 14+ messages
2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
2026-07-03 17:38 ` [PATCH v3 02/11] mm: add PMD swap entry splitting support Usama Arif
2026-07-03 17:38 ` [PATCH v3 03/11] mm: handle PMD swap entries in fork path Usama Arif
2026-07-03 17:38 ` [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin Usama Arif
2026-07-03 17:38 ` [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-07-03 17:38 ` [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-07-03 17:38 ` [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
2026-07-03 17:38 ` [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-07-03 17:38 ` [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-07-03 17:38 ` [PATCH v3 10/11] mm: install PMD swap entries on swap-out Usama Arif
2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
2026-07-04  6:27   ` kernel test robot
2026-07-04  8:30   ` kernel test robot
