* [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
@ 2026-04-27 10:01 Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
` (14 more replies)
0 siblings, 15 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
unmap.
This series introduces a PMD-level swap entry. The huge mapping is
preserved across the swap round-trip, and do_huge_pmd_swap_page()
resolves the entire 2 MB region in a single fault on swap-in; no
khugepaged involvement is needed. swap_map metadata is identical
either way (512 single-slot counts), so the PTE split buys nothing
on the swap side; it is purely a page-table representation change.
This work was prompted by Hugh reporting that one of the major
blockers for lazy page table deposit is the lack of PMD swap
entries [1]. However, this series has benefits of its own:
- The huge mapping is restored on swap-in. Today, even when the
folio is still in the swap cache as a single 2 MB folio, the
swap-in path installs 512 PTE mappings -- the PMD mapping is gone,
the freshly-materialised PTE table sticks around, and only
khugepaged can later collapse the range back into a THP.
do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
one fault, with no khugepaged involvement.
- Memory saved per swapped-out THP *once lazy page table deposit is
merged* [2]. With lazy page table deposit [2], splitting a PMD into
512 PTE swap entries forces allocation of a 4 KB PTE table page.
The new path leaves the pgtable hierarchy at PMD level and avoids
that allocation entirely.
This saves memory during swapping, which typically happens under
memory pressure -- exactly when allocations are most likely to
fail.
- Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
visit one PMD entry instead of 512 PTEs, reducing traversal
time and lock-hold windows.
The swap entry value is identical to 512 PTE swap entries (same
type, same starting offset), so swap_map refcounting is unchanged.
Only the page-table representation differs; the swap slot allocator,
swap I/O, and swap cache are untouched. The new path falls back to
the existing PTE-split path whenever a PMD-order resource is
unavailable: zswap enabled, non-contiguous swap allocation
(THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
or fork, racing folio split, or rmap-driven split on a swapcache
folio. Walkers that previously assumed every non-present PMD encodes
a PFN (migration / device_private) are taught to recognise PMD swap
entries.
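For reference, the i-th PTE entry produced by a split is derived
from the PMD entry with the existing swp_entry() helpers. A minimal
sketch (the helper name here is made up for illustration; the real
code lives in the split path of patch 6):

  static pte_t pmd_swap_sub_pte(softleaf_t entry, int i)
  {
          /* Sub-slot i of the contiguous 512-slot range. */
          swp_entry_t sub = swp_entry(swp_type(entry),
                                      swp_offset(entry) + i);

          return swp_entry_to_pte(sub);
  }

Because each of the 512 slots keeps its own swap_map count,
refcounting behaves identically whether the range is mapped by one
PMD entry or by 512 PTE entries.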
Patch breakdown:
The series is ordered to preserve git bisectability: every consumer
of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
swap-in fault) lands before the producer. The swap-out path that
actually installs PMD swap entries is the very last functional patch
(12), so no intermediate commit can leave the kernel handling a
PMD swap entry it does not yet understand.
The first 4 patches are preparatory. Some of them (like the
softleaf_to_pmd() change in patch 1) are not strictly needed, but
are included to improve code quality and to make the PMD swap
entry changes integrate well with the rest of mm.
Prep patches:
1. mm: add softleaf_to_pmd() and convert existing callers
PMD counterpart to softleaf_to_pte(); needed to construct a
PMD from a swap entry in later patches.
2. mm: extract ensure_on_mmlist() helper
Hoists the "register mm with swapoff" double-checked-locking
pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
the PMD swap-out and PMD fork paths can reuse it without a
third open-coded copy.
3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
pagemap_pmd_range_thp() today calls softleaf_to_page()
unconditionally; a PMD swap entry has no PFN and would crash
it.
4. mm/huge_memory: move softleaf_to_folio() inside migration branch
change_non_present_huge_pmd() today calls softleaf_to_folio()
before branching on entry type, so a PMD swap entry would
produce a bogus folio pointer that the migration-only code
below would then dereference.
Core patches:
5. PMD swap entry detection (pmd_is_swap_entry,
softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
helpers (x86/arm64/s390/riscv/loongarch).
6. __split_huge_pmd_locked() learns to split a PMD swap entry
into 512 PTE swap entries, used as the fallback when a
PMD-order resource is unavailable.
7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
copy_pte_range().
8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
the PMD; falls back to PTE-split + unuse_pte_range() on error.
9. Walker updates: zap_huge_pmd, change_huge_pmd,
change_non_present_huge_pmd, move_soft_dirty_pmd,
clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
VM_BUG_ON extensions.
10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
entry whole via a new move_swap_pmd() helper modeled on
move_swap_pte().
11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
one shot. Handles racing splits, SWP_STABLE_WRITES read-only
mapping, immediate COW for write faults; falls back to PTE-split
on any PMD-order resource shortfall.
12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
PMD-mappable swapcache folios (when zswap is disabled), and
try_to_unmap_one() installs one PMD swap entry via
set_pmd_swap_entry() instead of splitting.
Testing:
13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
MADV_FREE, UFFDIO_MOVE, swapoff.
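The core pattern the swap-out/in tests exercise reduces to roughly
the following standalone sketch (illustrative only, not the actual
selftest; it assumes a kernel with THP enabled and swap configured,
and the madvise() calls may be best-effort no-ops on unsuitable
setups):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #define PMD_SIZE (2UL << 20)

  int main(void)
  {
          /* Over-allocate so a PMD-aligned 2 MB region can be carved out. */
          char *raw = mmap(NULL, 2 * PMD_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          char *buf;
          unsigned long i;

          if (raw == MAP_FAILED)
                  return 1;
          buf = (char *)(((unsigned long)raw + PMD_SIZE - 1) &
                         ~(PMD_SIZE - 1));

          madvise(buf, PMD_SIZE, MADV_HUGEPAGE);
          memset(buf, 0x5a, PMD_SIZE);          /* fault in, ideally one THP */
          madvise(buf, PMD_SIZE, MADV_PAGEOUT); /* push the THP out to swap */

          for (i = 0; i < PMD_SIZE; i += 4096)  /* swap back in, verify */
                  if (buf[i] != 0x5a)
                          return 1;
          puts("ok");
          return 0;
  }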
Making PMD swap entries work with zswap is a project of its own and
will be addressed in a separate follow-up series.
The patches are on top of mm-unstable from 23 April
(2bcc13c29c711381d815c1ba5d5b25737400c71a).
[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
[2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
Usama Arif (13):
mm: add softleaf_to_pmd() and convert existing callers
mm: extract ensure_on_mmlist() helper
fs/proc: use softleaf_has_pfn() in pagemap PMD walker
mm/huge_memory: move softleaf_to_folio() inside migration branch
mm: add PMD swap entry detection support
mm: add PMD swap entry splitting support
mm: handle PMD swap entries in fork path
mm: swap in PMD swap entries as whole THPs during swapoff
mm: handle PMD swap entries in non-present PMD walkers
mm: handle PMD swap entries in UFFDIO_MOVE
mm: handle PMD swap entry faults on swap-in
mm: install PMD swap entries on swap-out
selftests/mm: add PMD swap entry tests
arch/arm64/include/asm/pgtable.h | 4 +
arch/loongarch/include/asm/pgtable.h | 17 +
arch/riscv/include/asm/pgtable.h | 15 +
arch/s390/include/asm/pgtable.h | 15 +
arch/x86/include/asm/pgtable.h | 15 +
fs/proc/task_mmu.c | 47 +-
include/linux/huge_mm.h | 11 +
include/linux/leafops.h | 44 +-
include/linux/swap.h | 4 +-
include/linux/vm_event_item.h | 1 +
mm/hmm.c | 3 +-
mm/huge_memory.c | 540 +++++++++++++++++++++--
mm/internal.h | 49 +++
mm/khugepaged.c | 6 +
mm/madvise.c | 5 +-
mm/memory.c | 51 +--
mm/mempolicy.c | 2 +
mm/rmap.c | 27 +-
mm/swap.h | 7 +
mm/swap_state.c | 35 ++
mm/swapfile.c | 144 +++++-
mm/vmscan.c | 14 +-
mm/vmstat.c | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
25 files changed, 1554 insertions(+), 111 deletions(-)
create mode 100644 tools/testing/selftests/mm/pmd_swap.c
--
2.52.0
* [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
` (13 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Add softleaf_to_pmd() as the PMD counterpart to softleaf_to_pte(),
completing the symmetry of the softleaf abstraction for page table
leaf entries.
The upcoming PMD swap entry support needs to construct PMD entries
from swap entries. Converting existing swp_entry_to_pmd() callers
to softleaf_to_pmd() in a prep patch keeps the feature patches
focused on new functionality rather than mixing refactoring with
new code.
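With both directions in place, conversion round-trips for entries
that carry no overlay bits, e.g. (hypothetical assertion, not part
of the patch):

  softleaf_t entry = make_readable_migration_entry(pfn);
  pmd_t pmd = softleaf_to_pmd(entry);

  VM_WARN_ON_ONCE(softleaf_from_pmd(pmd).val != entry.val);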
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/leafops.h | 20 ++++++++++++++++++++
mm/huge_memory.c | 12 ++++++------
2 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 992cd8bd8ed0..803d312437df 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -108,6 +108,21 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
+/**
+ * softleaf_to_pmd() - Obtain a PMD entry from a leaf entry.
+ * @entry: Leaf entry.
+ *
+ * This generates an architecture-specific PMD entry that can be utilised to
+ * encode the metadata the leaf entry encodes.
+ *
+ * Returns: Architecture-specific PMD entry encoding leaf entry.
+ */
+static inline pmd_t softleaf_to_pmd(softleaf_t entry)
+{
+ /* Temporary until swp_entry_t eliminated. */
+ return swp_entry_to_pmd(entry);
+}
+
#else
static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
@@ -115,6 +130,11 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
return softleaf_mk_none();
}
+static inline pmd_t softleaf_to_pmd(softleaf_t entry)
+{
+ return __pmd(0);
+}
+
#endif
/**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..49da0746b8ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1881,7 +1881,7 @@ static void copy_huge_non_present_pmd(
if (softleaf_is_migration_write(entry) ||
softleaf_is_migration_read_exclusive(entry)) {
entry = make_readable_migration_entry(swp_offset(entry));
- pmd = swp_entry_to_pmd(entry);
+ pmd = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
if (pmd_swp_uffd_wp(*src_pmd))
@@ -1894,7 +1894,7 @@ static void copy_huge_non_present_pmd(
*/
if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
- pmd = swp_entry_to_pmd(entry);
+ pmd = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
@@ -2632,12 +2632,12 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
entry = make_readable_exclusive_migration_entry(swp_offset(entry));
else
entry = make_readable_migration_entry(swp_offset(entry));
- newpmd = swp_entry_to_pmd(entry);
+ newpmd = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
} else if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
- newpmd = swp_entry_to_pmd(entry);
+ newpmd = softleaf_to_pmd(entry);
} else {
newpmd = *pmd;
}
@@ -5014,7 +5014,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
entry = make_migration_entry_young(entry);
if (pmd_dirty(pmdval))
entry = make_migration_entry_dirty(entry);
- pmdswp = swp_entry_to_pmd(entry);
+ pmdswp = softleaf_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
if (pmd_uffd_wp(pmdval))
@@ -5065,7 +5065,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
else
entry = make_readable_device_private_entry(
page_to_pfn(new));
- pmde = swp_entry_to_pmd(entry);
+ pmde = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_swp_mksoft_dirty(pmde);
--
2.52.0
* [PATCH 02/13] mm: extract ensure_on_mmlist() helper
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
` (12 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
When a swap entry is installed in a page table, the mm must be added
to init_mm.mmlist so that swapoff can find and unuse its swap entries.
This double-checked locking pattern is currently open-coded in
try_to_unmap_one() and copy_nonpresent_pte().
Move it into ensure_on_mmlist() in mm/internal.h and convert both
callers so it can be reused by upcoming PMD-level swap entry code
paths that also need to register the mm with swapoff.
copy_nonpresent_pte() previously inserted into &src_mm->mmlist rather
than &init_mm.mmlist, but the insertion point is irrelevant: mmlist
is a circular list and swapoff walks all of it starting from
init_mm.mmlist, so only membership matters, not position.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/internal.h | 13 +++++++++++++
mm/memory.c | 9 +--------
mm/rmap.c | 7 +------
3 files changed, 15 insertions(+), 14 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..7de489689f54 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1952,4 +1952,17 @@ static inline int get_sysctl_max_map_count(void)
bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
unsigned long npages);
+/*
+ * Ensure @mm is on the init_mm.mmlist so swapoff can find it.
+ */
+static inline void ensure_on_mmlist(struct mm_struct *mm)
+{
+ if (list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&mm->mmlist))
+ list_add(&mm->mmlist, &init_mm.mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+}
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..33d7cc274e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -937,14 +937,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (swap_dup_entry_direct(entry) < 0)
return -EIO;
- /* make sure dst_mm is on swapoff's mmlist. */
- if (unlikely(list_empty(&dst_mm->mmlist))) {
- spin_lock(&mmlist_lock);
- if (list_empty(&dst_mm->mmlist))
- list_add(&dst_mm->mmlist,
- &src_mm->mmlist);
- spin_unlock(&mmlist_lock);
- }
+ ensure_on_mmlist(dst_mm);
/* Mark the swap entry as shared. */
if (pte_swp_exclusive(orig_pte)) {
pte = pte_swp_clear_exclusive(orig_pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367c..057e18cb80b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2302,12 +2302,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
- if (list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- if (list_empty(&mm->mmlist))
- list_add(&mm->mmlist, &init_mm.mmlist);
- spin_unlock(&mmlist_lock);
- }
+ ensure_on_mmlist(mm);
dec_mm_counter(mm, MM_ANONPAGES);
inc_mm_counter(mm, MM_SWAPENTS);
swp_pte = swp_entry_to_pte(entry);
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
` (11 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
pagemap_pmd_range_thp() assumes that every non-present PMD is a
migration entry and unconditionally calls softleaf_to_page(). This
will crash on any non-present PMD type that does not encode a PFN,
such as the upcoming PMD-level swap entries.
Guard the page lookup with softleaf_has_pfn(), matching how
pte_to_pagemap_entry() already handles non-present PTEs.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/proc/task_mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 751b9ba160fb..6d9f43881e62 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2042,8 +2042,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_SOFT_DIRTY;
if (pmd_swp_uffd_wp(pmd))
flags |= PM_UFFD_WP;
- VM_WARN_ON_ONCE(!pmd_is_migration_entry(pmd));
- page = softleaf_to_page(entry);
+ if (softleaf_has_pfn(entry))
+ page = softleaf_to_page(entry);
}
if (page) {
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (2 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
` (10 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
change_non_present_huge_pmd() calls softleaf_to_folio() unconditionally
at the top of the function. softleaf_to_folio() extracts a PFN from
the entry and converts it to a folio pointer, which is only meaningful
for migration and device_private entries that encode a real PFN.
A swap entry encodes a swap offset instead, so softleaf_to_folio()
would produce a bogus pointer and crash on mprotect() when a PMD swap
entry is present.
Move the call into the migration_write branch where the folio is
actually used, so the function is safe for any non-present PMD type.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 49da0746b8ca..d82a19b5e276 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2619,11 +2619,12 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
bool uffd_wp_resolve)
{
softleaf_t entry = softleaf_from_pmd(*pmd);
- const struct folio *folio = softleaf_to_folio(entry);
pmd_t newpmd;
VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
if (softleaf_is_migration_write(entry)) {
+ const struct folio *folio = softleaf_to_folio(entry);
+
/*
* A protection check is difficult so
* just be safe and disable write
--
2.52.0
* [PATCH 05/13] mm: add PMD swap entry detection support
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (3 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
` (9 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Currently when a PMD-mapped THP is swapped out, the PMD is always split
into 512 PTE-level swap entries. To preserve huge page information
across swap cycles, later patches will install a single PMD-level swap
entry instead. This patch adds the infrastructure to detect those
entries.
Teach the softleaf layer to recognise PMD swap entries:
pmd_is_swap_entry() detects them and softleaf_is_valid_pmd_entry()
accepts them as a valid non-present type. Clear the exclusive overlay
bit in softleaf_from_pmd() before decoding, matching how soft_dirty and
uffd_wp bits are already stripped.
Add pmd_swp_mkexclusive(), pmd_swp_exclusive(), and
pmd_swp_clear_exclusive() helpers to each architecture that supports
THP migration (x86, arm64, s390, riscv, loongarch), mirroring the
existing PTE swap exclusive helpers in each arch's pgtable.h.
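As an illustration of how these pieces compose (a hypothetical
example, not code from this patch), a non-present PMD walker can now
classify the entry like this:

  static bool pmd_maps_swapped_thp(pmd_t pmd)
  {
          softleaf_t entry;

          if (pmd_present(pmd))
                  return false;

          /*
           * softleaf_from_pmd() strips soft-dirty, uffd-wp and the
           * new exclusive overlay bit before decoding.
           */
          entry = softleaf_from_pmd(pmd);
          if (!softleaf_is_valid_pmd_entry(entry))
                  return false;

          /* Same result as pmd_is_swap_entry(pmd). */
          return softleaf_is_swap(entry);
  }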
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/include/asm/pgtable.h | 4 ++++
arch/loongarch/include/asm/pgtable.h | 17 +++++++++++++++++
arch/riscv/include/asm/pgtable.h | 15 +++++++++++++++
arch/s390/include/asm/pgtable.h | 15 +++++++++++++++
arch/x86/include/asm/pgtable.h | 15 +++++++++++++++
include/linux/leafops.h | 18 ++++++++++++++++--
6 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9029b81ccbe8..ecb0ef6994cb 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -601,6 +601,10 @@ static inline int pmd_protnone(pmd_t pmd)
#define pmd_swp_clear_uffd_wp(pmd) \
pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+#define pmd_swp_exclusive(pmd) pte_swp_exclusive(pmd_pte(pmd))
+#define pmd_swp_mkexclusive(pmd) pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)))
+#define pmd_swp_clear_exclusive(pmd) \
+ pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)))
#define pmd_write(pmd) pte_write(pmd_pte(pmd))
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 155f70e93460..f8e7761eb54e 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -345,6 +345,23 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return pte;
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ pmd_val(pmd) |= _PAGE_SWP_EXCLUSIVE;
+ return pmd;
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ pmd_val(pmd) &= ~_PAGE_SWP_EXCLUSIVE;
+ return pmd;
+}
+
#define pte_none(pte) (!(pte_val(pte) & ~_PAGE_GLOBAL))
#define pte_present(pte) (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE))
#define pte_no_exec(pte) (pte_val(pte) & _PAGE_NO_EXEC)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index a6e0eaba2653..f4cd59ebab58 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -935,6 +935,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return pte_swp_exclusive(pmd_pte(pmd));
+}
+
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
+}
+
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline bool pmd_soft_dirty(pmd_t pmd)
{
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 40a6fb19dd1d..9b05fd3e4df0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -868,6 +868,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return set_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return clear_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
static inline int pte_soft_dirty(pte_t pte)
{
return pte_val(pte) & _PAGE_SOFT_DIRTY;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 13e3e9a054cb..eb8b7a6f4bb4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1517,6 +1517,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pmd_swp_exclusive(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 803d312437df..79e04db45bfb 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -102,6 +102,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
pmd = pmd_swp_clear_soft_dirty(pmd);
if (pmd_swp_uffd_wp(pmd))
pmd = pmd_swp_clear_uffd_wp(pmd);
+ if (pmd_swp_exclusive(pmd))
+ pmd = pmd_swp_clear_exclusive(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
/* Temporary until swp_entry_t eliminated. */
@@ -634,9 +636,21 @@ static inline bool pmd_is_migration_entry(pmd_t pmd)
*/
static inline bool softleaf_is_valid_pmd_entry(softleaf_t entry)
{
- /* Only device private, migration entries valid for PMD. */
+ /* Device private, migration, and swap entries valid for PMD. */
return softleaf_is_device_private(entry) ||
- softleaf_is_migration(entry);
+ softleaf_is_migration(entry) ||
+ softleaf_is_swap(entry);
+}
+
+/**
+ * pmd_is_swap_entry() - Does this PMD entry encode an actual swap entry?
+ * @pmd: PMD entry.
+ *
+ * Returns: true if the PMD encodes a swap entry, otherwise false.
+ */
+static inline bool pmd_is_swap_entry(pmd_t pmd)
+{
+ return softleaf_is_swap(softleaf_from_pmd(pmd));
}
/**
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 06/13] mm: add PMD swap entry splitting support
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (4 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
` (8 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Add a swap branch in __split_huge_pmd_locked() that splits a PMD swap
entry into 512 PTE swap entries. Unlike migration splits, no folio
reference is needed because swap entries point to swap slots, not
pages. Each PTE inherits the correct sub-slot offset and preserves
soft_dirty, uffd_wp, and exclusive flags.
This branch is reached from the explicit __split_huge_pmd() callers
that hit a non-present PMD: partial-range mprotect / munmap, the
wp_huge_pmd() PMD-COW fallback, and the swap-in / swapoff fallbacks
added in later patches when the cached folio is no longer PMD-sized.
page_vma_mapped_walk() does not iterate PMD swap entries, so
try_to_unmap_one() and try_to_migrate_one() do not reach this
branch, and freeze=true cannot occur here today. page and folio are
therefore left uninitialized in the swap branch; a
VM_WARN_ON_ONCE(freeze) catches any future caller that breaks this
invariant before the freeze path dereferences page_to_pfn(page + i)
or put_page(page).
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/leafops.h | 6 +++---
mm/huge_memory.c | 27 ++++++++++++++++++++++++++-
2 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 79e04db45bfb..2c0dfce6d0f0 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -657,9 +657,9 @@ static inline bool pmd_is_swap_entry(pmd_t pmd)
* pmd_is_valid_softleaf() - Is this PMD entry a valid softleaf entry?
* @pmd: PMD entry.
*
- * PMD leaf entries are valid only if they are device private or migration
- * entries. This function asserts that a PMD leaf entry is valid in this
- * respect.
+ * PMD leaf entries are valid only if they are device private, migration,
+ * or swap entries. This function asserts that a PMD leaf entry is valid
+ * in this respect.
*
* Returns: true if the PMD entry is a valid leaf entry, otherwise false.
*/
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d82a19b5e276..9f67638e43c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3201,6 +3201,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
vma, haddr, rmap_flags);
}
+ } else if (pmd_is_swap_entry(*pmd)) {
+ VM_WARN_ON_ONCE(freeze);
+ old_pmd = *pmd;
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ anon_exclusive = pmd_swp_exclusive(old_pmd);
} else {
/*
* Up to this point the pmd is present and huge and userland has
@@ -3337,6 +3343,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
+ } else if (pmd_is_swap_entry(old_pmd)) {
+ softleaf_t sl_entry = softleaf_from_pmd(old_pmd);
+ pte_t swp_pte;
+ swp_entry_t sub_entry;
+
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR;
+ i++, addr += PAGE_SIZE) {
+ sub_entry = swp_entry(swp_type(sl_entry),
+ swp_offset(sl_entry) + i);
+ swp_pte = swp_entry_to_pte(sub_entry);
+ if (soft_dirty)
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (uffd_wp)
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (anon_exclusive)
+ swp_pte = pte_swp_mkexclusive(swp_pte);
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+ set_pte_at(mm, addr, pte + i, swp_pte);
+ }
} else {
pte_t entry;
@@ -3360,7 +3385,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
pte_unmap(pte);
- if (!pmd_is_migration_entry(*pmd))
+ if (!pmd_is_migration_entry(*pmd) && !pmd_is_swap_entry(*pmd))
folio_remove_rmap_pmd(folio, page, vma);
if (freeze)
put_page(page);
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 07/13] mm: handle PMD swap entries in fork path
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (5 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
` (7 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Teach copy_huge_pmd()/copy_huge_non_present_pmd() about swap entries,
mirroring copy_nonpresent_pte().
swap_dup_entry_direct() gains a nr parameter (and is renamed to
swap_dup_entries_direct()) so it can duplicate a contiguous range of
swap slots in one call, matching the existing
swap_put_entries_direct(entry, nr) API. Existing callers pass 1.
copy_huge_non_present_pmd() "copies" PMD swap entries during fork
instead of splitting, preserving the THP. This mirrors
copy_nonpresent_pte() which duplicates the swap slot refcount,
clears the exclusive bit on the source, and adds the destination
mm to mmlist. If swap_dup_entries_direct() fails (GFP_ATOMIC table
alloc), copy_huge_pmd() retries after swap_retry_table_alloc() with
GFP_KERNEL, matching the PTE retry in copy_pte_range(). The PMD is
stable across the retry because dup_mmap() holds write mmap_lock on
both mm_structs.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/swap.h | 4 ++--
mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++-----
mm/memory.c | 2 +-
mm/swapfile.c | 7 +++---
4 files changed, 53 insertions(+), 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1930f81e6be4..2f12c20baba1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -457,7 +457,7 @@ sector_t swap_folio_sector(struct folio *folio);
* All entries must be allocated by folio_alloc_swap(). And they must have
* a swap count > 1. See comments of folio_*_swap helpers for more info.
*/
-int swap_dup_entry_direct(swp_entry_t entry);
+int swap_dup_entries_direct(swp_entry_t entry, int nr);
void swap_put_entries_direct(swp_entry_t entry, int nr);
/*
@@ -501,7 +501,7 @@ static inline void free_swap_cache(struct folio *folio)
{
}
-static inline int swap_dup_entry_direct(swp_entry_t ent)
+static inline int swap_dup_entries_direct(swp_entry_t ent, int nr)
{
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f67638e43c8..42887cf518cd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1867,7 +1867,7 @@ bool touch_pmd(struct vm_area_struct *vma, unsigned long addr,
return false;
}
-static void copy_huge_non_present_pmd(
+static int copy_huge_non_present_pmd(
struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
@@ -1913,14 +1913,35 @@ static void copy_huge_non_present_pmd(
*/
folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
dst_vma, src_vma);
+ } else if (softleaf_is_swap(entry)) {
+ int err;
+
+ /*
+ * PMD swap entry: duplicate swap references and clear
+ * exclusive on source, matching copy_nonpresent_pte().
+ */
+ err = swap_dup_entries_direct(entry, HPAGE_PMD_NR);
+ if (err < 0)
+ return err;
+
+ ensure_on_mmlist(dst_mm);
+
+ if (pmd_swp_exclusive(pmd)) {
+ pmd = pmd_swp_clear_exclusive(pmd);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
}
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (softleaf_is_swap(entry))
+ add_mm_counter(dst_mm, MM_SWAPENTS, HPAGE_PMD_NR);
+ else
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
if (!userfaultfd_wp(dst_vma))
pmd = pmd_swp_clear_uffd_wp(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ return 0;
}
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -1961,6 +1982,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pgtable))
goto out;
+retry:
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -1968,10 +1990,28 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
- if (unlikely(thp_migration_supported() &&
- pmd_is_valid_softleaf(pmd))) {
- copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
- dst_vma, src_vma, pmd, pgtable);
+ if (unlikely(pmd_is_valid_softleaf(pmd))) {
+ ret = copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+ addr, dst_vma, src_vma, pmd,
+ pgtable);
+ if (ret) {
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ /*
+ * For PMD swap entries -ENOMEM means the per-cluster
+ * swap-extend table couldn't be GFP_ATOMIC-allocated.
+ * try the GFP_KERNEL fallback once before giving up.
+ */
+ if (ret == -ENOMEM) {
+ softleaf_t entry = softleaf_from_pmd(pmd);
+
+ if (softleaf_is_swap(entry) &&
+ !swap_retry_table_alloc(entry, GFP_KERNEL))
+ goto retry;
+ }
+ pte_free(dst_mm, pgtable);
+ goto out;
+ }
ret = 0;
goto out_unlock;
}
diff --git a/mm/memory.c b/mm/memory.c
index 33d7cc274e23..8aa90afd601a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *page;
if (likely(softleaf_is_swap(entry))) {
- if (swap_dup_entry_direct(entry) < 0)
+ if (swap_dup_entries_direct(entry, 1) < 0)
return -EIO;
ensure_on_mmlist(dst_mm);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c7e173b93e11..390f191be9a6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3801,8 +3801,9 @@ void si_swapinfo(struct sysinfo *val)
}
/*
- * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
+ * swap_dup_entries_direct() - Increase reference count of swap entries by one.
* @entry: first swap entry from which we want to increase the refcount.
+ * @nr: number of contiguous swap entries to duplicate.
*
* Returns 0 for success, or -ENOMEM if the extend table is required
* but could not be atomically allocated. Returns -EINVAL if the swap
@@ -3814,7 +3815,7 @@ void si_swapinfo(struct sysinfo *val)
* Also the swap entry must have a count >= 1. Otherwise folio_dup_swap should
* be used.
*/
-int swap_dup_entry_direct(swp_entry_t entry)
+int swap_dup_entries_direct(swp_entry_t entry, int nr)
{
struct swap_info_struct *si;
@@ -3831,7 +3832,7 @@ int swap_dup_entry_direct(swp_entry_t entry)
*/
VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry));
- return swap_dup_entries_cluster(si, swp_offset(entry), 1);
+ return swap_dup_entries_cluster(si, swp_offset(entry), nr);
}
#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (6 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
` (6 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Add unuse_pmd() and call it from unuse_pmd_range() to swap in
PMD-level swap entries as whole THPs during swapoff. This mirrors
the existing unuse_pte_range() but operates at PMD granularity.
If the PMD-order folio cannot be allocated, if the cached folio is
no longer PMD-sized (e.g. split in the swap cache by
deferred_split_scan() or memory_failure() while the PMD swap entry
was installed), or if the folio is not uptodate, the PMD swap entry
is split into PTE-level entries via __split_huge_pmd() and a
non-zero error is returned, so unuse_pmd_range() falls through to
unuse_pte_range(), which handles the individual entries at order-0.
swapin_alloc_pmd_folio() is a separate function in swap_state.c
as it will be reused in swapin in a later patch.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/swap.h | 7 +++
mm/swap_state.c | 35 +++++++++++++
mm/swapfile.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 179 insertions(+)
diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b..76752df71693 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -301,6 +301,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
struct folio *swapin_folio(swp_entry_t entry, struct folio *folio);
+struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, struct mm_struct *mm);
void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr);
@@ -438,6 +439,12 @@ static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
return NULL;
}
+static inline struct folio *swapin_alloc_pmd_folio(swp_entry_t entry,
+ struct mm_struct *mm)
+{
+ return NULL;
+}
+
static inline void swap_update_readahead(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1415a5c54a43..c2e8c76658f5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -584,6 +584,41 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
return swapcache;
}
+#ifdef CONFIG_THP_SWAP
+/**
+ * swapin_alloc_pmd_folio - allocate, charge, and read a PMD-sized swap folio.
+ * @entry: starting swap entry to swap in
+ * @mm: mm to charge for the swap-in
+ *
+ * Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, and
+ * issue the swap-in via swapin_folio(). Used by callers that need to map a
+ * PMD swap entry as a whole THP (PMD swapoff).
+ *
+ * Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in
+ * which case the caller should fall back to splitting the PMD).
+ */
+struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, struct mm_struct *mm)
+{
+ struct folio *folio;
+
+ folio = folio_alloc(GFP_HIGHUSER_MOVABLE, HPAGE_PMD_ORDER);
+ if (!folio)
+ return NULL;
+
+ if (mem_cgroup_swapin_charge_folio(folio, mm, GFP_KERNEL, entry)) {
+ folio_put(folio);
+ return NULL;
+ }
+
+ if (!swapin_folio(entry, folio)) {
+ folio_put(folio);
+ return NULL;
+ }
+
+ return folio;
+}
+#endif /* CONFIG_THP_SWAP */
+
/*
* Locate a page of swap in physical memory, reserving swap cache space
* and reading the disk if it is not already cached.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 390f191be9a6..7256edf4ce66 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
#include <linux/suspend.h>
#include <linux/zswap.h>
#include <linux/plist.h>
+#include <linux/huge_mm.h>
#include <asm/tlbflush.h>
#include <linux/leafops.h>
@@ -2519,6 +2520,130 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
return 0;
}
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio. Returns 0 on success
+ * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse will
+ * rescan). Returns -EIO if the folio is not uptodate; in that case the
+ * PMD is split so unuse_pte_range() can handle individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr, softleaf_t entry,
+ struct folio *folio)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ pmd_t new_pmd, old_pmd;
+ spinlock_t *ptl;
+ rmap_t rmap_flags = RMAP_NONE;
+ bool exclusive;
+
+ if (unlikely(!folio_matches_swap_entry(folio, entry)))
+ return -EAGAIN;
+
+ if (unlikely(!folio_test_uptodate(folio))) {
+ __split_huge_pmd(vma, pmd, addr, false);
+ return -EIO;
+ }
+
+ page = folio_page(folio, 0);
+
+ ptl = pmd_lock(mm, pmd);
+ old_pmd = pmdp_get(pmd);
+
+ if (!pmd_is_swap_entry(old_pmd) ||
+ softleaf_from_pmd(old_pmd).val != entry.val) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+
+ exclusive = pmd_swp_exclusive(old_pmd);
+
+ /*
+ * Some architectures may have to restore extra metadata to the folio
+ * when reading from swap. This metadata may be indexed by swap entry
+ * so this must be called before folio_put_swap().
+ */
+ arch_swap_restore(folio_swap(entry, folio), folio);
+
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+ new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ new_pmd = pmd_mkold(new_pmd);
+ if (pmd_swp_soft_dirty(old_pmd))
+ new_pmd = pmd_mksoft_dirty(new_pmd);
+ if (pmd_swp_uffd_wp(old_pmd))
+ new_pmd = pmd_mkuffd_wp(new_pmd);
+
+ if (exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+
+ folio_get(folio);
+ if (!folio_test_anon(folio))
+ folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+ else
+ folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+ set_pmd_at(mm, addr, pmd, new_pmd);
+ folio_put_swap(folio, NULL);
+
+ spin_unlock(ptl);
+
+ folio_free_swap(folio);
+ return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * Returns -ENOMEM if the PMD-order folio could not be allocated/charged,
+ * -EIO if swap-in failed, or -EAGAIN if the cached folio is no longer
+ * PMD-sized; in all of these the PMD is split so the caller can fall
+ * back to unuse_pte_range(). Otherwise propagates the error from
+ * unuse_pmd().
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr, softleaf_t entry)
+{
+ struct folio *folio;
+ int ret;
+
+ folio = swap_cache_get_folio(entry);
+ if (!folio) {
+ folio = swapin_alloc_pmd_folio(entry, vma->vm_mm);
+ if (!folio) {
+ ret = -ENOMEM;
+ goto split_fallback;
+ }
+ }
+
+ folio_lock(folio);
+ folio_wait_writeback(folio);
+ /*
+ * If the cached folio is no longer PMD-sized (e.g. split in the
+ * swap cache by deferred_split_scan() or memory_failure() while
+ * the PMD swap entry was installed), the PMD swap entry no longer
+ * maps a single contiguous folio. Split the PMD swap entry so
+ * unuse_pte_range() can swap the per-slot folios in individually.
+ */
+ if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+ folio_unlock(folio);
+ folio_put(folio);
+ ret = -EAGAIN;
+ goto split_fallback;
+ }
+ ret = unuse_pmd(vma, pmd, addr, entry, folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ return ret;
+
+split_fallback:
+ __split_huge_pmd(vma, pmd, addr, false);
+ return ret;
+}
+
static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
unsigned int type)
@@ -2531,6 +2656,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
do {
cond_resched();
next = pmd_addr_end(addr, end);
+
+ pmd_t pmdval = pmdp_get(pmd);
+
+ if (pmd_is_swap_entry(pmdval)) {
+ softleaf_t sl = softleaf_from_pmd(pmdval);
+
+ if (swp_type(sl) == type) {
+ if (!unuse_pmd_entry(vma, pmd, addr, sl))
+ continue;
+ }
+ }
+
ret = unuse_pte_range(vma, pmd, addr, next, type);
if (ret)
return ret;
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (7 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
` (5 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Teach the remaining non-present PMD walkers about swap entries,
mirroring the PTE-level equivalents.
smaps_pmd_entry() accounts swap and swap_pss via a new shared
smaps_account_swap() helper used by both PTE and PMD paths.
zap_huge_pmd() frees swap slots via swap_put_entries_direct(),
matching zap_nonpresent_ptes().
change_non_present_huge_pmd() skips write-permission changes for swap
entries and only updates uffd_wp, matching change_softleaf_pte().
move_soft_dirty_pmd(), clear_soft_dirty_pmd(), make_uffd_wp_pmd(),
pagemap_pmd_range_thp(), and change_huge_pmd() handle swap entries
alongside migration entries.
madvise_cold_or_pageout_pmd_range() extends its non-present PMD
VM_BUG_ON to allow swap entries; without this, hitting a PMD swap
entry on a DEBUG_VM kernel would BUG().
queue_folios_pmd() in mempolicy silently skips swap entries, matching
the PTE walker which only counts migration entries as failures.
Without this, mbind(MPOL_MF_STRICT) would spuriously return -EIO on
a swapped-out THP.
madvise_free_huge_pmd() handles PMD swap entries directly: for a
full-range MADV_FREE it clears the PMD, frees the deposited page
table, and releases the swap slots; for a partial range it splits to
PTE swap entries. Without this, MADV_FREE silently becomes a no-op
on swapped-out THPs, leaking swap slots.
hmm_vma_handle_absent_pmd() faults in PMD swap entries via
hmm_vma_fault() instead of returning -EFAULT. The first per-page
handle_mm_fault() call triggers do_huge_pmd_swap_page(), which maps
the entire folio; subsequent calls become harmless
huge_pmd_set_accessed() and the walker retries with a present PMD.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/proc/task_mmu.c | 43 +++++++++++++++++++++-------------
mm/hmm.c | 3 ++-
mm/huge_memory.c | 58 +++++++++++++++++++++++++++++++++++-----------
mm/khugepaged.c | 6 +++++
mm/madvise.c | 5 ++--
mm/mempolicy.c | 2 ++
6 files changed, 85 insertions(+), 32 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6d9f43881e62..a6dd91d4cf24 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1015,6 +1015,23 @@ static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
#endif
}
+static void smaps_account_swap(struct mem_size_stats *mss,
+ softleaf_t entry, unsigned long size)
+{
+ int mapcount;
+
+ mss->swap += size;
+ mapcount = swp_swapcount(entry);
+ if (mapcount >= 2) {
+ u64 pss_delta = (u64)size << PSS_SHIFT;
+
+ do_div(pss_delta, mapcount);
+ mss->swap_pss += pss_delta;
+ } else {
+ mss->swap_pss += (u64)size << PSS_SHIFT;
+ }
+}
+
static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct mm_walk *walk)
{
@@ -1036,18 +1053,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
const softleaf_t entry = softleaf_from_pte(ptent);
if (softleaf_is_swap(entry)) {
- int mapcount;
-
- mss->swap += PAGE_SIZE;
- mapcount = swp_swapcount(entry);
- if (mapcount >= 2) {
- u64 pss_delta = (u64)PAGE_SIZE << PSS_SHIFT;
-
- do_div(pss_delta, mapcount);
- mss->swap_pss += pss_delta;
- } else {
- mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
- }
+ smaps_account_swap(mss, entry, PAGE_SIZE);
} else if (softleaf_has_pfn(entry)) {
if (softleaf_is_device_private(entry))
present = true;
@@ -1077,9 +1083,13 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
if (pmd_present(*pmd)) {
page = vm_normal_page_pmd(vma, addr, *pmd);
present = true;
- } else if (unlikely(thp_migration_supported())) {
+ } else {
const softleaf_t entry = softleaf_from_pmd(*pmd);
+ if (softleaf_is_swap(entry)) {
+ smaps_account_swap(mss, entry, HPAGE_PMD_SIZE);
+ return;
+ }
if (softleaf_has_pfn(entry))
page = softleaf_to_page(entry);
}
@@ -1665,7 +1675,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
pmd = pmd_clear_soft_dirty(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
- } else if (pmd_is_migration_entry(pmd)) {
+ } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
pmd = pmd_swp_clear_soft_dirty(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
@@ -2025,7 +2035,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_UFFD_WP;
if (pm->show_pfn)
frame = pmd_pfn(pmd) + idx;
- } else if (thp_migration_supported()) {
+ } else if (pmd_is_swap_entry(pmd) ||
+ (thp_migration_supported() && pmd_is_migration_entry(pmd))) {
const softleaf_t entry = softleaf_from_pmd(pmd);
unsigned long offset;
@@ -2463,7 +2474,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
old = pmdp_invalidate_ad(vma, addr, pmdp);
pmd = pmd_mkuffd_wp(old);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
- } else if (pmd_is_migration_entry(pmd)) {
+ } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
pmd = pmd_swp_mkuffd_wp(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83d..2bd3ebd1b8d6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -370,7 +370,8 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
npages, 0);
if (required_fault) {
- if (softleaf_is_device_private(entry))
+ if (softleaf_is_device_private(entry) ||
+ softleaf_is_swap(entry))
return hmm_vma_fault(addr, end, required_fault, walk);
else
return -EFAULT;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42887cf518cd..109e4dc4a167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2375,6 +2375,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pte_free(mm, pgtable);
+ mm_dec_nr_ptes(mm);
+}
/*
* Return true if we do MADV_FREE successfully on entire pmd page.
* Otherwise, return false.
@@ -2399,8 +2407,23 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
goto out;
if (unlikely(!pmd_present(orig_pmd))) {
+ if (pmd_is_swap_entry(orig_pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE) {
+ spin_unlock(ptl);
+ __split_huge_pmd(vma, pmd, addr, false);
+ goto out_unlocked;
+ }
+ softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+ pmdp_huge_get_and_clear(mm, addr, pmd);
+ zap_deposited_table(mm, pmd);
+ spin_unlock(ptl);
+ swap_put_entries_direct(sl, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+ return true;
+ }
VM_BUG_ON(thp_migration_supported() &&
- !pmd_is_migration_entry(orig_pmd));
+ !pmd_is_migration_entry(orig_pmd));
goto out;
}
@@ -2449,15 +2472,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
}
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
- pgtable_t pgtable;
-
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pte_free(mm, pgtable);
- mm_dec_nr_ptes(mm);
-}
-
static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t pmdval, struct folio *folio, bool is_present)
{
@@ -2550,6 +2564,16 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
arch_check_zapped_pmd(vma, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ if (pmd_is_swap_entry(orig_pmd)) {
+ softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+ zap_deposited_table(mm, pmd);
+ spin_unlock(ptl);
+ swap_put_entries_direct(sl, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+ return true;
+ }
+
is_present = pmd_present(orig_pmd);
folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
@@ -2582,7 +2606,8 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
static pmd_t move_soft_dirty_pmd(pmd_t pmd)
{
if (pgtable_supports_soft_dirty()) {
- if (unlikely(pmd_is_migration_entry(pmd)))
+ if (unlikely(pmd_is_migration_entry(pmd) ||
+ pmd_is_swap_entry(pmd)))
pmd = pmd_swp_mksoft_dirty(pmd);
else if (pmd_present(pmd))
pmd = pmd_mksoft_dirty(pmd);
@@ -2662,7 +2687,14 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
pmd_t newpmd;
VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
- if (softleaf_is_migration_write(entry)) {
+
+ /*
+ * PMD swap entries don't encode write permission in the entry type,
+ * so only uffd_wp flag changes apply. No folio lookup needed.
+ */
+ if (softleaf_is_swap(entry)) {
+ newpmd = *pmd;
+ } else if (softleaf_is_migration_write(entry)) {
const struct folio *folio = softleaf_to_folio(entry);
/*
@@ -2719,7 +2751,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (!ptl)
return 0;
- if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) {
+ if (pmd_is_valid_softleaf(*pmd)) {
change_non_present_huge_pmd(mm, addr, pmd, uffd_wp,
uffd_wp_resolve);
goto unlock;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..a7cc65c3d06a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -950,6 +950,12 @@ static inline enum scan_result check_pmd_state(pmd_t *pmd)
*/
if (pmd_is_migration_entry(pmde))
return SCAN_PMD_MAPPED;
+ /*
+ * A PMD-mapped THP that has been swapped out is still a THP from
+ * khugepaged's perspective; treat it like a present huge PMD.
+ */
+ if (pmd_is_swap_entry(pmde))
+ return SCAN_PMD_MAPPED;
if (!pmd_present(pmde))
return SCAN_NO_PTE_TABLE;
if (pmd_trans_huge(pmde))
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..2702eb0b1134 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -390,7 +390,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (unlikely(!pmd_present(orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
- !pmd_is_migration_entry(orig_pmd));
+ !pmd_is_migration_entry(orig_pmd) &&
+ !pmd_is_swap_entry(orig_pmd));
goto huge_unlock;
}
@@ -666,7 +667,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
int nr, max_nr;
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd))
+ if (pmd_trans_huge(*pmd) || pmd_is_swap_entry(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
return 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..55b38fe13a63 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -658,6 +658,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
qp->nr_failed++;
return;
}
+ if (unlikely(pmd_is_swap_entry(*pmd)))
+ return;
folio = pmd_folio(*pmd);
if (is_huge_zero_folio(folio)) {
walk->action = ACTION_CONTINUE;
--
2.52.0
* [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (8 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
@ 2026-04-27 10:01 ` Usama Arif
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
` (4 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:01 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
move_pages_huge_pmd() returns -ENOENT for any non-trans_huge,
non-migration PMD, which fails an aligned UFFDIO_MOVE on a
swapped-out THP -- yet the PMD swap entry is a perfectly valid
mapping that should move as a whole. Splitting via the
move_pages_ptes() fallback isn't a
substitute either: __split_huge_pmd_locked() splits a PMD swap entry
into HPAGE_PMD_NR PTE swap entries pointing at the same swap-cache
folio, but move_swap_pte() refuses any swap-cache folio that is still
large and returns -EBUSY.
Add move_swap_pmd(), modeled on move_swap_pte(), that moves the swap
entry whole-PMD and re-anchors the swap-cache folio's anon rmap to
the destination VMA. Reject !pmd_swp_exclusive() entries with -EBUSY
to preserve UFFDIO_MOVE's single-owner semantics, propagate
soft-dirty, and carry the deposited page table across with the
entry.
The dispatcher in move_pages_huge_pmd() now waits for migration on a
PMD migration entry (matching the PTE path) and routes PMD swap
entries through move_swap_pmd() after pinning the swap device,
fetching and locking any cached folio, and arming an mmu_notifier
range so secondary MMUs see the move.
If the swap-cache folio was split (e.g. by deferred_split_scan or
memory_failure) between swap-out and UFFDIO_MOVE, src_folio is no
longer PMD-sized but the PMD swap entry still covers all 512 slots.
Moving the entry whole would only re-anchor one folio's anon rmap,
leaving the other 511 with a stale anon_vma. Return -EBUSY in this
case, matching move_pages_pte()'s rejection of large folios, so the
caller falls back to PTE-level moves.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 112 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 109e4dc4a167..bfcc9b274be7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2871,6 +2871,62 @@ int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
#endif
#ifdef CONFIG_USERFAULTFD
+/*
+ * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are
+ * acquired here; src_folio (if present) must already be locked. The deposited
+ * page table backing the source THP is moved across with the entry.
+ */
+static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ pmd_t orig_dst_pmd, pmd_t orig_src_pmd,
+ spinlock_t *dst_ptl, spinlock_t *src_ptl,
+ struct folio *src_folio, swp_entry_t entry)
+{
+ pgtable_t src_pgtable;
+ pmd_t moved_pmd;
+
+ /*
+ * The folio may have been freed and reused for a different swap entry
+ * while it was unlocked. Re-verify the association.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
+ double_pt_lock(dst_ptl, src_ptl);
+
+ if (!pmd_same(*src_pmd, orig_src_pmd) ||
+ !pmd_same(*dst_pmd, orig_dst_pmd)) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ /*
+ * If the folio is in the swap cache, re-anchor its anon rmap to the
+ * destination VMA so a future swap-in fault at dst_addr finds it.
+ * Otherwise, re-check that no folio was newly inserted under us.
+ */
+ if (src_folio) {
+ folio_move_anon_rmap(src_folio, dst_vma);
+ src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else if (swap_cache_has_folio(entry)) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
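+ /*
+ * Move the entry: clear the source PMD, mark the destination
+ * soft-dirty (the move changes dst's contents), and carry the
+ * deposited page table across so a later split at dst still
+ * has a PTE table to consume.
+ */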
+ moved_pmd = pmdp_huge_get_and_clear(mm, src_addr, src_pmd);
+ if (pgtable_supports_soft_dirty())
+ moved_pmd = pmd_swp_mksoft_dirty(moved_pmd);
+ set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd);
+
+ src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+ pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+
+ double_pt_unlock(dst_ptl, src_ptl);
+ return 0;
+}
+
/*
* The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
* the caller, but it must return after releasing the page_table_lock.
@@ -2905,11 +2961,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
}
if (!pmd_trans_huge(src_pmdval)) {
- spin_unlock(src_ptl);
if (pmd_is_migration_entry(src_pmdval)) {
+ spin_unlock(src_ptl);
pmd_migration_entry_wait(mm, &src_pmdval);
return -EAGAIN;
}
+ if (pmd_is_swap_entry(src_pmdval)) {
+ swp_entry_t entry;
+ struct swap_info_struct *si;
+
+ /*
+ * UFFDIO_MOVE on anon mappings requires single-owner
+ * semantics; refuse to move a shared swap entry.
+ */
+ if (!pmd_swp_exclusive(src_pmdval)) {
+ spin_unlock(src_ptl);
+ return -EBUSY;
+ }
+
+ entry = softleaf_from_pmd(src_pmdval);
+ spin_unlock(src_ptl);
+
+ /* Pin the swap device against a racing swapoff. */
+ si = get_swap_device(entry);
+ if (unlikely(!si))
+ return -EAGAIN;
+
+ src_folio = swap_cache_get_folio(entry);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+ mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ if (src_folio) {
+ folio_lock(src_folio);
+ if (folio_nr_pages(src_folio) != HPAGE_PMD_NR) {
+ err = -EBUSY;
+ folio_unlock(src_folio);
+ folio_put(src_folio);
+ mmu_notifier_invalidate_range_end(&range);
+ put_swap_device(si);
+ return err;
+ }
+ }
+
+ dst_ptl = pmd_lockptr(mm, dst_pmd);
+ err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
+ dst_pmd, src_pmd, dst_pmdval,
+ src_pmdval, dst_ptl, src_ptl,
+ src_folio, entry);
+
+ mmu_notifier_invalidate_range_end(&range);
+ if (src_folio) {
+ folio_unlock(src_folio);
+ folio_put(src_folio);
+ }
+ put_swap_device(si);
+ return err;
+ }
+ spin_unlock(src_ptl);
return -ENOENT;
}
--
2.52.0
* [PATCH 11/13] mm: handle PMD swap entry faults on swap-in
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (9 preceding siblings ...)
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
@ 2026-04-27 10:02 ` Usama Arif
2026-04-27 10:02 ` [PATCH 12/13] mm: install PMD swap entries on swap-out Usama Arif
` (3 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:02 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault()
when vmf->orig_pmd encodes a swap entry. The handler resolves the
entire 2 MB mapping in one shot, mirroring do_swap_page() (PTE path)
at PMD granularity:
- Look up the folio in the swap cache; on a miss, allocate a
PMD-order folio and read from swap (shared with unuse_pmd_entry()
via swapin_alloc_pmd_folio() in mm/swap_state.c).
- After locking, re-validate that the folio still corresponds to our
entry and is still PMD-sized. Between the unlocked cache lookup
and the lock, a racing swap-in on the same entry may have removed
it from the cache via folio_free_swap(), or reclaim / memory_failure
/ deferred-split may have split the folio into smaller folios.
- Restore soft_dirty and uffd_wp from the swap PMD. Map writable
only when the entry was exclusive, the VMA permits writes, and
uffd-wp is not armed. Drop the exclusive marker when the cached
folio is under writeback to an SWP_STABLE_WRITES backend (zram,
encrypted) so the PMD is mapped read-only; a later write COWs
into a fresh folio rather than corrupting the in-flight writeback.
Mirrors do_swap_page().
- When the resulting PMD is read-only but the fault was a write,
update vmf->orig_pmd and call wp_huge_pmd() in the same handler
to COW immediately rather than forcing a second fault. Mask
VM_FAULT_FALLBACK from its return: a PMD-COW that splits to
PTE-level is normal, but the bit is part of VM_FAULT_ERROR and
arch fault handlers BUG() on it without SIGBUS/HWPOISON/SIGSEGV.
Requires exposing wp_huge_pmd() via mm/internal.h.
- Free the swap slot via should_try_to_free_swap() (hoisted from
mm/memory.c into mm/internal.h so PTE- and PMD-level swap-in
share the heuristic).
When PMD-order resources are unavailable (folio allocation fails,
the cached folio was split, memcg charge fails, or swapin_folio()
races) split the PMD swap entry into 512 PTE swap entries via
__split_huge_pmd() and return 0. The fault retries and do_swap_page()
takes over per-PTE. This avoids returning VM_FAULT_OOM for transient
PMD-order allocation failures.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 9 ++
mm/huge_memory.c | 197 ++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 36 ++++++++
mm/memory.c | 40 +-------
mm/swap_state.c | 2 +-
5 files changed, 247 insertions(+), 37 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..93ee6c36d6ea 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -522,6 +522,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+#ifdef CONFIG_THP_SWAP
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+#else
+static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+ return 0;
+}
+#endif
+
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bfcc9b274be7..141ab45adee4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2375,6 +2375,203 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+#ifdef CONFIG_THP_SWAP
+/**
+ * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry.
+ * @vmf: Fault context. vmf->orig_pmd contains the swap PMD.
+ *
+ * Looks up the folio in the swap cache, and if it is a PMD-sized folio,
+ * maps it directly at the PMD level. If the folio is not in the swap
+ * cache, allocates a PMD-sized folio and reads from swap. On allocation
+ * failure, splits the PMD swap entry into PTE-level entries and retries
+ * at PTE granularity.
+ *
+ * Return: VM_FAULT_* flags.
+ */
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio;
+ struct page *page;
+ struct swap_info_struct *si;
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ softleaf_t entry;
+ swp_entry_t swp_entry;
+ pmd_t pmd;
+ vm_fault_t ret = 0;
+ bool exclusive;
+ rmap_t rmap_flags = RMAP_NONE;
+
+ entry = softleaf_from_pmd(vmf->orig_pmd);
+ if (unlikely(!softleaf_is_swap(entry)))
+ return 0;
+
+ swp_entry = entry;
+
+ /* Prevent swapoff from happening to us. */
+ si = get_swap_device(swp_entry);
+ if (unlikely(!si))
+ return 0;
+
+ folio = swap_cache_get_folio(swp_entry);
+ if (!folio) {
+ folio = swapin_alloc_pmd_folio(swp_entry, mm);
+ if (!folio)
+ goto split_fallback;
+
+ /* Had to read from swap area: Major fault */
+ ret = VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(mm, PGMAJFAULT);
+ }
+
+ ret |= folio_lock_or_retry(folio, vmf);
+ if (ret & VM_FAULT_RETRY)
+ goto out_release;
+
+ /* Verify the folio is still in swap cache and matches our entry */
+ if (unlikely(!folio_matches_swap_entry(folio, swp_entry)))
+ goto out_page;
+
+ /*
+ * Folio should be PMD-sized; if not (e.g. split in swap cache),
+ * split the PMD swap entry and retry at PTE level.
+ */
+ if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+ folio_unlock(folio);
+ folio_put(folio);
+ goto split_fallback;
+ }
+
+ if (unlikely(!folio_test_uptodate(folio))) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_page;
+ }
+
+ page = folio_page(folio, 0);
+ arch_swap_restore(folio_swap(swp_entry, folio), folio);
+
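+ /*
+ * Drain per-CPU LRU batches: a pending batch holds a folio
+ * reference that would defeat the refcount-based exclusivity
+ * check below.
+ */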
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio))
+ lru_add_drain();
+
+ folio_throttle_swaprate(folio, GFP_KERNEL);
+
+ /* Lock the PMD and verify it hasn't changed */
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) {
+ spin_unlock(vmf->ptl);
+ goto out_page;
+ }
+
+ exclusive = pmd_swp_exclusive(vmf->orig_pmd);
+
+ /*
+ * Some swap backends (e.g. zram) don't support concurrent page
+ * modifications while under writeback. If we map exclusive on such
+ * a backend while the folio is still under writeback, the writeback
+ * may see partial modifications and corrupt the swap slot. Drop the
+ * exclusive marker and only map R/O for that case; further GUP
+ * references can't appear once the page is fully unmapped, so this
+ * is safe.
+ */
+ if (exclusive && folio_test_writeback(folio) &&
+ data_race(si->flags & SWP_STABLE_WRITES))
+ exclusive = false;
+
+ /*
+ * Set up the PMD mapping. Similar to do_swap_page() but at PMD level.
+ */
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+ pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ pmd = pmd_mkyoung(pmd);
+
+ if (pmd_swp_soft_dirty(vmf->orig_pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(vmf->orig_pmd))
+ pmd = pmd_mkuffd_wp(pmd);
+
+ /*
+ * Check exclusivity to determine if we can map writable.
+ */
+ if (exclusive || folio_ref_count(folio) == 1) {
+ if ((vma->vm_flags & VM_WRITE) &&
+ !userfaultfd_huge_pmd_wp(vma, pmd) &&
+ !pmd_needs_soft_dirty_wp(vma, pmd)) {
+ pmd = pmd_mkwrite(pmd, vma);
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ pmd = pmd_mkdirty(pmd);
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+ }
+ }
+ rmap_flags |= RMAP_EXCLUSIVE;
+ }
+
+ flush_icache_pages(vma, page, HPAGE_PMD_NR);
+
+ if (!folio_test_anon(folio))
+ folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags);
+ else
+ folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags);
+
+ folio_put_swap(folio, NULL);
+
+ set_pmd_at(mm, haddr, vmf->pmd, pmd);
+ update_mmu_cache_pmd(vma, haddr, vmf->pmd);
+
+ /* Update orig_pmd for any follow-up wp_huge_pmd() below. */
+ vmf->orig_pmd = pmd;
+
+ /*
+ * Conditionally try to free up the swap cache. Do it after mapping,
+ * so raced page faults will likely see the folio in swap cache and
+ * wait on the folio lock.
+ */
+ if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+ folio_free_swap(folio);
+
+ spin_unlock(vmf->ptl);
+
+ folio_unlock(folio);
+ put_swap_device(si);
+
+ /*
+ * If the write fault wasn't satisfied above (folio is shared without
+ * exclusivity), fall through to wp_huge_pmd to handle COW or
+ * userfaultfd-wp without forcing a second fault.
+ *
+ * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the
+ * PMD; that's a normal outcome — the natural PTE-level refault will
+ * complete the COW. Mask it so callers (and the arch fault handler)
+ * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR.
+ */
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ vm_fault_t wp_ret = wp_huge_pmd(vmf);
+
+ wp_ret &= ~VM_FAULT_FALLBACK;
+ ret |= wp_ret;
+ if (ret & VM_FAULT_ERROR)
+ ret &= VM_FAULT_ERROR;
+ }
+
+ return ret;
+
+out_page:
+ folio_unlock(folio);
+out_release:
+ folio_put(folio);
+ put_swap_device(si);
+ return ret;
+
+split_fallback:
+ __split_huge_pmd(vma, vmf->pmd, haddr, false);
+ put_swap_device(si);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
pgtable_t pgtable;
diff --git a/mm/internal.h b/mm/internal.h
index 7de489689f54..c522bff72688 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -508,6 +508,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
}
vm_fault_t do_swap_page(struct vm_fault *vmf);
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf);
+
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if swap
+ * count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+ struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned int extra_refs,
+ unsigned int fault_flags)
+{
+ if (!folio_test_swapcache(folio))
+ return false;
+ /*
+ * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
+ * cache can help save some IO or memory overhead, but these devices
+ * are fast, and meanwhile, swap cache pinning the slot deferring the
+ * release of metadata or fragmentation is a more critical issue.
+ */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ return true;
+ if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
+ folio_test_mlocked(folio))
+ return true;
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * user. Try freeing the swapcache to get rid of the swapcache
+ * reference only in case it's likely that we'll be the exclusive user.
+ */
+ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
+ folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+}
+
void folio_rotate_reclaimable(struct folio *folio);
bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
diff --git a/mm/memory.c b/mm/memory.c
index 8aa90afd601a..3006e1bc2bd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4481,40 +4481,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
-/*
- * Check if we should call folio_free_swap to free the swap cache.
- * folio_free_swap only frees the swap cache to release the slot if swap
- * count is zero, so we don't need to check the swap count here.
- */
-static inline bool should_try_to_free_swap(struct swap_info_struct *si,
- struct folio *folio,
- struct vm_area_struct *vma,
- unsigned int extra_refs,
- unsigned int fault_flags)
-{
- if (!folio_test_swapcache(folio))
- return false;
- /*
- * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
- * cache can help save some IO or memory overhead, but these devices
- * are fast, and meanwhile, swap cache pinning the slot deferring the
- * release of metadata or fragmentation is a more critical issue.
- */
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
- return true;
- if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
- folio_test_mlocked(folio))
- return true;
- /*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * user. Try freeing the swapcache to get rid of the swapcache
- * reference only in case it's likely that we'll be the exclusive user.
- */
- return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
-}
-
static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -6233,8 +6199,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
return VM_FAULT_FALLBACK;
}
-/* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
@@ -6518,6 +6483,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (pmd_is_migration_entry(vmf.orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
+ else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+ pmd_is_swap_entry(vmf.orig_pmd))
+ return do_huge_pmd_swap_page(&vmf);
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c2e8c76658f5..19c6759006bb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -592,7 +592,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
*
* Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, and
* issue the swap-in via swapin_folio(). Used by callers that need to map a
- * PMD swap entry as a whole THP (PMD swapoff).
+ * PMD swap entry as a whole THP (PMD swap-in fault and swapoff).
*
* Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in
* which case the caller should fall back to splitting the PMD).
--
2.52.0
* [PATCH 12/13] mm: install PMD swap entries on swap-out
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (10 preceding siblings ...)
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
@ 2026-04-27 10:02 ` Usama Arif
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
` (2 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:02 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap
entries before unmap, losing the huge mapping across the swap
round-trip and forcing khugepaged to rebuild it later. The
contiguous swap range was already secured when the folio was added
to the swap cache (a non-contiguous allocation would have split the
folio earlier), so the PMD can be replaced by a single PMD-level
swap entry instead.
This patch mirrors the existing PTE swap-out path at PMD
granularity:
- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
swapcache folios, gated on zswap_never_enabled() since zswap
cannot reconstruct a 2 MB folio from per-page blobs (zswap
support is best handled separately).
- try_to_unmap_one() now has a PMD branch that calls
set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by
HPAGE_PMD_NR before walk_done. TTU_SPLIT_HUGE_PMD remains the
fallback.
- set_pmd_swap_entry() is the installer. Mirroring the PTE
swap-out sequence at PMD granularity, it clears the present
mapping (keeping the original for rollback), bumps the swap_map
refcount for the folio's 512 slots, drops the exclusive mark if
the page was anon-exclusive, propagates the dirty bit to the
folio so writeback is not lost, and installs a swap PMD that
preserves the original soft-dirty / uffd-wp / exclusive bits.
Any failing step rolls back the present mapping.
The swap entry value matches what 512 PTE swap entries would
encode, so swap_map refcounting is unchanged: each of the 512 slots
carries a count of 1, released individually on later split or
together on swap-in.
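As an illustration (a sketch reusing this series' helpers, with
hypothetical variables assumed in scope, not code from the patch):
for a THP whose first subpage occupies swap slot `offset' on device
`type', the two representations encode the same information:
	/* PTE split: 512 entries, one per subpage */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		set_pte_at(mm, addr + i * PAGE_SIZE, pte + i,
			   swp_entry_to_pte(swp_entry(type, offset + i)));
	/* PMD path: one entry encoding the starting slot */
	set_pmd_at(mm, haddr, pmd, softleaf_to_pmd(swp_entry(type, offset)));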
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 78 +++++++++++++++++++++++++++++++++++
mm/rmap.c | 20 +++++++++
mm/vmscan.c | 14 ++++++-
mm/vmstat.c | 1 +
6 files changed, 115 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93ee6c36d6ea..cbfac4720fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -524,6 +524,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
#ifdef CONFIG_THP_SWAP
vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio);
#else
static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
+ THP_SWPOUT_PMD,
#endif
#ifdef CONFIG_BALLOON
BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 141ab45adee4..47ff7fb9ee9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5497,3 +5497,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
trace_remove_migration_pmd(address, pmd_val(pmde));
}
#endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ * pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *page = folio_page(folio, 0);
+ bool anon_exclusive;
+ pmd_t pmdval;
+ swp_entry_t entry;
+ pmd_t pmdswp;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return 0;
+
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ WARN_ON_ONCE(1);
+ return -EBUSY;
+ }
+
+ flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+ pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
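+ /*
+ * Take the swap_map references (one per subpage) that the new
+ * swap PMD will hold; roll the present PMD back on failure.
+ */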
+ if (folio_dup_swap(folio, NULL) < 0) {
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -ENOMEM;
+ }
+
+ /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+ anon_exclusive = PageAnonExclusive(page);
+ if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+ folio_put_swap(folio, NULL);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -EBUSY;
+ }
+
+ if (pmd_dirty(pmdval))
+ folio_mark_dirty(folio);
+
+ entry = folio->swap;
+ pmdswp = softleaf_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ if (pmd_uffd_wp(pmdval))
+ pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ if (anon_exclusive)
+ pmdswp = pmd_swp_mkexclusive(pmdswp);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
+
+ count_vm_event(THP_SWPOUT_PMD);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 057e18cb80b0..b188213648c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2077,6 +2077,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_abort;
}
+#ifdef CONFIG_THP_SWAP
+ /*
+ * If the folio is in the swap cache and we're not
+ * asked to split, install a PMD-level swap entry.
+ */
+ if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+ folio_test_anon(folio) &&
+ folio_test_swapcache(folio)) {
+ if (set_pmd_swap_entry(&pvmw, folio))
+ goto walk_abort;
+
+ ensure_on_mmlist(mm);
+ add_mm_counter(mm, MM_ANONPAGES,
+ -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS,
+ HPAGE_PMD_NR);
+ goto walk_done;
+ }
+#endif
+
if (flags & TTU_SPLIT_HUGE_PMD) {
/*
* We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..e895aaade8f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
#include <linux/swapops.h>
#include <linux/sched/sysctl.h>
+#include <linux/zswap.h>
#include "internal.h"
#include "swap.h"
@@ -1330,7 +1331,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum ttu_flags flags = TTU_BATCH_FLUSH;
bool was_swapbacked = folio_test_swapbacked(folio);
- if (folio_test_pmd_mappable(folio))
+ /*
+ * With THP_SWAP, PMD-mappable folios already in the
+ * swap cache can be unmapped with a PMD-level swap
+ * entry, avoiding the cost of splitting the PMD.
+ * Skip this when zswap has been enabled because
+ * zswap stores pages individually and cannot
+ * reconstruct a large folio on swap-in.
+ */
+ if (folio_test_pmd_mappable(folio) &&
+ !(IS_ENABLED(CONFIG_THP_SWAP) &&
+ folio_test_swapcache(folio) &&
+ zswap_never_enabled()))
flags |= TTU_SPLIT_HUGE_PMD;
/*
* Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..9b4963a7eb04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
[I(THP_SWPOUT)] = "thp_swpout",
[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
+ [I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
#endif
#ifdef CONFIG_BALLOON
[I(BALLOON_INFLATE)] = "balloon_inflate",
--
2.52.0
* [PATCH 13/13] selftests/mm: add PMD swap entry tests
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (11 preceding siblings ...)
2026-04-27 10:02 ` [PATCH 12/13] mm: install PMD swap entries on swap-out Usama Arif
@ 2026-04-27 10:02 ` Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 10:02 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Usama Arif
Exercise the PMD swap entry paths. Each test allocates a PMD-mapped
THP, writes a known pattern, swaps it out via MADV_PAGEOUT, and then
drives a different code path:
- swap-out / swap-in round-trip with data verification
- fork with read-only access from both parent and child
- fork with writes in both processes to verify COW isolation
- repeated swap cycles to try to catch reference-counting issues
- write fault on a swapped PMD to verify dirty handling
- munmap of a swapped PMD (zap_huge_pmd swap slot cleanup)
- mprotect on a swapped PMD (change_non_present_huge_pmd)
- mremap of a swapped PMD (move_soft_dirty_pmd)
- pagemap reading (pagemap_pmd_range_thp softleaf_has_pfn guard)
- MADV_FREE on a swapped PMD: verifies swap slots are freed via
pagemap and the memory reads back as zero
- UFFDIO_MOVE on a swapped PMD (move_pages_huge_pmd swap path);
verifies the entry transfers without splitting and that the
destination faults back in as a THP
- swapoff with active PMD swap entries (unuse_pmd_range split)
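Note on running the swapoff test: it needs a dedicated swap device it
may toggle, read from the PMD_SWAP_DEVICE environment variable (e.g.
PMD_SWAP_DEVICE=/dev/vdb ./pmd_swap, device path illustrative), and
it skips when the variable is unset.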
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
2 files changed, 608 insertions(+)
create mode 100644 tools/testing/selftests/mm/pmd_swap.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..3c753dba863f 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
TEST_GEN_FILES += merge
TEST_GEN_FILES += rmap
TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += pmd_swap
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/pmd_swap.c b/tools/testing/selftests/mm/pmd_swap.c
new file mode 100644
index 000000000000..28147ddd824c
--- /dev/null
+++ b/tools/testing/selftests/mm/pmd_swap.c
@@ -0,0 +1,607 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test PMD-level swap entries.
+ *
+ * Verifies that when a PMD-mapped THP is swapped out the kernel installs
+ * a single PMD-level swap entry (instead of splitting into 512 PTE-level
+ * entries), and that operations on the swapped region behave correctly:
+ * basic - swap out + swap in preserves data
+ * fork - parent and child both see the data
+ * fork_cow - COW after fork keeps parent's data isolated
+ * cycles - repeated swap out/in does not corrupt data
+ * write - faulting in via a write keeps the rest of the THP
+ * munmap - munmap on a PMD swap entry frees swap slots cleanly
+ * mprotect - mprotect on a PMD swap entry preserves data
+ * mremap - mremap on a PMD swap entry preserves data
+ * pagemap - pagemap reports the entries as swapped
+ * madvise_free - MADV_FREE on a PMD swap entry does not crash
+ * uffdio_move - UFFDIO_MOVE moves a PMD swap entry whole-PMD
+ * swapoff - swapoff faults the THP back in (needs PMD_SWAP_DEVICE)
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+#include <sys/random.h>
+#include <sys/swap.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <time.h>
+
+#include "kselftest_harness.h"
+#include "vm_util.h"
+
+static bool check_swapped(int pagemap_fd, char *addr, unsigned long size)
+{
+ unsigned long off;
+
+ for (off = 0; off < size; off += getpagesize())
+ if (!pagemap_is_swapped(pagemap_fd, addr + off))
+ return false;
+ return true;
+}
+
+static bool swap_available(int pagemap_fd)
+{
+ char *p;
+ bool ret;
+
+ p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (p == MAP_FAILED)
+ return false;
+
+ memset(p, 0xab, getpagesize());
+ madvise(p, getpagesize(), MADV_PAGEOUT);
+ ret = pagemap_is_swapped(pagemap_fd, p);
+ munmap(p, getpagesize());
+ return ret;
+}
+
+static unsigned long read_vm_event(const char *name)
+{
+ char line[256];
+ size_t name_len = strlen(name);
+ unsigned long val = 0;
+ FILE *f;
+
+ f = fopen("/proc/vmstat", "r");
+ if (!f)
+ return 0;
+ while (fgets(line, sizeof(line), f)) {
+ if (!strncmp(line, name, name_len) && line[name_len] == ' ') {
+ val = strtoul(line + name_len + 1, NULL, 10);
+ break;
+ }
+ }
+ fclose(f);
+ return val;
+}
+
+static unsigned int random_seed(void)
+{
+ unsigned int seed;
+
+ if (getrandom(&seed, sizeof(seed), 0) != sizeof(seed))
+ seed = (unsigned int)time(NULL);
+ return seed;
+}
+
+static unsigned char pattern_byte(unsigned int seed, unsigned long off)
+{
+ return (unsigned char)(seed + off);
+}
+
+static void fill_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+ unsigned long i;
+
+ for (i = 0; i < size; i++)
+ buf[i] = (char)pattern_byte(seed, i);
+}
+
+static bool verify_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+ unsigned long i;
+
+ for (i = 0; i < size; i++)
+ if ((unsigned char)buf[i] != pattern_byte(seed, i))
+ return false;
+ return true;
+}
+
+/*
+ * mmap a PMD-sized region, request THP, fill with a pattern, and swap
+ * it out. Verifies via the thp_swpout_pmd vmstat counter that the
+ * swap-out installed a PMD swap entry rather than splitting to PTEs.
+ */
+static char *alloc_fill_swap_thp(unsigned long pmd_size, int pagemap_fd,
+ unsigned int seed)
+{
+ unsigned long pmd_before, pmd_after;
+ char *mem;
+
+ mem = mmap(NULL, pmd_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (mem == MAP_FAILED)
+ return MAP_FAILED;
+
+ madvise(mem, pmd_size, MADV_HUGEPAGE);
+ fill_pattern(mem, pmd_size, seed);
+
+ pmd_before = read_vm_event("thp_swpout_pmd");
+
+ if (madvise(mem, pmd_size, MADV_PAGEOUT) ||
+ !check_swapped(pagemap_fd, mem, pmd_size)) {
+ munmap(mem, pmd_size);
+ return MAP_FAILED;
+ }
+
+ pmd_after = read_vm_event("thp_swpout_pmd");
+ printf("# thp_swpout_pmd: %lu -> %lu\n", pmd_before, pmd_after);
+ if (pmd_after - pmd_before < 1) {
+ munmap(mem, pmd_size);
+ return MAP_FAILED;
+ }
+ return mem;
+}
+
+FIXTURE(pmd_swap)
+{
+ unsigned long pmd_size;
+ int pagemap_fd;
+ unsigned int seed;
+};
+
+FIXTURE_SETUP(pmd_swap)
+{
+ self->pagemap_fd = -1;
+
+ self->pmd_size = read_pmd_pagesize();
+ if (!self->pmd_size)
+ SKIP(return, "Cannot determine PMD size\n");
+
+ self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (self->pagemap_fd < 0)
+ SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+ if (!swap_available(self->pagemap_fd))
+ SKIP(return, "Swap not available or not working\n");
+
+ self->seed = random_seed();
+}
+
+FIXTURE_TEARDOWN(pmd_swap)
+{
+ if (self->pagemap_fd >= 0)
+ close(self->pagemap_fd);
+}
+
+/*
+ * Allocate a PMD-sized THP, write a pattern, swap it out, read it back,
+ * verify the pattern.
+ */
+TEST_F(pmd_swap, basic)
+{
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Allocate a THP, swap it out, fork, verify both parent and child see
+ * the correct data.
+ */
+TEST_F(pmd_swap, fork)
+{
+ char *mem;
+ pid_t pid;
+ int status;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ _exit(verify_pattern(mem, self->pmd_size, self->seed) ? 0 : 1);
+ }
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+ ASSERT_EQ(waitpid(pid, &status, 0), pid);
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 0);
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap out, fork, then have parent and child write different patterns.
+ * Exercises COW on shared PMD swap entries: writes after fork must
+ * trigger copy-on-write so the parent's data stays isolated.
+ */
+TEST_F(pmd_swap, fork_cow)
+{
+ unsigned int parent_seed = self->seed;
+ unsigned int child_seed = ~self->seed;
+ char *mem;
+ pid_t pid;
+ int status;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, parent_seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ fill_pattern(mem, self->pmd_size, child_seed);
+ _exit(verify_pattern(mem, self->pmd_size, child_seed) ? 0 : 1);
+ }
+
+ ASSERT_EQ(waitpid(pid, &status, 0), pid);
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, parent_seed));
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 0);
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap a THP out and in repeatedly without data corruption.
+ */
+TEST_F(pmd_swap, cycles)
+{
+ const int num_cycles = 5;
+ char *mem;
+ int cycle;
+
+ for (cycle = 0; cycle < num_cycles; cycle++) {
+ unsigned int seed = self->seed + cycle;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP at cycle %d\n",
+ cycle);
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+ munmap(mem, self->pmd_size);
+ }
+}
+
+/*
+ * Swap out, fault in via a write to the first page, verify the write
+ * sticks and the rest of the THP is preserved.
+ */
+TEST_F(pmd_swap, write)
+{
+ unsigned int seed = self->seed;
+ char *mem;
+ unsigned long i;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ mem[0] = 0xbb;
+ ASSERT_EQ(mem[0], (char)0xbb);
+
+ for (i = 1; i < self->pmd_size; i++)
+ ASSERT_EQ((unsigned char)mem[i], pattern_byte(seed, i));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * munmap while the folio is swapped out. Exercises zap_huge_pmd() on a
+ * PMD swap entry — must free the swap slots without trying to look up
+ * a folio.
+ */
+TEST_F(pmd_swap, munmap)
+{
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Change protection on a swapped PMD entry, then fault back in and
+ * verify data. Exercises change_non_present_huge_pmd().
+ */
+TEST_F(pmd_swap, mprotect)
+{
+ unsigned int seed = self->seed;
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ), 0);
+ ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ | PROT_WRITE), 0);
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * mmap an anonymous PMD-aligned region of pmd_size bytes. Over-allocates
+ * by one PMD and trims the unaligned head/tail so the returned address is
+ * PMD-aligned (required for whole-PMD UFFDIO_MOVE).
+ */
+static char *mmap_pmd_aligned(unsigned long pmd_size)
+{
+ unsigned long pad = pmd_size;
+ char *raw, *aligned;
+
+ raw = mmap(NULL, pmd_size + pad, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (raw == MAP_FAILED)
+ return MAP_FAILED;
+
+ aligned = (char *)(((uintptr_t)raw + pmd_size - 1) & ~(pmd_size - 1));
+ if (aligned != raw)
+ munmap(raw, aligned - raw);
+ if (aligned + pmd_size != raw + pmd_size + pad)
+ munmap(aligned + pmd_size,
+ (raw + pmd_size + pad) - (aligned + pmd_size));
+ return aligned;
+}
+
+/*
+ * UFFDIO_MOVE a PMD swap entry from src to a registered dst. Exercises
+ * move_pages_huge_pmd() handling of pmd_is_swap_entry: the whole PMD swap
+ * entry must move to dst without splitting, and the destination must
+ * read back the original pattern after a swap-in fault.
+ */
+TEST_F(pmd_swap, uffdio_move)
+{
+ unsigned int seed = self->seed;
+ struct uffdio_register reg = {};
+ struct uffdio_move move = {};
+ struct uffdio_api api = {};
+ char *src, *dst;
+ int uffd;
+
+ dst = mmap_pmd_aligned(self->pmd_size);
+ if (dst == MAP_FAILED)
+ SKIP(return, "Could not mmap aligned dst\n");
+
+ src = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (src == MAP_FAILED) {
+ munmap(dst, self->pmd_size);
+ SKIP(return, "Could not create swapped THP\n");
+ }
+ if ((uintptr_t)src & (self->pmd_size - 1)) {
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "src not PMD-aligned\n");
+ }
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "userfaultfd unavailable\n");
+ }
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_MOVE;
+ if (ioctl(uffd, UFFDIO_API, &api) ||
+ !(api.features & UFFD_FEATURE_MOVE)) {
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "UFFD_FEATURE_MOVE unsupported\n");
+ }
+
+ reg.range.start = (unsigned long)dst;
+ reg.range.len = self->pmd_size;
+ reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "UFFDIO_REGISTER failed\n");
+ }
+
+ move.dst = (unsigned long)dst;
+ move.src = (unsigned long)src;
+ move.len = self->pmd_size;
+ if (ioctl(uffd, UFFDIO_MOVE, &move)) {
+ int err = errno; /* save before cleanup can clobber it */
+
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ ASSERT_EQ(err, 0);
+ }
+ ASSERT_EQ(move.move, self->pmd_size);
+
+ /*
+ * dst inherits the PMD swap entry; reading it must fault the THP
+ * back in via do_huge_pmd_swap_page() and yield the original data.
+ */
+ ASSERT_TRUE(check_swapped(self->pagemap_fd, dst, self->pmd_size));
+ ASSERT_TRUE(verify_pattern(dst, self->pmd_size, seed));
+ /* The whole-PMD path must reinstate a THP, not 512 PTE folios. */
+ ASSERT_TRUE(check_huge_anon(dst, 1, self->pmd_size));
+
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+}
+
+/*
+ * Move a swapped PMD entry to a new address, fault in, verify data.
+ * Exercises move_huge_pmd() and move_soft_dirty_pmd().
+ */
+TEST_F(pmd_swap, mremap)
+{
+ unsigned int seed = self->seed;
+ char *mem, *new_mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ new_mem = mremap(mem, self->pmd_size, self->pmd_size, MREMAP_MAYMOVE);
+ if (new_mem == MAP_FAILED) {
+ munmap(mem, self->pmd_size);
+ ASSERT_NE(new_mem, MAP_FAILED);
+ }
+
+ ASSERT_TRUE(verify_pattern(new_mem, self->pmd_size, seed));
+
+ munmap(new_mem, self->pmd_size);
+}
+
+/*
+ * Read /proc/self/pagemap on a PMD swap entry. Exercises the pagemap
+ * PMD walker which must handle PMD swap entries without trying to
+ * convert them to a page via softleaf_to_page().
+ */
+TEST_F(pmd_swap, pagemap)
+{
+ char *mem;
+ uint64_t entry;
+ unsigned long off;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ for (off = 0; off < self->pmd_size; off += getpagesize()) {
+ entry = pagemap_get_entry(self->pagemap_fd, mem + off);
+ /* Bit 62 = swapped */
+ ASSERT_TRUE(entry & (1ULL << 62));
+ }
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * MADV_FREE on a swapped-out PMD must free the swap slots and clear the
+ * entry. After the call, pagemap must no longer report the pages as
+ * swapped, and accessing the region must yield zero pages.
+ */
+TEST_F(pmd_swap, madvise_free)
+{
+ char *mem;
+ unsigned long i;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+ ASSERT_EQ(madvise(mem, self->pmd_size, MADV_FREE), 0);
+ ASSERT_FALSE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+
+ for (i = 0; i < self->pmd_size; i += getpagesize())
+ ASSERT_EQ(mem[i], 0);
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * swapoff requires a dedicated swap device path. Use a separate fixture
+ * that picks the device up from the PMD_SWAP_DEVICE environment variable
+ * and skips when unset.
+ */
+FIXTURE(pmd_swap_swapoff)
+{
+ unsigned long pmd_size;
+ int pagemap_fd;
+ const char *swap_dev;
+ unsigned int seed;
+};
+
+FIXTURE_SETUP(pmd_swap_swapoff)
+{
+ self->pagemap_fd = -1;
+ self->swap_dev = getenv("PMD_SWAP_DEVICE");
+ if (!self->swap_dev)
+ SKIP(return, "PMD_SWAP_DEVICE env var not set\n");
+
+ self->pmd_size = read_pmd_pagesize();
+ if (!self->pmd_size)
+ SKIP(return, "Cannot determine PMD size\n");
+
+ self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (self->pagemap_fd < 0)
+ SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+ if (!swap_available(self->pagemap_fd))
+ SKIP(return, "Swap not available or not working\n");
+
+ self->seed = random_seed();
+}
+
+FIXTURE_TEARDOWN(pmd_swap_swapoff)
+{
+ if (self->pagemap_fd >= 0)
+ close(self->pagemap_fd);
+}
+
+/*
+ * Swap out a THP, then turn off swap. The kernel must fault the entire
+ * THP back in via unuse_pmd(), preserving the huge mapping. Verify data
+ * is intact and the THP mapping is preserved.
+ */
+TEST_F(pmd_swap_swapoff, basic)
+{
+ unsigned int seed = self->seed;
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ if (swapoff(self->swap_dev)) {
+ int err = errno; /* report the failure without retrying swapoff */
+
+ munmap(mem, self->pmd_size);
+ ASSERT_EQ(err, 0);
+ }
+
+ if (!verify_pattern(mem, self->pmd_size, seed)) {
+ /* restore swap before failing; mem is unusable after munmap */
+ swapon(self->swap_dev, 0);
+ munmap(mem, self->pmd_size);
+ ASSERT_TRUE(false);
+ }
+
+ if (!check_huge_anon(mem, 1, self->pmd_size)) {
+ swapon(self->swap_dev, 0);
+ munmap(mem, self->pmd_size);
+ ASSERT_TRUE(false);
+ }
+
+ if (swapon(self->swap_dev, 0))
+ fprintf(stderr, "Warning: swapon failed: %s\n",
+ strerror(errno));
+
+ munmap(mem, self->pmd_size);
+}
+
+TEST_HARNESS_MAIN
--
2.52.0
* Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (12 preceding siblings ...)
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
@ 2026-04-27 13:38 ` Usama Arif
2026-04-27 18:26 ` Zi Yan
14 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 13:38 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy, Hugh Dickins
Cc: bhe, willy, youngjun.park, hannes, riel, shakeel.butt, alex, kas,
baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team
On 27/04/2026 11:01, Usama Arif wrote:
> [...]
>
> This work was brought about after Hugh reported that one of the
> major blockers for having lazy page table deposit is the lack of
> PMD swap entries [1]. However, this series has benefits of its
> own:
+Hugh. Hugh raised this in [1], and I completely forgot to add him
to the series, sorry about that!
[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
* Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (13 preceding siblings ...)
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
@ 2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12 ` Usama Arif
14 siblings, 1 reply; 17+ messages in thread
From: Zi Yan @ 2026-04-27 18:26 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, chrisl, kasong, ljs, bhe, willy,
youngjun.park, hannes, riel, shakeel.butt, alex, kas, baohua,
dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Ying Huang
+Ying, who did the original THP swap work[1].
[1] https://lkml.org/lkml/2016/8/9/588
On 27 Apr 2026, at 6:01, Usama Arif wrote:
> [...]
>
> Core patches:
> 5. PMD swap entry detection (pmd_is_swap_entry,
> softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
> helpers (x86/arm64/s390/riscv/loongarch).
> 6. __split_huge_pmd_locked() learns to split a PMD swap entry
> into 512 PTE swap entries, used as the fallback when a
> PMD-order resource is unavailable.
I was wondering how to handle insufficient memory during swap-in.
Here it is. I have not read the code, but the split should be
straightforward, since we already have a contiguous swap space at
swap-out time and the split is just to enable PTE-level swap in, right?
> 7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
> in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
> copy_pte_range().
> 8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
> the PMD; falls back to PTE-split + unuse_pte_range() on error.
> 9. Walker updates: zap_huge_pmd, change_huge_pmd,
> change_non_present_huge_pmd, move_soft_dirty_pmd,
> clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
> queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
> and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
> VM_BUG_ON extensions.
> 10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
> entry whole via a new move_swap_pmd() helper modeled on
> move_swap_pte().
> 11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
> one shot. Handles racing splits, SWP_STABLE_WRITES read-only
> mapping, immediate COW for write faults; falls back to PTE-split
> on any PMD-order resource shortfall.
> 12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
> PMD-mappable swapcache folios (when zswap is disabled), and
> try_to_unmap_one() installs one PMD swap entry via
> set_pmd_swap_entry() instead of splitting.
>
> Testing:
> 13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
> repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
> MADV_FREE, UFFDIO_MOVE, swapoff.
>
> Making PMD swap entries work with zswap is a project of its own and
> should be a separate follow-up series.
>
> The patches are on top of mm-unstable from 23 April
> (2bcc13c29c711381d815c1ba5d5b25737400c71a).
>
> [1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
> [2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
>
> Usama Arif (13):
> mm: add softleaf_to_pmd() and convert existing callers
> mm: extract ensure_on_mmlist() helper
> fs/proc: use softleaf_has_pfn() in pagemap PMD walker
> mm/huge_memory: move softleaf_to_folio() inside migration branch
> mm: add PMD swap entry detection support
> mm: add PMD swap entry splitting support
> mm: handle PMD swap entries in fork path
> mm: swap in PMD swap entries as whole THPs during swapoff
> mm: handle PMD swap entries in non-present PMD walkers
> mm: handle PMD swap entries in UFFDIO_MOVE
> mm: handle PMD swap entry faults on swap-in
> mm: install PMD swap entries on swap-out
> selftests/mm: add PMD swap entry tests
>
> arch/arm64/include/asm/pgtable.h | 4 +
> arch/loongarch/include/asm/pgtable.h | 17 +
> arch/riscv/include/asm/pgtable.h | 15 +
> arch/s390/include/asm/pgtable.h | 15 +
> arch/x86/include/asm/pgtable.h | 15 +
> fs/proc/task_mmu.c | 47 +-
> include/linux/huge_mm.h | 11 +
> include/linux/leafops.h | 44 +-
> include/linux/swap.h | 4 +-
> include/linux/vm_event_item.h | 1 +
> mm/hmm.c | 3 +-
> mm/huge_memory.c | 540 +++++++++++++++++++++--
> mm/internal.h | 49 +++
> mm/khugepaged.c | 6 +
> mm/madvise.c | 5 +-
> mm/memory.c | 51 +--
> mm/mempolicy.c | 2 +
> mm/rmap.c | 27 +-
> mm/swap.h | 7 +
> mm/swap_state.c | 35 ++
> mm/swapfile.c | 144 +++++-
> mm/vmscan.c | 14 +-
> mm/vmstat.c | 1 +
> tools/testing/selftests/mm/Makefile | 1 +
> tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
> 25 files changed, 1554 insertions(+), 111 deletions(-)
> create mode 100644 tools/testing/selftests/mm/pmd_swap.c
>
> --
> 2.52.0
Best Regards,
Yan, Zi
* Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
2026-04-27 18:26 ` Zi Yan
@ 2026-04-27 20:12 ` Usama Arif
0 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2026-04-27 20:12 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, david, chrisl, kasong, ljs, bhe, willy,
youngjun.park, hannes, riel, shakeel.butt, alex, kas, baohua,
dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts,
Vlastimil Babka, lance.yang, linux-kernel, nphamcs, shikemeng,
kernel-team, Ying Huang, Linux Memory Management List
On 27/04/2026 19:26, Zi Yan wrote:
> +Ying, who did the original THP swap work[1].
>
> [1] https://lkml.org/lkml/2016/8/9/588
>
Thanks Zi!
Sorry Ying for not CCing you! checkpatch on the whole series produced
a really long list and I wasn't sure if people would start thinking of
it as spam. I added reviewers and maintainers of swap and THP, plus a
few folks who commented on the previous related work that this kicked
off from. I should have just CC'ed everyone.
> On 27 Apr 2026, at 6:01, Usama Arif wrote:
>
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in,
>> no khugepaged involvement is needed. swap_map metadata is identical
>> either way (512 single-slot counts), so the PTE split buys nothing
>> on the swap side, it is purely a page-table representation change.
>>
>> This work was brought about after Hugh reported that one of the
>> major blockers for having lazy page table deposit is the lack of
>> PMD swap entries [1]. However, this series has benefits of its
>> own:
>> - The huge mapping is restored on swap-in. Today even when the
>> folio is still in swap cache as a single 2 MB folio, the swap-in
>> path installs 512 PTE mappings -- the PMD mapping is gone, the
>> freshly-materialised PTE table sticks around, and only
>> khugepaged can later collapse the range back into a THP.
>> do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>> one fault, no khugepaged involvement.
>> - Memory saved per swapped-out THP *once lazy page table deposit is
>> merged* [2]. With lazy page table deposit [2], splitting a PMD into
>> 512 PTE swap entries forces allocation of a 4 KB PTE table page.
>> The new path leaves the pgtable hierarchy at PMD level and avoids
>> that allocation entirely.
>> This will save memory when swapping, which is likely when there is
>> memory pressure and exactly when allocations are most likely to
>> fail.
>> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>> visit one PMD entry instead of 512 PTEs, reducing traversal
>> time and lock-hold windows.
>>
>> The swap entry value is identical to 512 PTE swap entries (same
>> type, same starting offset), so swap_map refcounting is unchanged.
>> Only the page-table representation differs; the swap slot allocator,
>> swap I/O, and swap cache are untouched. The new path falls back to
>> the existing PTE-split path whenever a PMD-order resource is
>> unavailable: zswap enabled, non-contiguous swap allocation
>> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
>> or fork, racing folio split, or rmap-driven split on a swapcache
>> folio. Walkers that previously assumed every non-present PMD encodes
>> a PFN (migration / device_private) are taught to recognise PMD swap
>> entries.
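To make the walker changes concrete: the detection added in patch 5
boils down to classifying a non-present PMD by its entry type. A
simplified illustration (the series builds this on the softleaf API,
so the real helper reads differently):

	static inline bool pmd_is_swap_entry(pmd_t pmd)
	{
		swp_entry_t entry;

		if (pmd_none(pmd) || pmd_present(pmd))
			return false;

		entry = pmd_to_swp_entry(pmd);
		/* migration and device-private entries carry a PFN */
		return !is_migration_entry(entry) &&
		       !is_device_private_entry(entry);
	}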
>>
>> Patch breakdown:
>>
>> The series is ordered to preserve git bisectability: every consumer
>> of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
>> swap-in fault) lands before the producer. The swap-out path that
>> actually installs PMD swap entries is the very last functional patch
>> (12), so no intermediate commit can leave the kernel handling a
>> PMD swap entry it does not yet understand.
>>
>> The first 4 patches are preparatory patches. Some of them (like the
>> softleaf_to_pmd() change in patch 1) are not strictly needed, but
>> they are included to improve code quality and to make the PMD swap
>> entry changes integrate well with the rest of mm.
>>
>> Prep patches:
>> 1. mm: add softleaf_to_pmd() and convert existing callers
>> PMD counterpart to softleaf_to_pte(); needed to construct a
>> PMD from a swap entry in later patches.
>> 2. mm: extract ensure_on_mmlist() helper
>> Hoists the "register mm with swapoff" double-checked-locking
>> pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
>> the PMD swap-out and PMD fork paths can reuse it without a
>> third open-coded copy.
>> 3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>> pagemap_pmd_range_thp() today calls softleaf_to_page()
>> unconditionally; a PMD swap entry has no PFN and would crash
>> it.
>> 4. mm/huge_memory: move softleaf_to_folio() inside migration branch
>> change_non_present_huge_pmd() today calls softleaf_to_folio()
>> before branching on entry type, so a PMD swap entry would
>> produce a bogus folio pointer that the migration-only code
>> below would then dereference.
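For anyone skimming patch 1: softleaf_to_pmd() just mirrors
softleaf_to_pte(). Assuming the softleaf representation stays
interchangeable with swp_entry_t, it is little more than:

	/* illustrative only; representation and naming assumed */
	static inline pmd_t softleaf_to_pmd(softleaf_t entry)
	{
		return swp_entry_to_pmd(entry);
	}

plus converting the existing open-coded construction sites over.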
>>
>> Core patches:
>> 5. PMD swap entry detection (pmd_is_swap_entry,
>> softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>> helpers (x86/arm64/s390/riscv/loongarch).
>> 6. __split_huge_pmd_locked() learns to split a PMD swap entry
>> into 512 PTE swap entries, used as the fallback when a
>> PMD-order resource is unavailable.
>
> I was wondering how to handle insufficient memory during swap-in.
> Here it is. I have not read the code, but the split should be
> straightforward, since we already have a contiguous swap space at
> swap-out time and the split is just to enable PTE-level swap in, right?
>
Yes, that is correct. Patch 6 was actually one of the easier patches.
If the kernel can't allocate a 2M folio, if the memcg charge fails, or
for a few other reasons, we split the THP.
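Since the 512 swap slots behind the PMD entry are contiguous, the
split is just rewriting one entry as 512 consecutive PTE-level ones.
Conceptually (heavily trimmed; the pgtable deposit/withdraw and the
soft-dirty/uffd-wp bit propagation are omitted):

	swp_entry_t entry = pmd_to_swp_entry(*pmd);
	unsigned long offset = swp_offset(entry);
	pte_t *pte;	/* maps the freshly deposited page table */
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		swp_entry_t sub = swp_entry(swp_type(entry), offset + i);

		set_pte_at(mm, addr, pte + i, swp_entry_to_pte(sub));
	}

swap_map refcounts do not change at all, since they were per-slot to
begin with.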
>> 7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>> in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>> copy_pte_range().
>> 8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>> the PMD; falls back to PTE-split + unuse_pte_range() on error.
>> 9. Walker updates: zap_huge_pmd, change_huge_pmd,
>> change_non_present_huge_pmd, move_soft_dirty_pmd,
>> clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
>> queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
>> and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
>> VM_BUG_ON extensions.
>> 10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
>> entry whole via a new move_swap_pmd() helper modeled on
>> move_swap_pte().
>> 11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
>> one shot. Handles racing splits, SWP_STABLE_WRITES read-only
>> mapping, immediate COW for write faults; falls back to PTE-split
>> on any PMD-order resource shortfall.
>> 12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
>> PMD-mappable swapcache folios (when zswap is disabled), and
>> try_to_unmap_one() installs one PMD swap entry via
>> set_pmd_swap_entry() instead of splitting.
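For patch 12, the install is the PMD analogue of the PTE-level
swap-entry install in try_to_unmap_one(). In outline (simplified, not
the exact patch; pmd_swp_mkexclusive() is the per-arch helper patch 5
adds):

	/* simplified outline, details differ in the real patch */
	static void set_pmd_swap_entry(struct vm_area_struct *vma,
				       pmd_t *pmdp, unsigned long address,
				       pmd_t old_pmd, swp_entry_t entry,
				       bool anon_exclusive)
	{
		pmd_t swp_pmd = swp_entry_to_pmd(entry);

		if (pmd_soft_dirty(old_pmd))
			swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
		if (pmd_uffd_wp(old_pmd))
			swp_pmd = pmd_swp_mkuffd_wp(swp_pmd);
		if (anon_exclusive)
			swp_pmd = pmd_swp_mkexclusive(swp_pmd);

		set_pmd_at(vma->vm_mm, address, pmdp, swp_pmd);
	}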
>>
>> Testing:
>> 13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
>> repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
>> MADV_FREE, UFFDIO_MOVE, swapoff.
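The simplest of the selftests is, in compressed form, just this (an
illustration of the shape, not the actual selftest source):

	#define _GNU_SOURCE
	#include <assert.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t sz = 2UL << 20;		/* one PMD */
		char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		madvise(buf, sz, MADV_HUGEPAGE);  /* ask for a THP */
		memset(buf, 0xab, sz);            /* fault it in */
		madvise(buf, sz, MADV_PAGEOUT);   /* push it to swap */
		/* one read fault should bring the whole 2M back */
		assert(buf[12345] == (char)0xab);
		return 0;
	}

with the real tests going further (fork, mprotect, mremap, pagemap
checks, and so on, per the list above).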
>>
>> Making PMD swap entries work with zswap is a project of its own and
>> should be a separate follow-up series.
>>
>> The patches are on top of mm-unstable from 23 April
>> (2bcc13c29c711381d815c1ba5d5b25737400c71a).
>>
>> [1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
>> [2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
>>
>> Usama Arif (13):
>> mm: add softleaf_to_pmd() and convert existing callers
>> mm: extract ensure_on_mmlist() helper
>> fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>> mm/huge_memory: move softleaf_to_folio() inside migration branch
>> mm: add PMD swap entry detection support
>> mm: add PMD swap entry splitting support
>> mm: handle PMD swap entries in fork path
>> mm: swap in PMD swap entries as whole THPs during swapoff
>> mm: handle PMD swap entries in non-present PMD walkers
>> mm: handle PMD swap entries in UFFDIO_MOVE
>> mm: handle PMD swap entry faults on swap-in
>> mm: install PMD swap entries on swap-out
>> selftests/mm: add PMD swap entry tests
>>
>> arch/arm64/include/asm/pgtable.h | 4 +
>> arch/loongarch/include/asm/pgtable.h | 17 +
>> arch/riscv/include/asm/pgtable.h | 15 +
>> arch/s390/include/asm/pgtable.h | 15 +
>> arch/x86/include/asm/pgtable.h | 15 +
>> fs/proc/task_mmu.c | 47 +-
>> include/linux/huge_mm.h | 11 +
>> include/linux/leafops.h | 44 +-
>> include/linux/swap.h | 4 +-
>> include/linux/vm_event_item.h | 1 +
>> mm/hmm.c | 3 +-
>> mm/huge_memory.c | 540 +++++++++++++++++++++--
>> mm/internal.h | 49 +++
>> mm/khugepaged.c | 6 +
>> mm/madvise.c | 5 +-
>> mm/memory.c | 51 +--
>> mm/mempolicy.c | 2 +
>> mm/rmap.c | 27 +-
>> mm/swap.h | 7 +
>> mm/swap_state.c | 35 ++
>> mm/swapfile.c | 144 +++++-
>> mm/vmscan.c | 14 +-
>> mm/vmstat.c | 1 +
>> tools/testing/selftests/mm/Makefile | 1 +
>> tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
>> 25 files changed, 1554 insertions(+), 111 deletions(-)
>> create mode 100644 tools/testing/selftests/mm/pmd_swap.c
>>
>> --
>> 2.52.0
>
>
> Best Regards,
> Yan, Zi
Thread overview: 17+ messages
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-04-27 10:02 ` [PATCH 12/13] mm: install PMD swap entries on swap-out Usama Arif
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12 ` Usama Arif