* [v2 00/16] mm: PMD-level swap entries for anonymous THPs
@ 2026-06-02 14:24 Usama Arif
From: Usama Arif @ 2026-06-02 14:24 UTC
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
unmap.
This series introduces a PMD-level swap entry. The huge mapping is
preserved across the swap round-trip, and do_huge_pmd_swap_page()
resolves the entire 2 MB region in a single fault on swap-in; no
khugepaged involvement is needed. swap_map metadata is identical
either way (512 single-slot counts), so the PTE split buys nothing
on the swap side; it is purely a page-table representation change.
This work was brought about after Hugh reported that one of the
major blockers for having lazy page table deposit is the lack of
PMD swap entries [1]. However, this series has benefits of its
own:
- The huge mapping is restored on swap-in. Today even when the
folio is still in swap cache as a single 2 MB folio, the swap-in
path installs 512 PTE mappings -- the PMD mapping is gone, the
freshly-materialised PTE table sticks around, and only
khugepaged can later collapse the range back into a THP.
do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
one fault, no khugepaged involvement.
- Memory saved per swapped-out THP *once lazy page table deposit is
merged* [2]. With lazy page table deposit [2], splitting a PMD into
512 PTE swap entries forces allocation of a 4 KB PTE table page.
The new path leaves the pgtable hierarchy at PMD level and avoids
that allocation entirely.
This saves memory during swap-out, which happens under memory
pressure, exactly when allocations are most likely to fail.
- Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
visit one PMD entry instead of 512 PTEs, reducing traversal
time and lock-hold windows.
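To put rough numbers on the last two bullets, here is a small
back-of-the-envelope sketch in C. It assumes 4 KB base pages and 2 MB
PMDs (the sizes used above) and that the lazy page table deposit
series [2] is applied. The MODEL_* names and both helpers are invented
for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: 4 KB base pages, 2 MB PMD mappings. */
#define MODEL_PAGE_SIZE		4096ull
#define MODEL_PMD_SIZE		(2ull << 20)
#define MODEL_PTRS_PER_PTE	(MODEL_PMD_SIZE / MODEL_PAGE_SIZE)	/* 512 */

/*
 * Bytes of PTE table pages that are not allocated when every swapped-out
 * THP keeps a PMD-level swap entry instead of being split (one 4 KB PTE
 * table per THP). Hypothetical helper, not part of the series.
 */
static uint64_t model_pte_table_bytes_avoided(uint64_t swapped_bytes)
{
	assert(swapped_bytes % MODEL_PMD_SIZE == 0);
	return (swapped_bytes / MODEL_PMD_SIZE) * MODEL_PAGE_SIZE;
}

/*
 * Number of page-table entries a walker has to visit to cover the
 * swapped-out range: one per THP at PMD level, 512 per THP after a split.
 */
static uint64_t model_walker_entries(uint64_t swapped_bytes, int pmd_level)
{
	uint64_t nr_thps = swapped_bytes / MODEL_PMD_SIZE;

	assert(swapped_bytes % MODEL_PMD_SIZE == 0);
	return pmd_level ? nr_thps : nr_thps * MODEL_PTRS_PER_PTE;
}
```

For 1 GiB of swapped-out THPs (512 of them) this model gives 2 MiB of
PTE table pages avoided, and 512 rather than 262144 entries for a
walker to visit.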
The swap entry value is identical to 512 PTE swap entries (same
type, same starting offset), so swap_map refcounting is unchanged.
Only the page-table representation differs; the swap slot allocator,
swap I/O, and swap cache are untouched. The new path falls back to
the existing PTE-split path whenever a PMD-order resource is
unavailable: zswap enabled, non-contiguous swap allocation
(THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
or fork, racing folio split, or rmap-driven split on a swapcache
folio. Walkers that previously assumed every non-present PMD encodes
a PFN (migration / device_private) are taught to recognise PMD swap
entries.
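The relationship between one PMD swap entry and the 512 PTE swap
entries it stands for can be modelled in userspace as follows. This is
only a sketch: struct model_swp_entry is a hypothetical stand-in, not
the kernel's swp_entry_t encoding, and the function is not the series'
__split_huge_pmd_locked() change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2 MB / 4 KB, as in the cover letter. Illustrative name. */
#define MODEL_HPAGE_PMD_NR 512u

/* Hypothetical userspace model of a swap entry: device type + slot offset. */
struct model_swp_entry {
	unsigned int type;
	uint64_t offset;
};

/*
 * Expand one PMD-level entry into the PTE-level entries it represents:
 * same type, consecutive offsets starting at the PMD entry's offset.
 * @ptes must have room for MODEL_HPAGE_PMD_NR entries.
 */
static void model_split_pmd_swap_entry(struct model_swp_entry pmd_entry,
				       struct model_swp_entry *ptes)
{
	assert(ptes != NULL);
	for (unsigned int i = 0; i < MODEL_HPAGE_PMD_NR; i++) {
		ptes[i].type = pmd_entry.type;
		ptes[i].offset = pmd_entry.offset + i;
	}
}
```

Because the per-slot entries are derivable from the PMD entry alone,
nothing on the swap side needs to know which representation the page
table currently uses, which is the point the paragraph above makes.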
Patch breakdown:
The series is ordered to preserve git bisectability: every consumer
of a PMD swap entry (split, fork, swapoff, walkers, MADV_WILLNEED,
UFFDIO_MOVE, swap-in fault) lands before the producer. The swap-out
path that actually installs PMD swap entries is the very last
functional patch (15), so no intermediate commit can leave the
kernel handling a PMD swap entry it does not yet understand.
The first 6 patches are preparatory. Some of them (like the
softleaf_to_pmd() change in patch 1) are not strictly needed, but are
included to improve code quality and so that the PMD swap entry
changes look well integrated with the rest of mm.
Prep patches:
1. mm: add softleaf_to_pmd() and convert existing callers
PMD counterpart to softleaf_to_pte(); needed to construct a
PMD from a swap entry in later patches.
2. mm: extract mm_prepare_for_swap_entries() helper
Hoists the "register mm with swapoff" double-checked-locking
pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
the PMD swap-out and PMD fork paths can reuse it without a
third open-coded copy.
3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
pagemap_pmd_range_thp() today calls softleaf_to_page()
unconditionally; a PMD swap entry has no PFN and would crash
it.
4. mm/huge_memory: move softleaf_to_folio() inside migration branch
change_non_present_huge_pmd() today calls softleaf_to_folio()
before branching on entry type, so a PMD swap entry would
produce a bogus folio pointer that the migration-only code
below would then dereference.
5. mm/migrate_device: move softleaf_to_folio() inside device-private
branch
migrate_vma_collect_pmd() has the same pre-check ordering issue
as patch 4 in the migrate-device PMD walker; move the folio
lookup inside the device-private check.
6. mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF
The config gates the entire PMD softleaf machinery (migration,
device-private, and now swap), not just migration; rename to
match. Pure rename, no behavioural change.
Core patches:
7. PMD swap entry detection (pmd_is_swap_entry,
softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
helpers (x86/arm64/s390/riscv/loongarch/powerpc).
8. __split_huge_pmd_locked() learns to split a PMD swap entry
into 512 PTE swap entries, used as the fallback when a
PMD-order resource is unavailable.
9. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
in one swap_dup_entries_direct(HPAGE_PMD_NR) call, with GFP_KERNEL
retry on per-cluster table-allocation failure mirroring
copy_pte_range().
10. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
the PMD; falls back to PTE-split + unuse_pte_range() on error.
11. Walker updates: zap_huge_pmd, change_huge_pmd,
change_non_present_huge_pmd, move_soft_dirty_pmd,
clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
mincore_pte_range, and the madvise_cold_or_pageout_pte_range
/ madvise_free_huge_pmd VM_BUG_ON extensions.
12. MADV_WILLNEED: swapin_walk_pmd_entry() reads the whole 2 MB
folio in at PMD order via swapin_sync(BIT(HPAGE_PMD_ORDER)),
so the subsequent fault hits do_huge_pmd_swap_page() and
restores the THP mapping. A naive order-0 read-ahead would
force the fault to split.
13. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
entry whole via a new move_swap_pmd() helper modeled on
move_swap_pte().
14. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
one shot. Handles racing splits, SWP_STABLE_WRITES read-only
mapping, immediate COW for write faults; falls back to PTE-split
on any PMD-order resource shortfall.
15. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
PMD-mappable swapcache folios (when zswap is disabled), and
try_to_unmap_one() installs one PMD swap entry via
set_pmd_swap_entry() instead of splitting.
Testing:
16. selftests/mm: 13 tests covering swap-out/in, fork, fork+COW,
repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
MADV_FREE, MADV_WILLNEED, UFFDIO_MOVE, swapoff.
Making PMD swap entries work with zswap is a project of its own and
should be handled in a separate follow-up series.
The patches are on top of mm-new from 31 May
(415489ef1cdfe586b4992662bee65286d50232e6).
[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
[2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
v1 -> v2: https://lore.kernel.org/all/20260427100553.2754667-1-usama.arif@linux.dev/
- Patch 1: convert two additional softleaf_to_pmd() callers that
landed in mm-unstable since v1 (mm/debug_vm_pgtable.c,
mm/migrate_device.c) (Dev)
- Patch 2: rename helper ensure_on_mmlist() to
mm_prepare_for_swap_entries() to better describe its purpose
(David)
- Patch 3: drop VM_WARN_ON_ONCE(!pmd_is_migration_entry) as
Dev posted it as a separate patch.
- Patch 5 (new): move softleaf_to_folio() inside the device-private
branch in migrate_vma_collect_pmd(); same class of fix as patch 4
but for the migrate-device PMD walker.
- Patch 6 (new): rename CONFIG_ARCH_ENABLE_THP_MIGRATION to
CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF so the gate that now drives
swap-entry support too is named for what it actually controls
(PMD softleaf entries), not just migration. (Dev)
- Patch 7: add the missing pmd_swp_exclusive / mkexclusive /
clear_exclusive helpers for powerpc.
- Patches 10 and 14: use upstream swapin_sync() (bundles
swap_cache_alloc_folio + swap_read_folio + the -EEXIST race
retry) instead of the bespoke swapin_alloc_pmd_folio() helper
from v1; do_swap_page and shmem_swapin_folio use the same
helper (Kairui)
- Patch 10: construct a stack vm_fault for the swapoff swap-in so
the allocator can resolve a mempolicy, mirroring how the PTE
swapoff path (unuse_pte_range) already does it.
- Patch 11: extend coverage to check_pmd_state() in khugepaged so a
swapped-out PMD-mapped THP is treated as SCAN_PMD_MAPPED (matches
the existing migration-entry handling). Route the
pmd_trans_huge_lock() branch of mincore_pte_range() through
mincore_swap() so a swapped-out PMD-mapped THP isn't reported as
resident.
- Patch 12 (new): handle PMD swap entries in MADV_WILLNEED via
swapin_sync(BIT(HPAGE_PMD_ORDER)); a naive order-0 read-ahead
would force the subsequent fault to split.
- Patch 13: refuse UFFDIO_MOVE with -EBUSY if the swap-cache folio
was split between swap-out and the move, matching
move_pages_pte()'s rejection of large folios; otherwise only one
of the 512 anon-rmaps would be re-anchored to dst_vma.
- Patch 16: alloc_fill_swap_thp() now uses the existing
mmap_pmd_aligned() helper so tests don't flake/skip based on VA
placement; new MADV_WILLNEED test that watches the PMD-order
mTHP swpin counter; swapoff test restructured to use the
kselftest_harness ASSERT cleanup blocks (no double swapoff, no
verify-after-munmap).
- Collected Acks and Reviews.
Usama Arif (16):
mm: add softleaf_to_pmd() and convert existing callers
mm: extract mm_prepare_for_swap_entries() helper
fs/proc: use softleaf_has_pfn() in pagemap PMD walker
mm/huge_memory: move softleaf_to_folio() inside migration branch
mm/migrate_device: move softleaf_to_folio() inside device-private
branch
mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF
mm: add PMD swap entry detection support
mm: add PMD swap entry splitting support
mm: handle PMD swap entries in fork path
mm: swap in PMD swap entries as whole THPs during swapoff
mm: handle PMD swap entries in non-present PMD walkers
mm: handle PMD swap entries in MADV_WILLNEED
mm: handle PMD swap entries in UFFDIO_MOVE
mm: handle PMD swap entry faults on swap-in
mm: install PMD swap entries on swap-out
selftests/mm: add PMD swap entry tests
arch/arm64/Kconfig | 2 +-
arch/arm64/include/asm/pgtable.h | 8 +-
arch/loongarch/Kconfig | 2 +-
arch/loongarch/include/asm/pgtable.h | 17 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 17 +-
arch/powerpc/platforms/Kconfig.cputype | 2 +-
arch/riscv/Kconfig | 2 +-
arch/riscv/include/asm/pgtable.h | 23 +-
arch/s390/Kconfig | 2 +-
arch/s390/include/asm/pgtable.h | 17 +-
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable.h | 17 +-
fs/proc/task_mmu.c | 46 +-
include/linux/huge_mm.h | 13 +-
include/linux/leafops.h | 52 +-
include/linux/pgtable.h | 2 +-
include/linux/swap.h | 4 +-
include/linux/swapops.h | 6 +-
include/linux/vm_event_item.h | 1 +
mm/Kconfig | 2 +-
mm/debug_vm_pgtable.c | 12 +-
mm/hmm.c | 7 +-
mm/huge_memory.c | 543 ++++++++++++++-
mm/internal.h | 49 ++
mm/khugepaged.c | 6 +
mm/madvise.c | 45 +-
mm/memory.c | 51 +-
mm/mempolicy.c | 2 +
mm/migrate.c | 4 +-
mm/migrate_device.c | 19 +-
mm/mincore.c | 14 +-
mm/rmap.c | 29 +-
mm/swapfile.c | 152 ++++-
mm/vmscan.c | 14 +-
mm/vmstat.c | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/pmd_swap.c | 672 +++++++++++++++++++
37 files changed, 1702 insertions(+), 156 deletions(-)
create mode 100644 tools/testing/selftests/mm/pmd_swap.c
--
2.52.0
* [v2 01/16] mm: add softleaf_to_pmd() and convert existing callers
From: Usama Arif @ 2026-06-02 14:24 UTC
Add softleaf_to_pmd() as the PMD counterpart to softleaf_to_pte(),
completing the symmetry of the softleaf abstraction for page table
leaf entries.
The upcoming PMD swap entry support needs to construct PMD entries
from swap entries. Converting existing swp_entry_to_pmd() callers
to softleaf_to_pmd() in a prep patch keeps the feature patches
focused on new functionality rather than mixing refactoring with
new code.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/leafops.h | 20 ++++++++++++++++++++
mm/debug_vm_pgtable.c | 4 ++--
mm/huge_memory.c | 12 ++++++------
mm/migrate_device.c | 2 +-
4 files changed, 29 insertions(+), 9 deletions(-)
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 992cd8bd8ed0..803d312437df 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -108,6 +108,21 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
+/**
+ * softleaf_to_pmd() - Obtain a PMD entry from a leaf entry.
+ * @entry: Leaf entry.
+ *
+ * This generates an architecture-specific PMD entry that can be utilised to
+ * encode the metadata the leaf entry encodes.
+ *
+ * Returns: Architecture-specific PMD entry encoding leaf entry.
+ */
+static inline pmd_t softleaf_to_pmd(softleaf_t entry)
+{
+ /* Temporary until swp_entry_t eliminated. */
+ return swp_entry_to_pmd(entry);
+}
+
#else
static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
@@ -115,6 +130,11 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
return softleaf_mk_none();
}
+static inline pmd_t softleaf_to_pmd(softleaf_t entry)
+{
+ return __pmd(0);
+}
+
#endif
/**
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 23dc3ee09561..18411fb09aab 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -758,7 +758,7 @@ static void __init pmd_leaf_soft_dirty_tests(struct pgtable_debug_args *args)
return;
pr_debug("Validating PMD swap soft dirty\n");
- pmd = swp_entry_to_pmd(args->leaf_entry);
+ pmd = softleaf_to_pmd(args->leaf_entry);
WARN_ON(!pmd_is_huge(pmd));
WARN_ON(!pmd_is_valid_softleaf(pmd));
@@ -829,7 +829,7 @@ static void __init pmd_softleaf_tests(struct pgtable_debug_args *args)
return;
pr_debug("Validating PMD swap\n");
- pmd1 = swp_entry_to_pmd(args->leaf_entry);
+ pmd1 = softleaf_to_pmd(args->leaf_entry);
WARN_ON(!pmd_is_huge(pmd1));
WARN_ON(!pmd_is_valid_softleaf(pmd1));
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7f172f3257e8..15913a37b6df 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1820,7 +1820,7 @@ static void copy_huge_non_present_pmd(
if (softleaf_is_migration_write(entry) ||
softleaf_is_migration_read_exclusive(entry)) {
entry = make_readable_migration_entry(swp_offset(entry));
- pmd = swp_entry_to_pmd(entry);
+ pmd = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
if (pmd_swp_uffd_wp(*src_pmd))
@@ -1833,7 +1833,7 @@ static void copy_huge_non_present_pmd(
*/
if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
- pmd = swp_entry_to_pmd(entry);
+ pmd = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
@@ -2571,12 +2571,12 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
entry = make_readable_exclusive_migration_entry(swp_offset(entry));
else
entry = make_readable_migration_entry(swp_offset(entry));
- newpmd = swp_entry_to_pmd(entry);
+ newpmd = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
} else if (softleaf_is_device_private_write(entry)) {
entry = make_readable_device_private_entry(swp_offset(entry));
- newpmd = swp_entry_to_pmd(entry);
+ newpmd = softleaf_to_pmd(entry);
if (pmd_swp_uffd_wp(*pmd))
newpmd = pmd_swp_mkuffd_wp(newpmd);
} else {
@@ -4901,7 +4901,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
entry = make_migration_entry_young(entry);
if (pmd_dirty(pmdval))
entry = make_migration_entry_dirty(entry);
- pmdswp = swp_entry_to_pmd(entry);
+ pmdswp = softleaf_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
if (pmd_uffd_wp(pmdval))
@@ -4952,7 +4952,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
else
entry = make_readable_device_private_entry(
page_to_pfn(new));
- pmde = swp_entry_to_pmd(entry);
+ pmde = softleaf_to_pmd(entry);
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_swp_mksoft_dirty(pmde);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 554754eb26ff..ab93a8d11b70 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -835,7 +835,7 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
else
swp_entry = make_readable_device_private_entry(
page_to_pfn(page));
- entry = swp_entry_to_pmd(swp_entry);
+ entry = softleaf_to_pmd(swp_entry);
} else {
if (folio_is_zone_device(folio) &&
!folio_is_device_coherent(folio)) {
--
2.52.0
* [v2 02/16] mm: extract mm_prepare_for_swap_entries() helper
From: Usama Arif @ 2026-06-02 14:24 UTC
When a swap entry is installed in a page table, the mm must be added
to init_mm.mmlist so that swapoff can find and unuse its swap entries.
This double-checked locking pattern is currently open-coded in
try_to_unmap_one() and copy_nonpresent_pte().
Move it into mm_prepare_for_swap_entries() in mm/internal.h and convert
both callers so it can be reused by upcoming PMD-level swap entry code
paths that also need to register the mm with swapoff.
copy_nonpresent_pte() previously inserted into &src_mm->mmlist rather
than &init_mm.mmlist, but the insertion point is irrelevant: mmlist
is a circular list and swapoff walks it entirely from init_mm.mmlist,
so only membership matters, not position.
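The "membership, not position" argument can be checked with a minimal
userspace model of a circular doubly-linked list. This is a sketch
only: struct model_list and its helpers are hypothetical stand-ins for
list_head, locking is omitted, and it assumes the source mm is already
on the list when it is used as the insertion point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct list_head. */
struct model_list {
	struct model_list *next, *prev;
};

static void model_list_init(struct model_list *head)
{
	head->next = head->prev = head;
}

/* An initialised node that points at itself is on no list. */
static bool model_list_empty(const struct model_list *head)
{
	return head->next == head;
}

/* Insert @node right after @pos, wherever @pos sits in the ring. */
static void model_list_add(struct model_list *node, struct model_list *pos)
{
	node->next = pos->next;
	node->prev = pos;
	pos->next->prev = node;
	pos->next = node;
}

/* Walk the whole ring starting from @head, as a full traversal would. */
static bool model_list_reachable(const struct model_list *head,
				 const struct model_list *target)
{
	for (const struct model_list *p = head->next; p != head; p = p->next)
		if (p == target)
			return true;
	return false;
}
```

Adding one node after the source entry and another directly after the
head leaves both reachable by a walk that starts at the head, which is
all a full traversal of the list relies on.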
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/internal.h | 13 +++++++++++++
mm/memory.c | 9 +--------
mm/rmap.c | 7 +------
3 files changed, 15 insertions(+), 14 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..ace2f8ef1d35 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1951,4 +1951,17 @@ static inline int get_sysctl_max_map_count(void)
bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
unsigned long npages);
+/*
+ * Ensure @mm is on the init_mm.mmlist so swapoff can find it.
+ */
+static inline void mm_prepare_for_swap_entries(struct mm_struct *mm)
+{
+ if (list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&mm->mmlist))
+ list_add(&mm->mmlist, &init_mm.mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+}
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 56be920c56d7..137f34c3fd32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,14 +953,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (swap_dup_entry_direct(entry) < 0)
return -EIO;
- /* make sure dst_mm is on swapoff's mmlist. */
- if (unlikely(list_empty(&dst_mm->mmlist))) {
- spin_lock(&mmlist_lock);
- if (list_empty(&dst_mm->mmlist))
- list_add(&dst_mm->mmlist,
- &src_mm->mmlist);
- spin_unlock(&mmlist_lock);
- }
+ mm_prepare_for_swap_entries(dst_mm);
/* Mark the swap entry as shared. */
if (pte_swp_exclusive(orig_pte)) {
pte = pte_swp_clear_exclusive(orig_pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c77d5dc06e9..b93caabd186f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2304,12 +2304,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
- if (list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- if (list_empty(&mm->mmlist))
- list_add(&mm->mmlist, &init_mm.mmlist);
- spin_unlock(&mmlist_lock);
- }
+ mm_prepare_for_swap_entries(mm);
dec_mm_counter(mm, MM_ANONPAGES);
inc_mm_counter(mm, MM_SWAPENTS);
swp_pte = swp_entry_to_pte(entry);
--
2.52.0
* [v2 03/16] fs/proc: use softleaf_has_pfn() in pagemap PMD walker
From: Usama Arif @ 2026-06-02 14:24 UTC
pagemap_pmd_range_thp() assumes that every non-present PMD is a
migration entry and unconditionally calls softleaf_to_page(). This
will crash on any non-present PMD type that does not encode a PFN,
such as the upcoming PMD-level swap entries.
Guard the page lookup with softleaf_has_pfn(), matching how
pte_to_pagemap_entry() already handles non-present PTEs.
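The guard can be illustrated with a simplified userspace model. The
enum, the structs and the 16-entry model_mem_map below are all
hypothetical; the real softleaf_has_pfn() covers more entry types than
the three modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical set of non-present entry types. */
enum model_leaf_type {
	MODEL_LEAF_SWAP,		/* offset is a swap slot, not a PFN */
	MODEL_LEAF_MIGRATION,		/* offset is a PFN */
	MODEL_LEAF_DEVICE_PRIVATE,	/* offset is a PFN */
};

struct model_leaf {
	enum model_leaf_type type;
	unsigned long offset;
};

struct model_page {
	int unused;
};

#define MODEL_NR_PAGES 16ul
static struct model_page model_mem_map[MODEL_NR_PAGES];

static bool model_leaf_has_pfn(struct model_leaf entry)
{
	return entry.type != MODEL_LEAF_SWAP;
}

/* Only valid for PFN-carrying entries; a swap offset would index garbage. */
static struct model_page *model_leaf_to_page(struct model_leaf entry)
{
	assert(model_leaf_has_pfn(entry));
	assert(entry.offset < MODEL_NR_PAGES);
	return &model_mem_map[entry.offset];
}

/* The guarded lookup: no page for entries that do not encode a PFN. */
static struct model_page *model_pagemap_lookup(struct model_leaf entry)
{
	if (!model_leaf_has_pfn(entry))
		return NULL;
	return model_leaf_to_page(entry);
}
```

With the guard, a swap entry whose offset is far outside the page
array yields no page instead of an out-of-range lookup, while
PFN-carrying entries resolve as before.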
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/proc/task_mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d32408f7cd5e..1fb5acd88ad0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2129,7 +2129,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_SOFT_DIRTY;
if (pmd_swp_uffd_wp(pmd))
flags |= PM_UFFD_WP;
- page = softleaf_to_page(entry);
+ if (softleaf_has_pfn(entry))
+ page = softleaf_to_page(entry);
}
if (page) {
--
2.52.0
* [v2 04/16] mm/huge_memory: move softleaf_to_folio() inside migration branch
From: Usama Arif @ 2026-06-02 14:24 UTC
change_non_present_huge_pmd() calls softleaf_to_folio() unconditionally
at the top of the function. softleaf_to_folio() extracts a PFN from
the entry and converts it to a folio pointer, which is only meaningful
for migration and device_private entries that encode a real PFN.
A swap entry encodes a swap offset instead, so softleaf_to_folio()
would produce a bogus pointer and crash on mprotect() when a PMD swap
entry is present.
Move the call into the migration_write branch where the folio is
actually used, so the function is safe for any non-present PMD type.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15913a37b6df..b7b76eef6617 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2558,11 +2558,12 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
bool uffd_wp_resolve)
{
softleaf_t entry = softleaf_from_pmd(*pmd);
- const struct folio *folio = softleaf_to_folio(entry);
pmd_t newpmd;
VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
if (softleaf_is_migration_write(entry)) {
+ const struct folio *folio = softleaf_to_folio(entry);
+
/*
* A protection check is difficult so
* just be safe and disable write
--
2.52.0
* [v2 05/16] mm/migrate_device: move softleaf_to_folio() inside device-private branch
From: Usama Arif @ 2026-06-02 14:24 UTC
migrate_vma_collect_pmd() calls softleaf_to_folio() on a non-present
PMD before checking the entry's type. softleaf_to_folio() converts
the entry's offset to a PFN, which is only meaningful for migration
or device-private entries.
A PMD swap entry's offset is a swap offset, not a PFN, so the
lookup would either return a bogus folio pointer or trip pfn_to_page
validation on a debug kernel. In the non-device-private path the
returned folio is then unused (the OR short-circuits to
migrate_vma_collect_skip()), but the lookup itself is already
unsafe.
Move the softleaf_to_folio() call inside the device-private branch
where the folio is actually needed, mirroring the equivalent
change_non_present_huge_pmd() fix.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/migrate_device.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index ab93a8d11b70..87f079b64265 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -166,11 +166,14 @@ static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
} else if (!pmd_present(*pmdp)) {
const softleaf_t entry = softleaf_from_pmd(*pmdp);
- folio = softleaf_to_folio(entry);
-
if (!softleaf_is_device_private(entry) ||
- !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
- (folio->pgmap->owner != migrate->pgmap_owner)) {
+ !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE)) {
+ spin_unlock(ptl);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
+
+ folio = softleaf_to_folio(entry);
+ if (folio->pgmap->owner != migrate->pgmap_owner) {
spin_unlock(ptl);
return migrate_vma_collect_skip(start, end, walk);
}
--
2.52.0
* [v2 06/16] mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF
From: Usama Arif @ 2026-06-02 14:24 UTC
CONFIG_ARCH_ENABLE_THP_MIGRATION started life gating just PMD-level
migration entries, but has grown to gate the entire PMD-level softleaf
machinery: migration entries, device-private entries, and soon swap
entries.
Rename CONFIG_ARCH_ENABLE_THP_MIGRATION to
CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF to make this clear. This is a pure
rename: the set of selecting architectures (x86, arm64, s390, riscv,
loongarch, and powerpc on PPC_BOOK3S_64) and the gating semantics are
unchanged.
No functional change intended.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/Kconfig | 2 +-
arch/arm64/include/asm/pgtable.h | 4 ++--
arch/loongarch/Kconfig | 2 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +-
arch/powerpc/platforms/Kconfig.cputype | 2 +-
arch/riscv/Kconfig | 2 +-
arch/riscv/include/asm/pgtable.h | 8 ++++----
arch/s390/Kconfig | 2 +-
arch/s390/include/asm/pgtable.h | 2 +-
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable.h | 2 +-
include/linux/huge_mm.h | 2 +-
include/linux/leafops.h | 8 ++++----
include/linux/pgtable.h | 2 +-
include/linux/swapops.h | 6 +++---
mm/Kconfig | 2 +-
mm/debug_vm_pgtable.c | 8 ++++----
mm/hmm.c | 4 ++--
mm/huge_memory.c | 2 +-
mm/migrate.c | 4 ++--
mm/migrate_device.c | 6 +++---
mm/rmap.c | 2 +-
22 files changed, 38 insertions(+), 38 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..c6da904b0339 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,7 +17,7 @@ config ARM64
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_MEMORY_HOTPLUG
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
- select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_CC_PLATFORM
select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4dfa42b7d053..623099303c7b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1534,10 +1534,10 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(swp) ((pte_t) { (swp).val })
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
#define __swp_entry_to_pmd(swp) __pmd((swp).val)
-#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
/*
* Ensure that there are not more swap files than can be encoded in the kernel
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 606597da46b8..20ea972a876c 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -12,7 +12,7 @@ config LOONGARCH
select ARCH_NEEDS_DEFER_KASAN
select ARCH_DISABLE_KASAN_INLINE
select ARCH_ENABLE_MEMORY_HOTPLUG
- select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
select ARCH_HAS_CPU_FINALIZE_INIT
select ARCH_HAS_CURRENT_STACK_POINTER
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..6f30aa8a6490 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1060,7 +1060,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_mksoft_dirty(pmd) pte_pmd(pte_mksoft_dirty(pmd_pte(pmd)))
#define pmd_clear_soft_dirty(pmd) pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd)))
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
#define pmd_swp_mksoft_dirty(pmd) pte_pmd(pte_swp_mksoft_dirty(pmd_pte(pmd)))
#define pmd_swp_soft_dirty(pmd) pte_swp_soft_dirty(pmd_pte(pmd))
#define pmd_swp_clear_soft_dirty(pmd) pte_pmd(pte_swp_clear_soft_dirty(pmd_pte(pmd)))
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..4a0fa681bf98 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -112,7 +112,7 @@ config PPC_THP
depends on PPC_RADIX_MMU || (PPC_64S_HASH_MMU && PAGE_SIZE_64KB)
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
- select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE
choice
prompt "CPU selection"
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c5754942cf85..de463524dab1 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -22,7 +22,7 @@ config RISCV
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM_VMEMMAP
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
- select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index a1a7c6520a09..52cfd7df228b 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -936,7 +936,7 @@ static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd)
return pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd)));
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
static inline bool pmd_swp_soft_dirty(pmd_t pmd)
{
return pte_swp_soft_dirty(pmd_pte(pmd));
@@ -951,7 +951,7 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
{
return pte_pmd(pte_swp_clear_soft_dirty(pmd_pte(pmd)));
}
-#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1198,10 +1198,10 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
#define __swp_entry_to_pmd(swp) __pmd((swp).val)
-#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
/*
* In the RV64 Linux scheme, we give the user half of the virtual-address space
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ecbcbb781e40..046866a0b44d 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -85,7 +85,7 @@ config S390
select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
- select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE
select ARCH_HAS_CC_CAN_LINK
select ARCH_HAS_CPU_FINALIZE_INIT
select ARCH_HAS_CURRENT_STACK_POINTER
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c6cee8241e0..83d4516825f0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -903,7 +903,7 @@ static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd)
return clear_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_SOFT_DIRTY));
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
#define pmd_swp_soft_dirty(pmd) pmd_soft_dirty(pmd)
#define pmd_swp_mksoft_dirty(pmd) pmd_mksoft_dirty(pmd)
#define pmd_swp_clear_soft_dirty(pmd) pmd_clear_soft_dirty(pmd)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3f7cb01d69d..33c6920555b1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -70,7 +70,7 @@ config X86
select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)
- select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
+ select ARCH_SUPPORTS_PMD_SOFTLEAF if X86_64 && TRANSPARENT_HUGEPAGE
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
select ARCH_HAS_CPU_ATTACK_VECTORS if CPU_MITIGATIONS
select ARCH_HAS_CACHE_LINE_SIZE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2187e9cfcefa..6efc7980c95a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1533,7 +1533,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad20f7f8c179..1487bf4af1a7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -567,7 +567,7 @@ static inline struct folio *get_persistent_huge_zero_folio(void)
static inline bool thp_migration_supported(void)
{
- return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
+ return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF);
}
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 803d312437df..88888daeb018 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -81,7 +81,7 @@ static inline pte_t softleaf_to_pte(softleaf_t entry)
return swp_entry_to_pte(entry);
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
/**
* softleaf_from_pmd() - Obtain a leaf entry from a PMD entry.
* @pmd: PMD entry.
@@ -587,7 +587,7 @@ static inline bool pte_is_uffd_marker(pte_t pte)
return false;
}
-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF)
/**
* pmd_is_device_private_entry() - Check if PMD contains a device private swap
@@ -606,14 +606,14 @@ static inline bool pmd_is_device_private_entry(pmd_t pmd)
return softleaf_is_device_private(softleaf_from_pmd(pmd));
}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
static inline bool pmd_is_device_private_entry(pmd_t pmd)
{
return false;
}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
/**
* pmd_is_migration_entry() - Does this PMD entry encode a migration entry?
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a..5ee80194e052 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1781,7 +1781,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
#endif
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifndef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
{
return pmd;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8cfc966eae48..705a84154d28 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -321,7 +321,7 @@ static inline swp_entry_t make_guard_swp_entry(void)
struct page_vma_mapped_walk;
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
extern int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct page *page);
@@ -338,7 +338,7 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
return __swp_entry_to_pmd(arch_entry);
}
-#else /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#else /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct page *page)
{
@@ -358,7 +358,7 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
return __pmd(0);
}
-#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
#endif /* CONFIG_MMU */
#endif /* _LINUX_SWAPOPS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..3a3bbe000f85 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -650,7 +650,7 @@ config DEVICE_MIGRATION
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
-config ARCH_ENABLE_THP_MIGRATION
+config ARCH_SUPPORTS_PMD_SOFTLEAF
bool
config HUGETLB_PAGE_SIZE_VARIABLE
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 18411fb09aab..507fbd1ae7e5 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -751,7 +751,7 @@ static void __init pmd_leaf_soft_dirty_tests(struct pgtable_debug_args *args)
pmd_t pmd;
if (!pgtable_supports_soft_dirty() ||
- !IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION))
+ !IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF))
return;
if (!has_transparent_hugepage())
@@ -819,7 +819,7 @@ static void __init pte_swap_tests(struct pgtable_debug_args *args)
WARN_ON(memcmp(&pte1, &pte2, sizeof(pte1)));
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
static void __init pmd_softleaf_tests(struct pgtable_debug_args *args)
{
swp_entry_t arch_entry;
@@ -837,9 +837,9 @@ static void __init pmd_softleaf_tests(struct pgtable_debug_args *args)
pmd2 = __swp_entry_to_pmd(arch_entry);
WARN_ON(memcmp(&pmd1, &pmd2, sizeof(pmd1)));
}
-#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#else /* !CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
static void __init pmd_softleaf_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
static void __init swap_migration_tests(struct pgtable_debug_args *args)
{
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83d..cabf111f2ed2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -331,7 +331,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
return hmm_vma_fault(addr, end, required_fault, walk);
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
unsigned long end, unsigned long *hmm_pfns,
pmd_t pmd)
@@ -391,7 +391,7 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
return -EFAULT;
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
}
-#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
static int hmm_vma_walk_pmd(pmd_t *pmdp,
unsigned long start,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7b76eef6617..af6a9c20131a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4861,7 +4861,7 @@ static int __init split_huge_pages_debugfs(void)
late_initcall(split_huge_pages_debugfs);
#endif
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct page *page)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index d9b23909d716..6f6518960882 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -362,7 +362,7 @@ static bool remove_migration_pte(struct folio *folio,
idx = linear_page_index(vma, pvmw.address) - pvmw.pgoff;
new = folio_page(folio, idx);
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
/* PMD-mapped THP migration entry */
if (!pvmw.pte) {
VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
@@ -545,7 +545,7 @@ void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, p
}
#endif
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
{
spinlock_t *ptl;
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 87f079b64265..af336bcedeb3 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -771,7 +771,7 @@ int migrate_vma_setup(struct migrate_vma *args)
}
EXPORT_SYMBOL(migrate_vma_setup);
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
/**
* migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
* at @addr. folio is already allocated as a part of the migration process with
@@ -926,7 +926,7 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
return ret;
}
-#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#else /* !CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */
static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
unsigned long addr,
struct page *page,
@@ -947,7 +947,7 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
static unsigned long migrate_vma_nr_pages(unsigned long *src)
{
unsigned long nr = 1;
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
if (*src & MIGRATE_PFN_COMPOUND)
nr = HPAGE_PMD_NR;
#else
diff --git a/mm/rmap.c b/mm/rmap.c
index b93caabd186f..0fb7a1b82cf3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2472,7 +2472,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
page_vma_mapped_walk_restart(&pvmw);
continue;
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF
pmdval = pmdp_get(pvmw.pmd);
if (likely(pmd_present(pmdval)))
pfn = pmd_pfn(pmdval);
--
2.52.0
* [v2 07/16] mm: add PMD swap entry detection support
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (5 preceding siblings ...)
2026-06-02 14:24 ` [v2 06/16] mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 08/16] mm: add PMD swap entry splitting support Usama Arif
` (9 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Currently when a PMD-mapped THP is swapped out, the PMD is always split
into 512 PTE-level swap entries. To preserve huge page information
across swap cycles, later patches will install a single PMD-level swap
entry instead. This patch adds the infrastructure to detect those
entries.
Teach the softleaf layer to recognise PMD swap entries:
pmd_is_swap_entry() detects them and softleaf_is_valid_pmd_entry()
accepts them as a valid non-present type. Clear the exclusive overlay
bit in softleaf_from_pmd() before decoding, matching how soft_dirty and
uffd_wp bits are already stripped.
Add pmd_swp_mkexclusive(), pmd_swp_exclusive(), and
pmd_swp_clear_exclusive() helpers to each architecture that supports
THP migration (x86, arm64, s390, riscv, loongarch, powerpc),
mirroring the existing PTE swap exclusive helpers in each arch's
pgtable.h.
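The helper trio and the decode-time stripping described above can be
sketched in userspace C. This is only a model under stated assumptions:
the bit positions, the model_* names and the type/offset packing are
hypothetical and do not correspond to any architecture's real swap PMD
layout. It shows why the exclusive bit has to be cleared before
decoding: two PMDs that differ only in overlay bits must decode to the
same leaf entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only; real layouts are per-arch. */
#define MODEL_SWP_EXCLUSIVE	(UINT64_C(1) << 0)
#define MODEL_SWP_SOFT_DIRTY	(UINT64_C(1) << 1)
#define MODEL_SWP_UFFD_WP	(UINT64_C(1) << 2)
#define MODEL_SWP_OVERLAY	(MODEL_SWP_EXCLUSIVE | MODEL_SWP_SOFT_DIRTY | \
				 MODEL_SWP_UFFD_WP)
#define MODEL_TYPE_SHIFT	3
#define MODEL_TYPE_MASK		UINT64_C(0x1f)
#define MODEL_OFFSET_SHIFT	8

typedef struct { uint64_t val; } model_pmd_t;	/* stands in for pmd_t */
typedef struct { uint64_t val; } model_leaf_t;	/* stands in for softleaf_t */

/* Build a non-present "swap PMD" with no overlay bits set. */
static inline model_pmd_t model_mk_swap_pmd(unsigned int type, uint64_t offset)
{
	model_pmd_t pmd;

	assert(type <= MODEL_TYPE_MASK);
	pmd.val = ((uint64_t)type << MODEL_TYPE_SHIFT) |
		  (offset << MODEL_OFFSET_SHIFT);
	return pmd;
}

/* The three exclusive helpers: set, test, clear. */
static inline model_pmd_t model_pmd_swp_mkexclusive(model_pmd_t pmd)
{
	pmd.val |= MODEL_SWP_EXCLUSIVE;
	return pmd;
}

static inline bool model_pmd_swp_exclusive(model_pmd_t pmd)
{
	return (pmd.val & MODEL_SWP_EXCLUSIVE) != 0;
}

static inline model_pmd_t model_pmd_swp_clear_exclusive(model_pmd_t pmd)
{
	pmd.val &= ~MODEL_SWP_EXCLUSIVE;
	return pmd;
}

/* Decode: strip every overlay bit first, then treat the rest as the entry. */
static inline model_leaf_t model_leaf_from_pmd(model_pmd_t pmd)
{
	model_leaf_t leaf;

	leaf.val = pmd.val & ~MODEL_SWP_OVERLAY;
	return leaf;
}

static inline unsigned int model_leaf_type(model_leaf_t leaf)
{
	return (unsigned int)((leaf.val >> MODEL_TYPE_SHIFT) & MODEL_TYPE_MASK);
}

static inline uint64_t model_leaf_offset(model_leaf_t leaf)
{
	return leaf.val >> MODEL_OFFSET_SHIFT;
}
```

In this model, decoding a PMD and decoding the same PMD after
model_pmd_swp_mkexclusive() yield identical leaf values, which is the
property the softleaf_from_pmd() change is meant to guarantee.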
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/include/asm/pgtable.h | 4 ++++
arch/loongarch/include/asm/pgtable.h | 17 ++++++++++++++
arch/powerpc/include/asm/book3s/64/pgtable.h | 15 ++++++++++++
arch/riscv/include/asm/pgtable.h | 15 ++++++++++++
arch/s390/include/asm/pgtable.h | 15 ++++++++++++
arch/x86/include/asm/pgtable.h | 15 ++++++++++++
include/linux/leafops.h | 24 ++++++++++++++++----
7 files changed, 100 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 623099303c7b..2f0d95ce341d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -601,6 +601,10 @@ static inline int pmd_protnone(pmd_t pmd)
#define pmd_swp_clear_uffd_wp(pmd) \
pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+#define pmd_swp_exclusive(pmd) pte_swp_exclusive(pmd_pte(pmd))
+#define pmd_swp_mkexclusive(pmd) pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)))
+#define pmd_swp_clear_exclusive(pmd) \
+ pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)))
#define pmd_write(pmd) pte_write(pmd_pte(pmd))
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 2a0b63ae421f..33bdfa1e8bbb 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -357,6 +357,23 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return pte;
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ pmd_val(pmd) |= _PAGE_SWP_EXCLUSIVE;
+ return pmd;
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ pmd_val(pmd) &= ~_PAGE_SWP_EXCLUSIVE;
+ return pmd;
+}
+
#define pte_none(pte) (!(pte_val(pte) & ~_PAGE_GLOBAL))
#define pte_present(pte) (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE))
#define pte_no_exec(pte) (pte_val(pte) & _PAGE_NO_EXEC)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6f30aa8a6490..e8467ea4f4de 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -699,6 +699,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE));
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return __pmd_raw(pmd_raw(pmd) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return __pmd_raw(pmd_raw(pmd) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE));
+}
+
static inline bool check_pte_access(unsigned long access, unsigned long ptev)
{
/*
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 52cfd7df228b..0717b514a615 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -920,6 +920,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return pte_swp_exclusive(pmd_pte(pmd));
+}
+
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
+}
+
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline bool pmd_soft_dirty(pmd_t pmd)
{
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 83d4516825f0..88f2465fc482 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -870,6 +870,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return set_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return clear_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
static inline int pte_soft_dirty(pte_t pte)
{
return pte_val(pte) & _PAGE_SOFT_DIRTY;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6efc7980c95a..c5c273bfcd04 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1517,6 +1517,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
}
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pmd_swp_exclusive(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 88888daeb018..988e59c6fa8a 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -102,6 +102,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
pmd = pmd_swp_clear_soft_dirty(pmd);
if (pmd_swp_uffd_wp(pmd))
pmd = pmd_swp_clear_uffd_wp(pmd);
+ if (pmd_swp_exclusive(pmd))
+ pmd = pmd_swp_clear_exclusive(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
/* Temporary until swp_entry_t eliminated. */
@@ -634,18 +636,30 @@ static inline bool pmd_is_migration_entry(pmd_t pmd)
*/
static inline bool softleaf_is_valid_pmd_entry(softleaf_t entry)
{
- /* Only device private, migration entries valid for PMD. */
+ /* Device private, migration, and swap entries valid for PMD. */
return softleaf_is_device_private(entry) ||
- softleaf_is_migration(entry);
+ softleaf_is_migration(entry) ||
+ softleaf_is_swap(entry);
+}
+
+/**
+ * pmd_is_swap_entry() - Does this PMD entry encode an actual swap entry?
+ * @pmd: PMD entry.
+ *
+ * Returns: true if the PMD encodes a swap entry, otherwise false.
+ */
+static inline bool pmd_is_swap_entry(pmd_t pmd)
+{
+ return softleaf_is_swap(softleaf_from_pmd(pmd));
}
/**
* pmd_is_valid_softleaf() - Is this PMD entry a valid softleaf entry?
* @pmd: PMD entry.
*
- * PMD leaf entries are valid only if they are device private or migration
- * entries. This function asserts that a PMD leaf entry is valid in this
- * respect.
+ * PMD leaf entries are valid only if they are device private, migration,
+ * or swap entries. This function asserts that a PMD leaf entry is valid
+ * in this respect.
*
* Returns: true if the PMD entry is a valid leaf entry, otherwise false.
*/
--
2.52.0
* [v2 08/16] mm: add PMD swap entry splitting support
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (6 preceding siblings ...)
2026-06-02 14:24 ` [v2 07/16] mm: add PMD swap entry detection support Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 09/16] mm: handle PMD swap entries in fork path Usama Arif
` (8 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Add a swap branch in __split_huge_pmd_locked() that splits a PMD swap
entry into 512 PTE swap entries. Unlike migration splits, no folio
reference is needed because swap entries point to swap slots, not
pages. Each PTE inherits the correct sub-slot offset and preserves
soft_dirty, uffd_wp, and exclusive flags.
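As a rough illustration of the split, here is a userspace C model.
struct model_swp, MODEL_HPAGE_PMD_NR and model_split_swap_pmd() are
hypothetical names, and the flags are plain booleans rather than
arch-specific bits. Entry i receives offset base + i, and all three
flags are copied unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_HPAGE_PMD_NR 512	/* PTEs per PMD with 4 KB pages and 2 MB THPs */

/* Hypothetical decoded swap entry plus its overlay flags. */
struct model_swp {
	unsigned int type;	/* which swap device */
	uint64_t offset;	/* slot within that device */
	bool soft_dirty;
	bool uffd_wp;
	bool exclusive;
};

/*
 * Fan one PMD-level swap entry out into nr PTE-level entries.
 * No page or folio is involved: the entries only name swap slots.
 */
static void model_split_swap_pmd(const struct model_swp *pmd,
				 struct model_swp *ptes, size_t nr)
{
	assert(nr == MODEL_HPAGE_PMD_NR);

	for (size_t i = 0; i < nr; i++) {
		ptes[i] = *pmd;			/* inherit type and flags */
		ptes[i].offset = pmd->offset + i;	/* i-th sub-slot */
	}
}
```

A PMD entry at offset 4096 therefore becomes PTE entries covering
offsets 4096 through 4607, each carrying the same soft_dirty, uffd_wp
and exclusive state as the original PMD.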
This branch is reached from the explicit __split_huge_pmd() callers
that hit a non-present PMD: partial-range mprotect / munmap, the
wp_huge_pmd() PMD-COW fallback, and the swap-in / swapoff fallbacks
added in later patches when the cached folio is no longer PMD-sized.
page_vma_mapped_walk() does not iterate PMD swap entries, so
try_to_unmap_one() and try_to_migrate_one() never reach this branch
and freeze=true cannot occur here today. page and folio are therefore
left uninitialized in the swap branch; a VM_WARN_ON_ONCE(freeze)
catches any future caller that breaks this invariant before the
freeze path evaluates page_to_pfn(page + i) or calls put_page(page).
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index af6a9c20131a..7cb1afde46e1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3144,6 +3144,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
vma, haddr, rmap_flags);
}
+ } else if (pmd_is_swap_entry(*pmd)) {
+ VM_WARN_ON_ONCE(freeze);
+ old_pmd = *pmd;
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ anon_exclusive = pmd_swp_exclusive(old_pmd);
} else {
/*
* Up to this point the pmd is present and huge and userland has
@@ -3280,6 +3286,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
+ } else if (pmd_is_swap_entry(old_pmd)) {
+ softleaf_t sl_entry = softleaf_from_pmd(old_pmd);
+ pte_t swp_pte;
+ swp_entry_t sub_entry;
+
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR;
+ i++, addr += PAGE_SIZE) {
+ sub_entry = swp_entry(swp_type(sl_entry),
+ swp_offset(sl_entry) + i);
+ swp_pte = swp_entry_to_pte(sub_entry);
+ if (soft_dirty)
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (uffd_wp)
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ if (anon_exclusive)
+ swp_pte = pte_swp_mkexclusive(swp_pte);
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+ set_pte_at(mm, addr, pte + i, swp_pte);
+ }
} else {
pte_t entry;
@@ -3303,7 +3328,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
pte_unmap(pte);
- if (!pmd_is_migration_entry(*pmd))
+ if (!pmd_is_migration_entry(*pmd) && !pmd_is_swap_entry(*pmd))
folio_remove_rmap_pmd(folio, page, vma);
if (freeze)
put_page(page);
--
2.52.0
* [v2 09/16] mm: handle PMD swap entries in fork path
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (7 preceding siblings ...)
2026-06-02 14:24 ` [v2 08/16] mm: add PMD swap entry splitting support Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 10/16] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
` (7 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Teach copy_huge_pmd()/copy_huge_non_present_pmd() about swap entries,
mirroring copy_nonpresent_pte().
swap_dup_entry_direct() gains a nr parameter (and is renamed to
swap_dup_entries_direct()) so it can duplicate a contiguous range of
swap slots in one call, matching the existing
swap_put_entries_direct(entry, nr) API. Existing callers pass 1.
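The shape of a ranged duplicate can be sketched as follows. This is a
simplified userspace model, not the swapfile.c implementation:
model_swap_dup_range() and the flat unsigned char count array are
hypothetical, and a saturated counter is reported as -ENOMEM only to
stand in for the case where the real code would need an extend table.
The check-then-commit structure is also an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Take one extra reference on nr contiguous slots starting at first.
 * Validate the whole range before touching any count, so a failure
 * leaves every slot unchanged.
 */
static int model_swap_dup_range(unsigned char *counts, size_t nslots,
				size_t first, size_t nr)
{
	if (nr == 0 || first >= nslots || nr > nslots - first)
		return -EINVAL;		/* range falls outside the device */

	for (size_t i = 0; i < nr; i++) {
		if (counts[first + i] == 0)
			return -EINVAL;	/* slot is not in use */
		if (counts[first + i] == UCHAR_MAX)
			return -ENOMEM;	/* model: counter cannot grow */
	}

	for (size_t i = 0; i < nr; i++)
		counts[first + i]++;

	return 0;
}
```

With this interface the PTE path passes nr = 1 and the PMD path passes
512, so both share one entry point instead of looping over single-slot
calls.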
copy_huge_non_present_pmd() copies PMD swap entries during fork
instead of splitting them, preserving the THP. This mirrors
copy_nonpresent_pte(), which duplicates the swap slot refcount,
clears the exclusive bit on the source, and adds the destination
mm to mmlist. If swap_dup_entries_direct() fails with -ENOMEM (the
GFP_ATOMIC extend table allocation failed), copy_huge_pmd() calls
swap_retry_table_alloc() with GFP_KERNEL and retries, matching the
PTE retry in copy_pte_range(). The PMD is stable across the retry
because dup_mmap() holds write mmap_lock on both mm_structs.
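The retry flow above follows a common pattern: attempt the operation
with a non-blocking allocation while locks are held, and on -ENOMEM
drop the locks, allocate with a blocking call, then retry. A minimal
userspace sketch, assuming hypothetical model_* names, a boolean knob
that forces the non-blocking allocation to fail, and calloc() standing
in for both allocation modes:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_cluster {
	int *extend_table;	/* NULL until allocated */
	bool atomic_alloc_ok;	/* knob: does the non-blocking alloc succeed? */
	bool locked;		/* models the page table locks being held */
};

/* Runs with locks held, so it may only use a non-blocking allocation. */
static int model_dup_locked(struct model_cluster *c)
{
	assert(c->locked);
	if (!c->extend_table) {
		if (!c->atomic_alloc_ok)
			return -ENOMEM;
		c->extend_table = calloc(512, sizeof(int));
		if (!c->extend_table)
			return -ENOMEM;
	}
	c->extend_table[0]++;
	return 0;
}

/* Runs with no locks held, so it may block. Returns 0 on success. */
static int model_retry_table_alloc(struct model_cluster *c)
{
	assert(!c->locked);
	if (!c->extend_table)
		c->extend_table = calloc(512, sizeof(int));
	return c->extend_table ? 0 : -ENOMEM;
}

static int model_copy_with_retry(struct model_cluster *c)
{
	int ret;

retry:
	c->locked = true;		/* take the locks */
	ret = model_dup_locked(c);
	c->locked = false;		/* drop them before any blocking work */

	/* Only retry when the blocking allocation actually succeeded. */
	if (ret == -ENOMEM && !model_retry_table_alloc(c))
		goto retry;

	return ret;
}
```

The sketch terminates because a successful blocking allocation makes
the next locked attempt succeed, while a failed one returns -ENOMEM
without looping. In the patch, re-reading the PMD after the retry is
safe for the reason given in the commit message: dup_mmap() holds the
mmap_lock for write on both mm_structs.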
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/swap.h | 4 ++--
mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++-----
mm/memory.c | 2 +-
mm/swapfile.c | 7 +++---
4 files changed, 53 insertions(+), 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..8a5ec5f0a7c7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,7 +458,7 @@ sector_t swap_folio_sector(struct folio *folio);
* All entries must be allocated by folio_alloc_swap(). And they must have
* a swap count > 1. See comments of folio_*_swap helpers for more info.
*/
-int swap_dup_entry_direct(swp_entry_t entry);
+int swap_dup_entries_direct(swp_entry_t entry, int nr);
void swap_put_entries_direct(swp_entry_t entry, int nr);
/*
@@ -502,7 +502,7 @@ static inline void free_swap_cache(struct folio *folio)
{
}
-static inline int swap_dup_entry_direct(swp_entry_t ent)
+static inline int swap_dup_entries_direct(swp_entry_t ent, int nr)
{
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7cb1afde46e1..a525417d13f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1806,7 +1806,7 @@ bool touch_pmd(struct vm_area_struct *vma, unsigned long addr,
return false;
}
-static void copy_huge_non_present_pmd(
+static int copy_huge_non_present_pmd(
struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
@@ -1852,14 +1852,35 @@ static void copy_huge_non_present_pmd(
*/
folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
dst_vma, src_vma);
+ } else if (softleaf_is_swap(entry)) {
+ int err;
+
+ /*
+ * PMD swap entry: duplicate swap references and clear
+ * exclusive on source, matching copy_nonpresent_pte().
+ */
+ err = swap_dup_entries_direct(entry, HPAGE_PMD_NR);
+ if (err < 0)
+ return err;
+
+ mm_prepare_for_swap_entries(dst_mm);
+
+ if (pmd_swp_exclusive(pmd)) {
+ pmd = pmd_swp_clear_exclusive(pmd);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
}
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (softleaf_is_swap(entry))
+ add_mm_counter(dst_mm, MM_SWAPENTS, HPAGE_PMD_NR);
+ else
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
if (!userfaultfd_wp(dst_vma))
pmd = pmd_swp_clear_uffd_wp(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ return 0;
}
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -1900,6 +1921,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pgtable))
goto out;
+retry:
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -1907,10 +1929,28 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
- if (unlikely(thp_migration_supported() &&
- pmd_is_valid_softleaf(pmd))) {
- copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
- dst_vma, src_vma, pmd, pgtable);
+ if (unlikely(pmd_is_valid_softleaf(pmd))) {
+ ret = copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+ addr, dst_vma, src_vma, pmd,
+ pgtable);
+ if (ret) {
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ /*
+ * For PMD swap entries, -ENOMEM means the per-cluster
+ * swap-extend table couldn't be GFP_ATOMIC-allocated.
+ * Try the GFP_KERNEL fallback once before giving up.
+ */
+ if (ret == -ENOMEM) {
+ softleaf_t entry = softleaf_from_pmd(pmd);
+
+ if (softleaf_is_swap(entry) &&
+ !swap_retry_table_alloc(entry, GFP_KERNEL))
+ goto retry;
+ }
+ pte_free(dst_mm, pgtable);
+ goto out;
+ }
ret = 0;
goto out_unlock;
}
diff --git a/mm/memory.c b/mm/memory.c
index 137f34c3fd32..5cf02e394c92 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -950,7 +950,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *page;
if (likely(softleaf_is_swap(entry))) {
- if (swap_dup_entry_direct(entry) < 0)
+ if (swap_dup_entries_direct(entry, 1) < 0)
return -EIO;
mm_prepare_for_swap_entries(dst_mm);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3d126602a1e..37408905490e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3899,8 +3899,9 @@ void si_swapinfo(struct sysinfo *val)
}
/*
- * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
+ * swap_dup_entries_direct() - Increase reference count of swap entries by one.
* @entry: first swap entry from which we want to increase the refcount.
+ * @nr: number of contiguous swap entries to duplicate.
*
* Returns 0 for success, or -ENOMEM if the extend table is required
* but could not be atomically allocated. Returns -EINVAL if the swap
@@ -3912,7 +3913,7 @@ void si_swapinfo(struct sysinfo *val)
* Also the swap entry must have a count >= 1. Otherwise folio_dup_swap should
* be used.
*/
-int swap_dup_entry_direct(swp_entry_t entry)
+int swap_dup_entries_direct(swp_entry_t entry, int nr)
{
struct swap_info_struct *si;
@@ -3929,7 +3930,7 @@ int swap_dup_entry_direct(swp_entry_t entry)
*/
VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry));
- return swap_dup_entries_cluster(si, swp_offset(entry), 1);
+ return swap_dup_entries_cluster(si, swp_offset(entry), nr);
}
#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
--
2.52.0
* [v2 10/16] mm: swap in PMD swap entries as whole THPs during swapoff
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (8 preceding siblings ...)
2026-06-02 14:24 ` [v2 09/16] mm: handle PMD swap entries in fork path Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 11/16] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
` (6 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Add unuse_pmd() and unuse_pmd_entry(), and call the latter from
unuse_pmd_range() to swap in PMD-level swap entries as whole THPs
during swapoff. This mirrors the existing unuse_pte_range() but
operates at PMD granularity.
If the PMD-order folio cannot be allocated, the cached folio is no
longer PMD-sized (e.g. split in the swap cache by
deferred_split_scan() or memory_failure() while the PMD swap entry
was installed), or the folio is not uptodate, the PMD swap entry is
split into PTE-level entries via __split_huge_pmd() and a non-zero
error is returned so unuse_pmd_range() falls through to
unuse_pte_range(), which handles the individual entries at order-0.
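The fallback described above can be sketched as a small userspace
model. This is an illustration only, under stated assumptions: every
model_* name and the struct layout are hypothetical stand-ins for the
kernel's PMD, folio and swap state, and the model collapses all failure
reasons into a single non-zero return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the swapoff fallback; not kernel code. */
enum { MODEL_HPAGE_PMD_NR = 512 };

struct model_pmd {
	bool pmd_swap_entry;   /* slot currently holds a PMD-level swap entry */
	bool folio_available;  /* a PMD-order folio could be allocated/read */
	bool folio_pmd_sized;  /* the cached folio still spans all 512 slots */
	bool folio_uptodate;   /* the folio contents are valid */
	bool huge_mapped;      /* result: range mapped as one THP */
	int pte_swap_entries;  /* PTE-level swap entries under this PMD */
	int pages_mapped;      /* result: base pages mapped */
};

/* Models splitting a PMD swap entry into 512 PTE swap entries. */
static void model_split_pmd(struct model_pmd *p)
{
	p->pmd_swap_entry = false;
	p->pte_swap_entries = MODEL_HPAGE_PMD_NR;
}

/* Models unuse_pmd_entry(): 0 on success, non-zero after splitting. */
static int model_unuse_pmd_entry(struct model_pmd *p)
{
	assert(p->pmd_swap_entry);
	if (!p->folio_available || !p->folio_pmd_sized || !p->folio_uptodate) {
		model_split_pmd(p);
		return -1;
	}
	p->pmd_swap_entry = false;
	p->huge_mapped = true;
	p->pages_mapped = MODEL_HPAGE_PMD_NR;
	return 0;
}

/* Models unuse_pte_range(): swap the entries in one base page at a time. */
static void model_unuse_pte_range(struct model_pmd *p)
{
	p->pages_mapped += p->pte_swap_entries;
	p->pte_swap_entries = 0;
}

/* Models one loop iteration of unuse_pmd_range(). */
static void model_unuse_pmd_range(struct model_pmd *p)
{
	if (p->pmd_swap_entry && model_unuse_pmd_entry(p) == 0)
		return; /* whole THP mapped, skip the PTE walk */
	model_unuse_pte_range(p);
}
```

In this model both outcomes end with 512 pages mapped; the only
difference is whether huge_mapped is set, which is the property the
patch tries to preserve.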
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/swapfile.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 145 insertions(+)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 37408905490e..56454e486324 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
#include <linux/suspend.h>
#include <linux/zswap.h>
#include <linux/plist.h>
+#include <linux/huge_mm.h>
#include <asm/tlbflush.h>
#include <linux/leafops.h>
@@ -2641,6 +2642,138 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
return 0;
}
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio. Returns 0 on success
+ * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse will
+ * rescan). Returns -EIO if the folio is not uptodate; in that case the
+ * PMD is split so unuse_pte_range() can handle individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr, softleaf_t entry,
+ struct folio *folio)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ pmd_t new_pmd, old_pmd;
+ spinlock_t *ptl;
+ rmap_t rmap_flags = RMAP_NONE;
+ bool exclusive;
+
+ if (unlikely(!folio_matches_swap_entry(folio, entry)))
+ return -EAGAIN;
+
+ if (unlikely(!folio_test_uptodate(folio))) {
+ __split_huge_pmd(vma, pmd, addr, false);
+ return -EIO;
+ }
+
+ page = folio_page(folio, 0);
+
+ ptl = pmd_lock(mm, pmd);
+ old_pmd = pmdp_get(pmd);
+
+ if (!pmd_is_swap_entry(old_pmd) ||
+ softleaf_from_pmd(old_pmd).val != entry.val) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+
+ exclusive = pmd_swp_exclusive(old_pmd);
+
+ /*
+ * Some architectures may have to restore extra metadata to the folio
+ * when reading from swap. This metadata may be indexed by swap entry
+ * so this must be called before folio_put_swap().
+ */
+ arch_swap_restore(folio_swap(entry, folio), folio);
+
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+ new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ new_pmd = pmd_mkold(new_pmd);
+ if (pmd_swp_soft_dirty(old_pmd))
+ new_pmd = pmd_mksoft_dirty(new_pmd);
+ if (pmd_swp_uffd_wp(old_pmd))
+ new_pmd = pmd_mkuffd_wp(new_pmd);
+
+ if (exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+
+ folio_get(folio);
+ if (!folio_test_anon(folio))
+ folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+ else
+ folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+ set_pmd_at(mm, addr, pmd, new_pmd);
+ folio_put_swap(folio, NULL);
+
+ spin_unlock(ptl);
+
+ folio_free_swap(folio);
+ return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * Returns -ENOMEM if the PMD-order folio could not be allocated, charged
+ * or read in, or -EAGAIN if the cached folio is no longer PMD-sized; in
+ * both cases the PMD is split here so the caller can fall back to
+ * unuse_pte_range(). Otherwise propagates the error from unuse_pmd()
+ * (-EIO or -EAGAIN).
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr, softleaf_t entry)
+{
+ struct folio *folio;
+ int ret;
+
+ folio = swap_cache_get_folio(entry);
+ if (!folio) {
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = addr,
+ .real_address = addr,
+ .pmd = pmd,
+ };
+
+ folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+ BIT(HPAGE_PMD_ORDER), &vmf, NULL, 0);
+ if (IS_ERR_OR_NULL(folio)) {
+ ret = -ENOMEM;
+ goto split_fallback;
+ }
+ }
+
+ folio_lock(folio);
+ folio_wait_writeback(folio);
+ /*
+ * If the cached folio is no longer PMD-sized (e.g. split in the
+ * swap cache by deferred_split_scan() or memory_failure() while
+ * the PMD swap entry was installed), the PMD swap entry no longer
+ * maps a single contiguous folio. Split the PMD swap entry so
+ * unuse_pte_range() can swap the per-slot folios in individually.
+ */
+ if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+ folio_unlock(folio);
+ folio_put(folio);
+ ret = -EAGAIN;
+ goto split_fallback;
+ }
+ ret = unuse_pmd(vma, pmd, addr, entry, folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ return ret;
+
+split_fallback:
+ __split_huge_pmd(vma, pmd, addr, false);
+ return ret;
+}
+
static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
unsigned int type)
@@ -2653,6 +2786,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
do {
cond_resched();
next = pmd_addr_end(addr, end);
+
+ pmd_t pmdval = pmdp_get(pmd);
+
+ if (pmd_is_swap_entry(pmdval)) {
+ softleaf_t sl = softleaf_from_pmd(pmdval);
+
+ if (swp_type(sl) == type) {
+ if (!unuse_pmd_entry(vma, pmd, addr, sl))
+ continue;
+ }
+ }
+
ret = unuse_pte_range(vma, pmd, addr, next, type);
if (ret)
return ret;
--
2.52.0
* [v2 11/16] mm: handle PMD swap entries in non-present PMD walkers
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (9 preceding siblings ...)
2026-06-02 14:24 ` [v2 10/16] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 12/16] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
` (5 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Teach the remaining non-present PMD walkers about swap entries,
mirroring the PTE-level equivalents.
smaps_pmd_entry() accounts swap and swap_pss via a new shared
smaps_account_swap() helper used by both PTE and PMD paths.
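The arithmetic performed by the shared helper can be checked in
userspace. The sketch below is a hedged illustration: model_swap_pss()
and MODEL_PSS_SHIFT are hypothetical names, the shift value of 12 is
assumed to match PSS_SHIFT in fs/proc/task_mmu.c, and the PMD size is
assumed to be 2 MB.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror PSS_SHIFT in fs/proc/task_mmu.c. */
#define MODEL_PSS_SHIFT 12

/*
 * Fixed-point swap_pss contribution of `size` bytes of swap whose slots
 * are referenced `mapcount` times. Mirrors the branch structure of the
 * helper: divide only when the entry is shared.
 */
static uint64_t model_swap_pss(uint64_t size, int mapcount)
{
	uint64_t pss = size << MODEL_PSS_SHIFT;

	assert(mapcount >= 0);
	if (mapcount >= 2)
		pss /= (uint64_t)mapcount;
	return pss;
}
```

For example, a 2 MB PMD swap entry shared by two mappings contributes
model_swap_pss(2097152, 2) >> MODEL_PSS_SHIFT, that is 1 MB, to each
mapping's SwapPss.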
zap_huge_pmd() frees swap slots via swap_put_entries_direct(),
matching zap_nonpresent_ptes().
change_non_present_huge_pmd() skips write-permission changes for swap
entries and only updates uffd_wp, matching change_softleaf_pte().
move_soft_dirty_pmd(), clear_soft_dirty_pmd(), make_uffd_wp_pmd(),
pagemap_pmd_range_thp() and change_huge_pmd() handle swap entries
alongside migration entries.
madvise_cold_or_pageout_pte_range() extends its non-present PMD
VM_BUG_ON to allow swap entries; without this, hitting a PMD swap
entry on a DEBUG_VM kernel would BUG().
mincore_pte_range() routes the pmd_trans_huge_lock() branch through
mincore_swap() for non-present PMDs, matching how the PTE path
already calls mincore_swap() for non-present PTEs. Without this a
swapped-out PMD-mapped THP would be reported as resident, because
pmd_is_huge() (and therefore pmd_trans_huge_lock()) accepts any
non-present non-none PMD and the old branch unconditionally did
memset(vec, 1, nr). mincore_swap() returns 1 for migration /
device-private entries (preserving the prior behavior for those)
and checks swap-cache residency for swap entries.
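The residency decision in the paragraph above can be modelled as
follows. This is a sketch under assumptions: the enum and the model_*
functions are hypothetical, and swap-cache residency is reduced to a
single boolean rather than the lookup mincore_swap() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical classification of what a huge PMD slot may hold. */
enum model_pmd_kind {
	MODEL_PMD_PRESENT,
	MODEL_PMD_MIGRATION,
	MODEL_PMD_DEVICE_PRIVATE,
	MODEL_PMD_SWAP,
};

/* One residency byte for the whole PMD range. */
static unsigned char model_mincore_pmd(enum model_pmd_kind kind,
				       bool in_swap_cache)
{
	/* Only a swap entry can report "not resident". */
	if (kind == MODEL_PMD_SWAP)
		return in_swap_cache ? 1 : 0;
	return 1;
}

/* The PMD covers every slot, so all `nr` bytes get the same answer. */
static void model_mincore_fill(unsigned char *vec, size_t nr,
			       enum model_pmd_kind kind, bool in_swap_cache)
{
	assert(vec != NULL);
	memset(vec, model_mincore_pmd(kind, in_swap_cache), nr);
}
```

With the old unconditional memset(vec, 1, nr), the MODEL_PMD_SWAP case
without a swap-cache folio would also have produced 1, which is the
misreport the patch fixes.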
queue_folios_pmd() in mempolicy silently skips swap entries, matching
the PTE walker which only counts migration entries as failures.
Without this, mbind(MPOL_MF_STRICT) would spuriously return -EIO on
a swapped-out THP.
madvise_free_huge_pmd() handles PMD swap entries directly: for a
full-range MADV_FREE it clears the PMD, frees the deposited page
table, and releases the swap slots; for a partial range it splits to
PTE swap entries. Without this, MADV_FREE silently becomes a no-op
on swapped-out THPs, leaking swap slots.
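A minimal userspace model of the full-range versus partial-range
decision follows. All names are hypothetical, the PMD size is assumed to
be 2 MB, and the deposited page table is reduced to a counter; the
fate of the deposited table on a split is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_HPAGE_PMD_NR = 512 };
#define MODEL_HPAGE_PMD_SIZE (2UL * 1024 * 1024) /* assumed 2 MB */

struct model_mm {
	bool pmd_swap_entry;    /* PMD slot holds a swap entry */
	int pte_swap_entries;   /* PTE swap entries after a split */
	int deposited_tables;   /* page tables deposited under the PMD */
	long swapents;          /* models the MM_SWAPENTS counter */
	long freed_swap_slots;  /* swap slots released so far */
};

/* Returns true only if the whole PMD swap entry was freed. */
static bool model_madvise_free_swap_pmd(struct model_mm *mm,
					unsigned long addr,
					unsigned long next)
{
	assert(next > addr);
	if (!mm->pmd_swap_entry)
		return false;

	if (next - addr != MODEL_HPAGE_PMD_SIZE) {
		/* Partial range: split, let the PTE walker do the rest. */
		mm->pmd_swap_entry = false;
		mm->pte_swap_entries = MODEL_HPAGE_PMD_NR;
		return false;
	}

	/* Full range: clear the PMD, drop the table, release the slots. */
	mm->pmd_swap_entry = false;
	mm->deposited_tables--;
	mm->freed_swap_slots += MODEL_HPAGE_PMD_NR;
	mm->swapents -= MODEL_HPAGE_PMD_NR;
	return true;
}
```

The partial case leaves swapents untouched in this model because the
512 PTE swap entries still reference the same slots.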
check_pmd_state() in khugepaged returns SCAN_PMD_MAPPED for PMD swap
entries, treating a swapped-out THP as still being a THP from
khugepaged's perspective and matching the existing migration-entry
handling.
hmm_vma_handle_absent_pmd() faults in PMD swap entries via
hmm_vma_fault() instead of returning -EFAULT. The first per-page
handle_mm_fault() call triggers do_huge_pmd_swap_page(), which maps
the entire folio; subsequent calls reduce to harmless
huge_pmd_set_accessed() calls, and the walker retries with a present PMD.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/proc/task_mmu.c | 43 +++++++++++++++++++++-------------
mm/hmm.c | 3 ++-
mm/huge_memory.c | 58 +++++++++++++++++++++++++++++++++++-----------
mm/khugepaged.c | 6 +++++
mm/madvise.c | 5 ++--
mm/mempolicy.c | 2 ++
mm/mincore.c | 14 ++++++++++-
7 files changed, 98 insertions(+), 33 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1fb5acd88ad0..f85899eec80f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1046,6 +1046,23 @@ static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
#endif
}
+static void smaps_account_swap(struct mem_size_stats *mss,
+ softleaf_t entry, unsigned long size)
+{
+ int mapcount;
+
+ mss->swap += size;
+ mapcount = swp_swapcount(entry);
+ if (mapcount >= 2) {
+ u64 pss_delta = (u64)size << PSS_SHIFT;
+
+ do_div(pss_delta, mapcount);
+ mss->swap_pss += pss_delta;
+ } else {
+ mss->swap_pss += (u64)size << PSS_SHIFT;
+ }
+}
+
static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct mm_walk *walk)
{
@@ -1067,18 +1084,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
const softleaf_t entry = softleaf_from_pte(ptent);
if (softleaf_is_swap(entry)) {
- int mapcount;
-
- mss->swap += PAGE_SIZE;
- mapcount = swp_swapcount(entry);
- if (mapcount >= 2) {
- u64 pss_delta = (u64)PAGE_SIZE << PSS_SHIFT;
-
- do_div(pss_delta, mapcount);
- mss->swap_pss += pss_delta;
- } else {
- mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
- }
+ smaps_account_swap(mss, entry, PAGE_SIZE);
} else if (softleaf_has_pfn(entry)) {
if (softleaf_is_device_private(entry))
present = true;
@@ -1108,9 +1114,13 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
if (pmd_present(*pmd)) {
page = vm_normal_page_pmd(vma, addr, *pmd);
present = true;
- } else if (unlikely(thp_migration_supported())) {
+ } else {
const softleaf_t entry = softleaf_from_pmd(*pmd);
+ if (softleaf_is_swap(entry)) {
+ smaps_account_swap(mss, entry, HPAGE_PMD_SIZE);
+ return;
+ }
if (softleaf_has_pfn(entry))
page = softleaf_to_page(entry);
}
@@ -1752,7 +1762,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
pmd = pmd_clear_soft_dirty(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
- } else if (pmd_is_migration_entry(pmd)) {
+ } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
pmd = pmd_swp_clear_soft_dirty(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
@@ -2112,7 +2122,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
flags |= PM_UFFD_WP;
if (pm->show_pfn)
frame = pmd_pfn(pmd) + idx;
- } else if (thp_migration_supported()) {
+ } else if (pmd_is_swap_entry(pmd) ||
+ (thp_migration_supported() && pmd_is_migration_entry(pmd))) {
const softleaf_t entry = softleaf_from_pmd(pmd);
unsigned long offset;
@@ -2550,7 +2561,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
old = pmdp_invalidate_ad(vma, addr, pmdp);
pmd = pmd_mkuffd_wp(old);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
- } else if (pmd_is_migration_entry(pmd)) {
+ } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
pmd = pmd_swp_mkuffd_wp(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
diff --git a/mm/hmm.c b/mm/hmm.c
index cabf111f2ed2..b5fa7549c183 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -370,7 +370,8 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
npages, 0);
if (required_fault) {
- if (softleaf_is_device_private(entry))
+ if (softleaf_is_device_private(entry) ||
+ softleaf_is_swap(entry))
return hmm_vma_fault(addr, end, required_fault, walk);
else
return -EFAULT;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a525417d13f6..1d6d3817046d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2314,6 +2314,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pte_free(mm, pgtable);
+ mm_dec_nr_ptes(mm);
+}
/*
* Return true if we do MADV_FREE successfully on entire pmd page.
* Otherwise, return false.
@@ -2338,8 +2346,23 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
goto out;
if (unlikely(!pmd_present(orig_pmd))) {
+ if (pmd_is_swap_entry(orig_pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE) {
+ spin_unlock(ptl);
+ __split_huge_pmd(vma, pmd, addr, false);
+ goto out_unlocked;
+ }
+ softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+ pmdp_huge_get_and_clear(mm, addr, pmd);
+ zap_deposited_table(mm, pmd);
+ spin_unlock(ptl);
+ swap_put_entries_direct(sl, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+ return true;
+ }
VM_BUG_ON(thp_migration_supported() &&
- !pmd_is_migration_entry(orig_pmd));
+ !pmd_is_migration_entry(orig_pmd));
goto out;
}
@@ -2388,15 +2411,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
}
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
- pgtable_t pgtable;
-
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pte_free(mm, pgtable);
- mm_dec_nr_ptes(mm);
-}
-
static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t pmdval, struct folio *folio, bool is_present)
{
@@ -2489,6 +2503,16 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
arch_check_zapped_pmd(vma, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ if (pmd_is_swap_entry(orig_pmd)) {
+ softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+ zap_deposited_table(mm, pmd);
+ spin_unlock(ptl);
+ swap_put_entries_direct(sl, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+ return true;
+ }
+
is_present = pmd_present(orig_pmd);
folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
@@ -2521,7 +2545,8 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
static pmd_t move_soft_dirty_pmd(pmd_t pmd)
{
if (pgtable_supports_soft_dirty()) {
- if (unlikely(pmd_is_migration_entry(pmd)))
+ if (unlikely(pmd_is_migration_entry(pmd) ||
+ pmd_is_swap_entry(pmd)))
pmd = pmd_swp_mksoft_dirty(pmd);
else if (pmd_present(pmd))
pmd = pmd_mksoft_dirty(pmd);
@@ -2601,7 +2626,14 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
pmd_t newpmd;
VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
- if (softleaf_is_migration_write(entry)) {
+
+ /*
+ * PMD swap entries don't encode write permission in the entry type,
+ * so only uffd_wp flag changes apply. No folio lookup needed.
+ */
+ if (softleaf_is_swap(entry)) {
+ newpmd = *pmd;
+ } else if (softleaf_is_migration_write(entry)) {
const struct folio *folio = softleaf_to_folio(entry);
/*
@@ -2660,7 +2692,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (!ptl)
return 0;
- if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) {
+ if (pmd_is_valid_softleaf(*pmd)) {
change_non_present_huge_pmd(mm, addr, pmd, uffd_wp,
uffd_wp_resolve);
goto unlock;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8ffb47f1e845..bb63700519ab 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1111,6 +1111,12 @@ static inline enum scan_result check_pmd_state(pmd_t *pmd)
*/
if (pmd_is_migration_entry(pmde))
return SCAN_PMD_MAPPED;
+ /*
+ * A PMD-mapped THP that has been swapped out is still a THP from
+ * khugepaged's perspective; treat it like a present huge PMD.
+ */
+ if (pmd_is_swap_entry(pmde))
+ return SCAN_PMD_MAPPED;
if (!pmd_present(pmde))
return SCAN_NO_PTE_TABLE;
if (pmd_trans_huge(pmde))
diff --git a/mm/madvise.c b/mm/madvise.c
index cd9bb077072c..00539022f804 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -390,7 +390,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (unlikely(!pmd_present(orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
- !pmd_is_migration_entry(orig_pmd));
+ !pmd_is_migration_entry(orig_pmd) &&
+ !pmd_is_swap_entry(orig_pmd));
goto huge_unlock;
}
@@ -666,7 +667,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
int nr, max_nr;
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd))
+ if (pmd_trans_huge(*pmd) || pmd_is_swap_entry(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
return 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..25d929b2037e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -658,6 +658,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
qp->nr_failed++;
return;
}
+ if (unlikely(pmd_is_swap_entry(*pmd)))
+ return;
folio = pmd_folio(*pmd);
if (is_huge_zero_folio(folio)) {
walk->action = ACTION_CONTINUE;
diff --git a/mm/mincore.c b/mm/mincore.c
index e5d13eea9234..3fee8a7b9d9d 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -172,7 +172,19 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- memset(vec, 1, nr);
+ if (pmd_present(*pmd)) {
+ memset(vec, 1, nr);
+ } else {
+ /*
+ * Non-present PMD: migration, device-private, or PMD
+ * swap entry. Route through mincore_swap() the same way
+ * the PTE path does -- the swap entry covers all 512
+ * slots, so the whole vec gets the same answer.
+ */
+ softleaf_t entry = softleaf_from_pmd(*pmd);
+
+ memset(vec, mincore_swap(entry, false), nr);
+ }
spin_unlock(ptl);
goto out;
}
--
2.52.0
* [v2 12/16] mm: handle PMD swap entries in MADV_WILLNEED
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (10 preceding siblings ...)
2026-06-02 14:24 ` [v2 11/16] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 13/16] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
` (4 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
swapin_walk_pmd_entry() walks PTEs and skips non-present PMDs, so
MADV_WILLNEED is a no-op on a PMD swap entry. Read the whole 2 MB
folio in at PMD order via swapin_sync(BIT(HPAGE_PMD_ORDER)) so the
subsequent fault hits do_huge_pmd_swap_page() and restores the THP
mapping; an order-0 read-ahead would force the fault to split.
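From userspace, the path described above would be reached by a sequence
like the one sketched below. This is a hedged example, not a test from
the series: sketch_willneed_round_trip() and SKETCH_THP_SIZE are
hypothetical names, a 2 MB PMD size is assumed, and whether a PMD swap
entry is ever installed depends on the running kernel, THP settings and
available swap. All madvise() results are ignored on purpose, so the
function only verifies that the data survives.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SKETCH_THP_SIZE (2UL << 20) /* assumed PMD size: 2 MB */

/* Returns 0 if the data is intact afterwards, -1 otherwise. */
static int sketch_willneed_round_trip(void)
{
	size_t len = 2 * SKETCH_THP_SIZE; /* room to align inside */
	unsigned char *map, *thp;
	int ret = 0;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Round up to a 2 MB boundary so one PMD can cover the range. */
	thp = (unsigned char *)(((uintptr_t)map + SKETCH_THP_SIZE - 1) &
				~(uintptr_t)(SKETCH_THP_SIZE - 1));
	assert(thp + SKETCH_THP_SIZE <= map + len);

#ifdef MADV_HUGEPAGE
	(void)madvise(thp, SKETCH_THP_SIZE, MADV_HUGEPAGE);
#endif
	memset(thp, 0x5a, SKETCH_THP_SIZE); /* fault the range in */

#ifdef MADV_PAGEOUT
	/* Ask reclaim to push the range out; may do nothing without swap. */
	(void)madvise(thp, SKETCH_THP_SIZE, MADV_PAGEOUT);
#endif
	/* Hint that the range is needed again soon. */
	(void)madvise(thp, SKETCH_THP_SIZE, MADV_WILLNEED);

	if (thp[0] != 0x5a || thp[SKETCH_THP_SIZE - 1] != 0x5a)
		ret = -1;

	munmap(map, len);
	return ret;
}
```

The final reads fault the range back in if it was swapped out; on a
kernel with this series applied, that fault is where the PMD-order
read-ahead would be expected to pay off.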
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/madvise.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/mm/madvise.c b/mm/madvise.c
index 00539022f804..25f40542b951 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -193,6 +193,46 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
spinlock_t *ptl;
unsigned long addr;
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ pmd_t pmdval = *pmd;
+
+ if (pmd_is_swap_entry(pmdval)) {
+ softleaf_t entry = softleaf_from_pmd(pmdval);
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = start,
+ .real_address = start,
+ .pmd = pmd,
+ };
+ struct swap_info_struct *si;
+ struct folio *folio;
+
+ /*
+ * Pin the swap device under the PMD lock so the
+ * lookup is atomic with the PMD-swap-entry
+ * observation; swapin_sync() requires its caller to
+ * keep the device valid for the duration of the call.
+ */
+ si = get_swap_device(entry);
+ spin_unlock(ptl);
+ if (!si) {
+ cond_resched();
+ return 0;
+ }
+
+ folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+ BIT(HPAGE_PMD_ORDER), &vmf,
+ NULL, 0);
+ if (!IS_ERR_OR_NULL(folio))
+ folio_put(folio);
+ put_swap_device(si);
+ cond_resched();
+ return 0;
+ }
+ spin_unlock(ptl);
+ }
+
for (addr = start; addr < end; addr += PAGE_SIZE) {
pte_t pte;
softleaf_t entry;
--
2.52.0
* [v2 13/16] mm: handle PMD swap entries in UFFDIO_MOVE
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (11 preceding siblings ...)
2026-06-02 14:24 ` [v2 12/16] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 14/16] mm: handle PMD swap entry faults on swap-in Usama Arif
` (3 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
move_pages_huge_pmd() returned -ENOENT for any non-trans_huge,
non-migration PMD, which fails aligned UFFDIO_MOVE on a swapped-out
THP -- the PMD swap entry is a perfectly valid mapping that should
move whole. Splitting via the move_pages_ptes() fallback isn't a
substitute either: __split_huge_pmd_locked() splits a PMD swap entry
into HPAGE_PMD_NR PTE swap entries pointing at the same swap-cache
folio, but move_swap_pte() refuses any swap-cache folio that is still
large and returns -EBUSY.
Add move_swap_pmd(), modeled on move_swap_pte(), that moves the swap
entry whole-PMD and re-anchors the swap-cache folio's anon rmap to
the destination VMA. Reject !pmd_swp_exclusive() entries with -EBUSY
to preserve UFFDIO_MOVE's single-owner semantics, propagate
soft-dirty, and carry the deposited page table across with the
entry.
The dispatcher in move_pages_huge_pmd() now waits for migration on a
PMD migration entry (matching the PTE path) and routes PMD swap
entries through move_swap_pmd() after pinning the swap device,
fetching and locking any cached folio, and arming an mmu_notifier
range so secondary MMUs see the move.
If the swap-cache folio was split (e.g. by deferred_split_scan() or
memory_failure()) between swap-out and UFFDIO_MOVE, src_folio is no
longer PMD-sized but the PMD swap entry still covers all 512 slots.
Moving the entry whole would only re-anchor one folio's anon rmap,
leaving the other 511 with a stale anon_vma. Return -EBUSY in this
case, matching move_pages_ptes()'s rejection of large folios, so the
caller falls back to PTE-level moves.
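The order of checks performed by the dispatcher can be summarised with
a userspace decision model. It is a sketch under assumptions: the struct
and function names are hypothetical, and the locking, the mmu_notifier
range and the later -EAGAIN re-validation inside move_swap_pmd() are not
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_HPAGE_PMD_NR = 512 };

/* Hypothetical snapshot of a non-present, non-trans-huge source PMD. */
struct model_src_pmd {
	bool is_migration;         /* PMD migration entry */
	bool is_swap;              /* PMD swap entry */
	bool exclusive;            /* pmd_swp_exclusive() would be true */
	bool swap_device_pinned;   /* get_swap_device() succeeded */
	bool has_cached_folio;     /* a swap-cache folio was found */
	int cached_folio_nr_pages; /* its size in base pages */
};

/* Returns 0 if the entry may be moved whole, else a negative errno. */
static int model_move_swap_pmd_decision(const struct model_src_pmd *s)
{
	assert(s != NULL);
	if (s->is_migration)
		return -EAGAIN; /* wait for migration, then retry */
	if (!s->is_swap)
		return -ENOENT;
	if (!s->exclusive)
		return -EBUSY;  /* single-owner semantics */
	if (!s->swap_device_pinned)
		return -EAGAIN; /* raced with swapoff */
	if (s->has_cached_folio &&
	    s->cached_folio_nr_pages != MODEL_HPAGE_PMD_NR)
		return -EBUSY;  /* folio was split in the swap cache */
	return 0;
}
```

An entry with no swap-cache folio at all passes these checks; in the
patch that case is covered by the swap_cache_has_folio() re-check under
the PMD locks.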
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 112 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6d3817046d..f1379c8a92e5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2812,6 +2812,62 @@ int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
#endif
#ifdef CONFIG_USERFAULTFD
+/*
+ * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are
+ * acquired here; src_folio (if present) must already be locked. The deposited
+ * page table backing the source THP is moved across with the entry.
+ */
+static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ pmd_t orig_dst_pmd, pmd_t orig_src_pmd,
+ spinlock_t *dst_ptl, spinlock_t *src_ptl,
+ struct folio *src_folio, swp_entry_t entry)
+{
+ pgtable_t src_pgtable;
+ pmd_t moved_pmd;
+
+ /*
+ * The folio may have been freed and reused for a different swap entry
+ * while it was unlocked. Re-verify the association.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
+ double_pt_lock(dst_ptl, src_ptl);
+
+ if (!pmd_same(*src_pmd, orig_src_pmd) ||
+ !pmd_same(*dst_pmd, orig_dst_pmd)) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ /*
+ * If the folio is in the swap cache, re-anchor its anon rmap to the
+ * destination VMA so a future swap-in fault at dst_addr finds it.
+ * Otherwise, re-check that no folio was newly inserted under us.
+ */
+ if (src_folio) {
+ folio_move_anon_rmap(src_folio, dst_vma);
+ src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else if (swap_cache_has_folio(entry)) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ moved_pmd = pmdp_huge_get_and_clear(mm, src_addr, src_pmd);
+ if (pgtable_supports_soft_dirty())
+ moved_pmd = pmd_swp_mksoft_dirty(moved_pmd);
+ set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd);
+
+ src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+ pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+
+ double_pt_unlock(dst_ptl, src_ptl);
+ return 0;
+}
+
/*
* The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
* the caller, but it must return after releasing the page_table_lock.
@@ -2846,11 +2902,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
}
if (!pmd_trans_huge(src_pmdval)) {
- spin_unlock(src_ptl);
if (pmd_is_migration_entry(src_pmdval)) {
+ spin_unlock(src_ptl);
pmd_migration_entry_wait(mm, &src_pmdval);
return -EAGAIN;
}
+ if (pmd_is_swap_entry(src_pmdval)) {
+ swp_entry_t entry;
+ struct swap_info_struct *si;
+
+ /*
+ * UFFDIO_MOVE on anon mappings requires single-owner
+ * semantics; refuse to move a shared swap entry.
+ */
+ if (!pmd_swp_exclusive(src_pmdval)) {
+ spin_unlock(src_ptl);
+ return -EBUSY;
+ }
+
+ entry = softleaf_from_pmd(src_pmdval);
+ spin_unlock(src_ptl);
+
+ /* Pin the swap device against a racing swapoff. */
+ si = get_swap_device(entry);
+ if (unlikely(!si))
+ return -EAGAIN;
+
+ src_folio = swap_cache_get_folio(entry);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+ mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ if (src_folio) {
+ folio_lock(src_folio);
+ if (folio_nr_pages(src_folio) != HPAGE_PMD_NR) {
+ err = -EBUSY;
+ folio_unlock(src_folio);
+ folio_put(src_folio);
+ mmu_notifier_invalidate_range_end(&range);
+ put_swap_device(si);
+ return err;
+ }
+ }
+
+ dst_ptl = pmd_lockptr(mm, dst_pmd);
+ err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
+ dst_pmd, src_pmd, dst_pmdval,
+ src_pmdval, dst_ptl, src_ptl,
+ src_folio, entry);
+
+ mmu_notifier_invalidate_range_end(&range);
+ if (src_folio) {
+ folio_unlock(src_folio);
+ folio_put(src_folio);
+ }
+ put_swap_device(si);
+ return err;
+ }
+ spin_unlock(src_ptl);
return -ENOENT;
}
--
2.52.0
* [v2 14/16] mm: handle PMD swap entry faults on swap-in
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (12 preceding siblings ...)
2026-06-02 14:24 ` [v2 13/16] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 15/16] mm: install PMD swap entries on swap-out Usama Arif
` (2 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault()
when vmf->orig_pmd encodes a swap entry. The handler resolves the
entire 2 MB mapping in one shot, mirroring do_swap_page() (PTE path)
at PMD granularity:
- Look up the folio in the swap cache; on a miss, allocate a
PMD-order folio via swap_cache_alloc_folio() and read from swap.
- After locking, re-validate that the folio still corresponds to our
entry and is still PMD-sized. Between the unlocked cache lookup
and the lock, a racing swap-in on the same entry may have removed
it from the cache via folio_free_swap(), or reclaim / memory_failure
/ deferred-split may have split the folio into smaller folios.
- Restore soft_dirty and uffd_wp from the swap PMD. Map writable
only when the entry was exclusive, the VMA permits writes, and
uffd-wp is not armed. Drop the exclusive marker when the cached
folio is under writeback to an SWP_STABLE_WRITES backend (zram,
encrypted) so the PMD is mapped read-only; a later write COWs
into a fresh folio rather than corrupting the in-flight writeback.
Mirrors do_swap_page().
- When the resulting PMD is read-only but the fault was a write,
update vmf->orig_pmd and call wp_huge_pmd() in the same handler
to COW immediately rather than forcing a second fault. Mask
VM_FAULT_FALLBACK from its return: a PMD-COW that splits to
PTE-level is normal, but the bit is part of VM_FAULT_ERROR and
arch fault handlers BUG() on it without SIGBUS/HWPOISON/SIGSEGV.
Requires exposing wp_huge_pmd() via mm/internal.h.
- Free the swap slot via should_try_to_free_swap() (hoisted from
mm/memory.c into mm/internal.h so PTE- and PMD-level swap-in
share the heuristic).
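The mapping decision and the return-value masking described in the list
above can be modelled in user space. This is only a sketch under stated
assumptions, not the kernel code: every model_* and M_FAULT_* name is
hypothetical, the flag values are illustrative stand-ins, and the model
follows the prose above only (it leaves out the refcount and soft-dirty
checks that the real handler also performs).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; assumptions of this model, not kernel headers. */
#define M_FAULT_OOM		0x001u
#define M_FAULT_SIGBUS		0x002u
#define M_FAULT_MAJOR		0x004u
#define M_FAULT_SIGSEGV		0x040u
#define M_FAULT_FALLBACK	0x800u
#define M_FAULT_ERROR		(M_FAULT_OOM | M_FAULT_SIGBUS | \
				 M_FAULT_SIGSEGV | M_FAULT_FALLBACK)

/* Hypothetical snapshot of the state the handler inspects. */
struct model_swapin {
	bool entry_exclusive;	/* exclusive marker on the swap PMD */
	bool under_writeback;	/* cached folio is under writeback */
	bool stable_writes;	/* backend requires stable writes */
	bool vma_writable;	/* VMA permits writes */
	bool uffd_wp;		/* uffd-wp armed on the swap PMD */
	bool write_fault;	/* the fault was a write */
};

/* Writable only if exclusive, VMA-writable and not uffd-wp protected. */
static bool model_map_writable(const struct model_swapin *s)
{
	bool exclusive = s->entry_exclusive;

	/* Writeback to a stable-writes backend drops the exclusive marker. */
	if (exclusive && s->under_writeback && s->stable_writes)
		exclusive = false;

	return exclusive && s->vma_writable && !s->uffd_wp;
}

/* A write fault that ends up mapped read-only must COW in the same fault. */
static bool model_needs_cow(const struct model_swapin *s)
{
	return s->write_fault && !model_map_writable(s);
}

/* Fold the COW result into the fault result, hiding a benign fallback. */
static unsigned int model_merge_wp_ret(unsigned int ret, unsigned int wp_ret)
{
	/* The swap-in part itself never reports a fallback in this model. */
	assert(!(ret & M_FAULT_FALLBACK));

	wp_ret &= ~M_FAULT_FALLBACK;
	ret |= wp_ret;
	if (ret & M_FAULT_ERROR)
		ret &= M_FAULT_ERROR;
	return ret;
}
```

In this model a major fault whose COW step fell back to PTE level still
reports only the major-fault bit, while a genuine error such as OOM
replaces the informational bits.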
When PMD-order resources are unavailable (folio allocation fails,
the cached folio was split, memcg charge fails, or swapin_folio()
races), split the PMD swap entry into 512 PTE swap entries via
__split_huge_pmd() and return 0. The fault retries and do_swap_page()
takes over per-PTE. This avoids returning VM_FAULT_OOM for transient
PMD-order allocation failures.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 9 ++
mm/huge_memory.c | 198 ++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 36 ++++++++
mm/memory.c | 40 +-------
4 files changed, 247 insertions(+), 36 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1487bf4af1a7..9ec475ccfc91 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -531,6 +531,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+#ifdef CONFIG_THP_SWAP
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+#else
+static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+ return 0;
+}
+#endif
+
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f1379c8a92e5..3fc2f6e5eafa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2314,6 +2314,204 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+#ifdef CONFIG_THP_SWAP
+/**
+ * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry.
+ * @vmf: Fault context. vmf->orig_pmd contains the swap PMD.
+ *
+ * Looks up the folio in the swap cache, and if it is a PMD-sized folio,
+ * maps it directly at the PMD level. If the folio is not in the swap
+ * cache, allocates a PMD-sized folio and reads from swap. On allocation
+ * failure, splits the PMD swap entry into PTE-level entries and retries
+ * at PTE granularity.
+ *
+ * Return: VM_FAULT_* flags.
+ */
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio;
+ struct page *page;
+ struct swap_info_struct *si;
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ softleaf_t entry;
+ swp_entry_t swp_entry;
+ pmd_t pmd;
+ vm_fault_t ret = 0;
+ bool exclusive;
+ rmap_t rmap_flags = RMAP_NONE;
+
+ entry = softleaf_from_pmd(vmf->orig_pmd);
+ if (unlikely(!softleaf_is_swap(entry)))
+ return 0;
+
+ swp_entry = entry;
+
+ /* Prevent swapoff from happening to us. */
+ si = get_swap_device(swp_entry);
+ if (unlikely(!si))
+ return 0;
+
+ folio = swap_cache_get_folio(swp_entry);
+ if (!folio) {
+ folio = swapin_sync(swp_entry, GFP_HIGHUSER_MOVABLE,
+ BIT(HPAGE_PMD_ORDER), vmf, NULL, 0);
+ if (IS_ERR_OR_NULL(folio))
+ goto split_fallback;
+
+ /* Had to read from swap area: Major fault */
+ ret = VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(mm, PGMAJFAULT);
+ }
+
+ ret |= folio_lock_or_retry(folio, vmf);
+ if (ret & VM_FAULT_RETRY)
+ goto out_release;
+
+ /* Verify the folio is still in swap cache and matches our entry */
+ if (unlikely(!folio_matches_swap_entry(folio, swp_entry)))
+ goto out_page;
+
+ /*
+ * Folio should be PMD-sized; if not (e.g. split in swap cache),
+ * split the PMD swap entry and retry at PTE level.
+ */
+ if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+ folio_unlock(folio);
+ folio_put(folio);
+ goto split_fallback;
+ }
+
+ if (unlikely(!folio_test_uptodate(folio))) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_page;
+ }
+
+ page = folio_page(folio, 0);
+ arch_swap_restore(folio_swap(swp_entry, folio), folio);
+
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio))
+ lru_add_drain();
+
+ folio_throttle_swaprate(folio, GFP_KERNEL);
+
+ /* Lock the PMD and verify it hasn't changed */
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) {
+ spin_unlock(vmf->ptl);
+ goto out_page;
+ }
+
+ exclusive = pmd_swp_exclusive(vmf->orig_pmd);
+
+ /*
+ * Some swap backends (e.g. zram) don't support concurrent page
+ * modifications while under writeback. If we map exclusive on such
+ * a backend while the folio is still under writeback, the writeback
+ * may see partial modifications and corrupt the swap slot. Drop the
+ * exclusive marker and only map R/O for that case; further GUP
+ * references can't appear once the page is fully unmapped, so this
+ * is safe.
+ */
+ if (exclusive && folio_test_writeback(folio) &&
+ data_race(si->flags & SWP_STABLE_WRITES))
+ exclusive = false;
+
+ /*
+ * Set up the PMD mapping. Similar to do_swap_page() but at PMD level.
+ */
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+ pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ pmd = pmd_mkyoung(pmd);
+
+ if (pmd_swp_soft_dirty(vmf->orig_pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(vmf->orig_pmd))
+ pmd = pmd_mkuffd_wp(pmd);
+
+ /*
+ * Check exclusivity to determine if we can map writable.
+ */
+ if (exclusive || folio_ref_count(folio) == 1) {
+ if ((vma->vm_flags & VM_WRITE) &&
+ !userfaultfd_huge_pmd_wp(vma, pmd) &&
+ !pmd_needs_soft_dirty_wp(vma, pmd)) {
+ pmd = pmd_mkwrite(pmd, vma);
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ pmd = pmd_mkdirty(pmd);
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+ }
+ }
+ rmap_flags |= RMAP_EXCLUSIVE;
+ }
+
+ flush_icache_pages(vma, page, HPAGE_PMD_NR);
+
+ if (!folio_test_anon(folio))
+ folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags);
+ else
+ folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags);
+
+ folio_put_swap(folio, NULL);
+
+ set_pmd_at(mm, haddr, vmf->pmd, pmd);
+ update_mmu_cache_pmd(vma, haddr, vmf->pmd);
+
+ /* Update orig_pmd for any follow-up wp_huge_pmd() below. */
+ vmf->orig_pmd = pmd;
+
+ /*
+ * Conditionally try to free up the swap cache. Do it after mapping,
+ * so raced page faults will likely see the folio in swap cache and
+ * wait on the folio lock.
+ */
+ if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+ folio_free_swap(folio);
+
+ spin_unlock(vmf->ptl);
+
+ folio_unlock(folio);
+ put_swap_device(si);
+
+ /*
+ * If the write fault wasn't satisfied above (folio is shared without
+ * exclusivity), fall through to wp_huge_pmd to handle COW or
+ * userfaultfd-wp without forcing a second fault.
+ *
+ * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the
+ * PMD; that's a normal outcome — the natural PTE-level refault will
+ * complete the COW. Mask it so callers (and the arch fault handler)
+ * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR.
+ */
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ vm_fault_t wp_ret = wp_huge_pmd(vmf);
+
+ wp_ret &= ~VM_FAULT_FALLBACK;
+ ret |= wp_ret;
+ if (ret & VM_FAULT_ERROR)
+ ret &= VM_FAULT_ERROR;
+ }
+
+ return ret;
+
+out_page:
+ folio_unlock(folio);
+out_release:
+ folio_put(folio);
+ put_swap_device(si);
+ return ret;
+
+split_fallback:
+ __split_huge_pmd(vma, vmf->pmd, haddr, false);
+ put_swap_device(si);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
pgtable_t pgtable;
diff --git a/mm/internal.h b/mm/internal.h
index ace2f8ef1d35..574dafd18709 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -499,6 +499,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
}
vm_fault_t do_swap_page(struct vm_fault *vmf);
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf);
+
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if swap
+ * count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+ struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned int extra_refs,
+ unsigned int fault_flags)
+{
+ if (!folio_test_swapcache(folio))
+ return false;
+ /*
+ * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
+ * cache can help save some IO or memory overhead, but these devices
+ * are fast, and meanwhile, swap cache pinning the slot deferring the
+ * release of metadata or fragmentation is a more critical issue.
+ */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ return true;
+ if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
+ folio_test_mlocked(folio))
+ return true;
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * user. Try freeing the swapcache to get rid of the swapcache
+ * reference only in case it's likely that we'll be the exclusive user.
+ */
+ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
+ folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+}
+
void folio_rotate_reclaimable(struct folio *folio);
bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
diff --git a/mm/memory.c b/mm/memory.c
index 5cf02e394c92..7272a10a0fe0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4497,40 +4497,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
-/*
- * Check if we should call folio_free_swap to free the swap cache.
- * folio_free_swap only frees the swap cache to release the slot if swap
- * count is zero, so we don't need to check the swap count here.
- */
-static inline bool should_try_to_free_swap(struct swap_info_struct *si,
- struct folio *folio,
- struct vm_area_struct *vma,
- unsigned int extra_refs,
- unsigned int fault_flags)
-{
- if (!folio_test_swapcache(folio))
- return false;
- /*
- * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
- * cache can help save some IO or memory overhead, but these devices
- * are fast, and meanwhile, swap cache pinning the slot deferring the
- * release of metadata or fragmentation is a more critical issue.
- */
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
- return true;
- if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
- folio_test_mlocked(folio))
- return true;
- /*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * user. Try freeing the swapcache to get rid of the swapcache
- * reference only in case it's likely that we'll be the exclusive user.
- */
- return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
-}
-
static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -6200,8 +6166,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
return VM_FAULT_FALLBACK;
}
-/* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
@@ -6486,6 +6451,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (pmd_is_migration_entry(vmf.orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
+ else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+ pmd_is_swap_entry(vmf.orig_pmd))
+ return do_huge_pmd_swap_page(&vmf);
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
--
2.52.0
* [v2 15/16] mm: install PMD swap entries on swap-out
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (13 preceding siblings ...)
2026-06-02 14:24 ` [v2 14/16] mm: handle PMD swap entry faults on swap-in Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-02 14:24 ` [v2 16/16] selftests/mm: add PMD swap entry tests Usama Arif
2026-06-09 14:29 ` [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap
entries before unmap, losing the huge mapping across the swap
round-trip and forcing khugepaged to rebuild it later. The
contiguous swap range was already secured when the folio was added
to the swap cache (a non-contiguous allocation would have split the
folio earlier), so the PMD can be replaced by a single PMD-level
swap entry instead.
This patch mirrors the existing PTE swap-out path at PMD
granularity:
- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
swapcache folios, gated on zswap_never_enabled() since zswap
cannot reconstruct a 2 MB folio from per-page blobs (the zswap
case is best handled separately).
- try_to_unmap_one() now has a PMD branch that calls
set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by
HPAGE_PMD_NR before walk_done. TTU_SPLIT_HUGE_PMD remains the
fallback.
- set_pmd_swap_entry() is the installer. Mirroring the PTE
swap-out sequence at PMD granularity, it clears the present
mapping (keeping the original for rollback), bumps the swap_map
refcount for the folio's 512 slots, drops the exclusive mark if
the page was anon-exclusive, propagates the dirty bit to the
folio so writeback is not lost, and installs a swap PMD that
preserves the original soft-dirty / uffd-wp / exclusive bits.
Any failing step rolls back the present mapping.
The swap entry value matches what 512 PTE swap entries would
encode, so swap_map refcounting is unchanged: each of the 512 slots
carries a count of 1, released individually on later split or
together on swap-in.
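As a rough illustration of the gating and accounting described above,
the following user-space model may help. It is a sketch under stated
assumptions rather than the kernel implementation: all model_* names
are hypothetical, and it assumes 4 KB base pages with a 2 MB PMD, which
gives the 512 slots mentioned in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry: 4 KB base pages, 2 MB PMD, hence 512 pages per PMD. */
#define MODEL_PAGE_SIZE		4096UL
#define MODEL_PMD_SIZE		(2UL * 1024 * 1024)
#define MODEL_HPAGE_PMD_NR	((long)(MODEL_PMD_SIZE / MODEL_PAGE_SIZE))

/* Hypothetical stand-in for the folio properties reclaim looks at. */
struct model_folio {
	bool pmd_mappable;
	bool in_swapcache;
};

/* Hypothetical stand-in for the per-mm RSS counters. */
struct model_mm_counters {
	long anon_pages;
	long swap_ents;
};

/*
 * Reclaim keeps asking for a PMD split unless a PMD swap entry can be
 * used: THP swap built in, folio already in the swap cache, and zswap
 * never enabled.
 */
static bool model_wants_split(const struct model_folio *f,
			      bool thp_swap_enabled,
			      bool zswap_never_enabled)
{
	bool can_use_pmd_entry = thp_swap_enabled && f->in_swapcache &&
				 zswap_never_enabled;

	return f->pmd_mappable && !can_use_pmd_entry;
}

/* One PMD swap-out moves a whole PMD worth of pages between counters. */
static void model_account_pmd_swapout(struct model_mm_counters *c)
{
	assert(c->anon_pages >= MODEL_HPAGE_PMD_NR);
	c->anon_pages -= MODEL_HPAGE_PMD_NR;
	c->swap_ents += MODEL_HPAGE_PMD_NR;
}
```

The counter movement is the same total that 512 PTE swap entries would
produce; only the page-table representation differs.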
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 78 +++++++++++++++++++++++++++++++++++
mm/rmap.c | 20 +++++++++
mm/vmscan.c | 14 ++++++-
mm/vmstat.c | 1 +
6 files changed, 115 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9ec475ccfc91..b746f8c8db69 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -533,6 +533,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
#ifdef CONFIG_THP_SWAP
vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio);
#else
static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
+ THP_SWPOUT_PMD,
#endif
#ifdef CONFIG_BALLOON
BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3fc2f6e5eafa..1fed86065fd9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5385,3 +5385,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
trace_remove_migration_pmd(address, pmd_val(pmde));
}
#endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ * pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *page = folio_page(folio, 0);
+ bool anon_exclusive;
+ pmd_t pmdval;
+ swp_entry_t entry;
+ pmd_t pmdswp;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return 0;
+
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ WARN_ON_ONCE(1);
+ return -EBUSY;
+ }
+
+ flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+ pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (folio_dup_swap(folio, NULL) < 0) {
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -ENOMEM;
+ }
+
+ /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+ anon_exclusive = PageAnonExclusive(page);
+ if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+ folio_put_swap(folio, NULL);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -EBUSY;
+ }
+
+ if (pmd_dirty(pmdval))
+ folio_mark_dirty(folio);
+
+ entry = folio->swap;
+ pmdswp = softleaf_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ if (pmd_uffd_wp(pmdval))
+ pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ if (anon_exclusive)
+ pmdswp = pmd_swp_mkexclusive(pmdswp);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
+
+ count_vm_event(THP_SWPOUT_PMD);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 0fb7a1b82cf3..ffc7aa62a29e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2079,6 +2079,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_abort;
}
+#ifdef CONFIG_THP_SWAP
+ /*
+ * If the folio is in the swap cache and we're not
+ * asked to split, install a PMD-level swap entry.
+ */
+ if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+ folio_test_anon(folio) &&
+ folio_test_swapcache(folio)) {
+ if (set_pmd_swap_entry(&pvmw, folio))
+ goto walk_abort;
+
+ mm_prepare_for_swap_entries(mm);
+ add_mm_counter(mm, MM_ANONPAGES,
+ -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS,
+ HPAGE_PMD_NR);
+ goto walk_done;
+ }
+#endif
+
if (flags & TTU_SPLIT_HUGE_PMD) {
/*
* We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8a90911bf88..0f376fbf9bb3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
#include <linux/swapops.h>
#include <linux/sched/sysctl.h>
+#include <linux/zswap.h>
#include "internal.h"
#include "swap.h"
@@ -1332,7 +1333,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum ttu_flags flags = TTU_BATCH_FLUSH;
bool was_swapbacked = folio_test_swapbacked(folio);
- if (folio_test_pmd_mappable(folio))
+ /*
+ * With THP_SWAP, PMD-mappable folios already in the
+ * swap cache can be unmapped with a PMD-level swap
+ * entry, avoiding the cost of splitting the PMD.
+ * Skip this when zswap has been enabled because
+ * zswap stores pages individually and cannot
+ * reconstruct a large folio on swap-in.
+ */
+ if (folio_test_pmd_mappable(folio) &&
+ !(IS_ENABLED(CONFIG_THP_SWAP) &&
+ folio_test_swapcache(folio) &&
+ zswap_never_enabled()))
flags |= TTU_SPLIT_HUGE_PMD;
/*
* Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..9b4963a7eb04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
[I(THP_SWPOUT)] = "thp_swpout",
[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
+ [I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
#endif
#ifdef CONFIG_BALLOON
[I(BALLOON_INFLATE)] = "balloon_inflate",
--
2.52.0
* [v2 16/16] selftests/mm: add PMD swap entry tests
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (14 preceding siblings ...)
2026-06-02 14:24 ` [v2 15/16] mm: install PMD swap entries on swap-out Usama Arif
@ 2026-06-02 14:24 ` Usama Arif
2026-06-09 14:29 ` [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
16 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-02 14:24 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team, Usama Arif
Exercise the PMD swap entry paths. The tests allocate a PMD-mapped
THP, write a known pattern, swap it out via MADV_PAGEOUT, and then
exercise different code paths:
- swap-out / swap-in round-trip with data verification
- fork with read-only access from both parent and child
- fork with writes in both processes to verify COW isolation
- repeated swap cycles to try to catch reference counting issues
- write fault on a swapped PMD to verify dirty handling
- munmap of a swapped PMD (zap_huge_pmd swap slot cleanup)
- mprotect on a swapped PMD (change_non_present_huge_pmd)
- mremap of a swapped PMD (move_soft_dirty_pmd)
- pagemap reading (pagemap_pmd_range_thp softleaf_has_pfn guard)
- MADV_FREE on a swapped PMD: verifies swap slots are freed via
pagemap and the memory reads back as zero
- UFFDIO_MOVE on a swapped PMD (move_pages_huge_pmd swap path);
verifies the entry transfers without splitting and that the
destination faults back in as a THP
- swapoff with active PMD swap entries (unuse_pmd_range split)
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/pmd_swap.c | 672 ++++++++++++++++++++++++++
2 files changed, 673 insertions(+)
create mode 100644 tools/testing/selftests/mm/pmd_swap.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971..d442dac8460c 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -105,6 +105,7 @@ TEST_GEN_FILES += guard-regions
TEST_GEN_FILES += merge
TEST_GEN_FILES += rmap
TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += pmd_swap
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/pmd_swap.c b/tools/testing/selftests/mm/pmd_swap.c
new file mode 100644
index 000000000000..01897bfa17dd
--- /dev/null
+++ b/tools/testing/selftests/mm/pmd_swap.c
@@ -0,0 +1,672 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test PMD-level swap entries.
+ *
+ * Verifies that when a PMD-mapped THP is swapped out the kernel installs
+ * a single PMD-level swap entry (instead of splitting into 512 PTE-level
+ * entries), and that operations on the swapped region behave correctly:
+ * basic - swap out + swap in preserves data
+ * fork - parent and child both see the data
+ * fork_cow - COW after fork keeps parent's data isolated
+ * cycles - repeated swap out/in does not corrupt data
+ * write - faulting in via a write keeps the rest of the THP intact
+ * munmap - munmap on a PMD swap entry frees swap slots cleanly
+ * mprotect - mprotect on a PMD swap entry preserves data
+ * mremap - mremap on a PMD swap entry preserves data
+ * pagemap - pagemap reports the entries as swapped
+ * madvise_free - MADV_FREE on a PMD swap entry does not crash
+ * madvise_willneed - MADV_WILLNEED reads the THP in at PMD order
+ * uffdio_move - UFFDIO_MOVE moves a PMD swap entry whole-PMD
+ * swapoff - swapoff faults the THP back in (needs PMD_SWAP_DEVICE)
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+#include <sys/random.h>
+#include <sys/swap.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <time.h>
+
+#include "kselftest_harness.h"
+#include "vm_util.h"
+
+static bool check_swapped(int pagemap_fd, char *addr, unsigned long size)
+{
+ unsigned long off;
+
+ for (off = 0; off < size; off += getpagesize())
+ if (!pagemap_is_swapped(pagemap_fd, addr + off))
+ return false;
+ return true;
+}
+
+static bool swap_available(int pagemap_fd)
+{
+ char *p;
+ bool ret;
+
+ p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (p == MAP_FAILED)
+ return false;
+
+ memset(p, 0xab, getpagesize());
+ madvise(p, getpagesize(), MADV_PAGEOUT);
+ ret = pagemap_is_swapped(pagemap_fd, p);
+ munmap(p, getpagesize());
+ return ret;
+}
+
+static unsigned long read_vm_event(const char *name)
+{
+ char line[256];
+ size_t name_len = strlen(name);
+ unsigned long val = 0;
+ FILE *f;
+
+ f = fopen("/proc/vmstat", "r");
+ if (!f)
+ return 0;
+ while (fgets(line, sizeof(line), f)) {
+ if (!strncmp(line, name, name_len) && line[name_len] == ' ') {
+ val = strtoul(line + name_len + 1, NULL, 10);
+ break;
+ }
+ }
+ fclose(f);
+ return val;
+}
+
+static bool read_pmd_mthp_stat(unsigned long pmd_size, const char *name,
+ unsigned long *val)
+{
+ char path[256];
+ FILE *f;
+ int ret;
+
+ ret = snprintf(path, sizeof(path),
+ "/sys/kernel/mm/transparent_hugepage/hugepages-%lukB/stats/%s",
+ pmd_size >> 10, name);
+ if (ret < 0 || ret >= sizeof(path))
+ return false;
+
+ f = fopen(path, "r");
+ if (!f)
+ return false;
+
+ ret = fscanf(f, "%lu", val);
+ fclose(f);
+ return ret == 1;
+}
+
+static unsigned int random_seed(void)
+{
+ unsigned int seed;
+
+ if (getrandom(&seed, sizeof(seed), 0) != sizeof(seed))
+ seed = (unsigned int)time(NULL);
+ return seed;
+}
+
+static unsigned char pattern_byte(unsigned int seed, unsigned long off)
+{
+ return (unsigned char)(seed + off);
+}
+
+static void fill_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+ unsigned long i;
+
+ for (i = 0; i < size; i++)
+ buf[i] = (char)pattern_byte(seed, i);
+}
+
+static bool verify_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+ unsigned long i;
+
+ for (i = 0; i < size; i++)
+ if ((unsigned char)buf[i] != pattern_byte(seed, i))
+ return false;
+ return true;
+}
+
+/*
+ * mmap an anonymous PMD-aligned region of pmd_size bytes. Over-allocates
+ * by one PMD and trims the unaligned head/tail so the returned address is
+ * PMD-aligned (required for whole-PMD UFFDIO_MOVE).
+ */
+static char *mmap_pmd_aligned(unsigned long pmd_size)
+{
+ unsigned long pad = pmd_size;
+ char *raw, *aligned;
+
+ raw = mmap(NULL, pmd_size + pad, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (raw == MAP_FAILED)
+ return MAP_FAILED;
+
+ aligned = (char *)(((uintptr_t)raw + pmd_size - 1) & ~(pmd_size - 1));
+ if (aligned != raw)
+ munmap(raw, aligned - raw);
+ if (aligned + pmd_size != raw + pmd_size + pad)
+ munmap(aligned + pmd_size,
+ (raw + pmd_size + pad) - (aligned + pmd_size));
+ return aligned;
+}
+
+/*
+ * mmap a PMD-aligned PMD-sized region, request THP, fill with a pattern,
+ * and swap it out. Verifies via the thp_swpout_pmd vmstat counter that
+ * the swap-out installed a PMD swap entry rather than splitting to PTEs.
+ */
+static char *alloc_fill_swap_thp(unsigned long pmd_size, int pagemap_fd,
+ unsigned int seed)
+{
+ unsigned long pmd_before, pmd_after;
+ char *mem;
+
+ mem = mmap_pmd_aligned(pmd_size);
+ if (mem == MAP_FAILED)
+ return MAP_FAILED;
+
+ madvise(mem, pmd_size, MADV_HUGEPAGE);
+ fill_pattern(mem, pmd_size, seed);
+
+ pmd_before = read_vm_event("thp_swpout_pmd");
+
+ if (madvise(mem, pmd_size, MADV_PAGEOUT) ||
+ !check_swapped(pagemap_fd, mem, pmd_size)) {
+ munmap(mem, pmd_size);
+ return MAP_FAILED;
+ }
+
+ pmd_after = read_vm_event("thp_swpout_pmd");
+ printf("# thp_swpout_pmd: %lu -> %lu\n", pmd_before, pmd_after);
+ if (pmd_after - pmd_before < 1) {
+ munmap(mem, pmd_size);
+ return MAP_FAILED;
+ }
+ return mem;
+}
+
+FIXTURE(pmd_swap)
+{
+ unsigned long pmd_size;
+ int pagemap_fd;
+ unsigned int seed;
+};
+
+FIXTURE_SETUP(pmd_swap)
+{
+ self->pagemap_fd = -1;
+
+ self->pmd_size = read_pmd_pagesize();
+ if (!self->pmd_size)
+ SKIP(return, "Cannot determine PMD size\n");
+
+ self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (self->pagemap_fd < 0)
+ SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+ if (!swap_available(self->pagemap_fd))
+ SKIP(return, "Swap not available or not working\n");
+
+ self->seed = random_seed();
+}
+
+FIXTURE_TEARDOWN(pmd_swap)
+{
+ if (self->pagemap_fd >= 0)
+ close(self->pagemap_fd);
+}
+
+/*
+ * Allocate a PMD-sized THP, write a pattern, swap it out, read it back,
+ * verify the pattern.
+ */
+TEST_F(pmd_swap, basic)
+{
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Allocate a THP, swap it out, fork, verify both parent and child see
+ * the correct data.
+ */
+TEST_F(pmd_swap, fork)
+{
+ char *mem;
+ pid_t pid;
+ int status;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ _exit(verify_pattern(mem, self->pmd_size, self->seed) ? 0 : 1);
+ }
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+ ASSERT_EQ(waitpid(pid, &status, 0), pid);
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 0);
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap out, fork, then have parent and child write different patterns.
+ * Exercises COW on shared PMD swap entries: writes after fork must
+ * trigger copy-on-write so the parent's data stays isolated.
+ */
+TEST_F(pmd_swap, fork_cow)
+{
+ unsigned int parent_seed = self->seed;
+ unsigned int child_seed = ~self->seed;
+ char *mem;
+ pid_t pid;
+ int status;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, parent_seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ fill_pattern(mem, self->pmd_size, child_seed);
+ _exit(verify_pattern(mem, self->pmd_size, child_seed) ? 0 : 1);
+ }
+
+ ASSERT_EQ(waitpid(pid, &status, 0), pid);
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, parent_seed));
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 0);
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap a THP out and in repeatedly without data corruption.
+ */
+TEST_F(pmd_swap, cycles)
+{
+ const int num_cycles = 5;
+ char *mem;
+ int cycle;
+
+ for (cycle = 0; cycle < num_cycles; cycle++) {
+ unsigned int seed = self->seed + cycle;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP at cycle %d\n",
+ cycle);
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+ munmap(mem, self->pmd_size);
+ }
+}
+
+/*
+ * Swap out, fault in via a write to the first page, verify the write
+ * sticks and the rest of the THP is preserved.
+ */
+TEST_F(pmd_swap, write)
+{
+ unsigned int seed = self->seed;
+ char *mem;
+ unsigned long i;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ mem[0] = 0xbb;
+ ASSERT_EQ(mem[0], (char)0xbb);
+
+ for (i = 1; i < self->pmd_size; i++)
+ ASSERT_EQ((unsigned char)mem[i], pattern_byte(seed, i));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * munmap while the folio is swapped out. Exercises zap_huge_pmd() on a
+ * PMD swap entry — must free the swap slots without trying to look up
+ * a folio.
+ */
+TEST_F(pmd_swap, munmap)
+{
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * Change protection on a swapped PMD entry, then fault back in and
+ * verify data. Exercises change_non_present_huge_pmd().
+ */
+TEST_F(pmd_swap, mprotect)
+{
+ unsigned int seed = self->seed;
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ), 0);
+ ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ | PROT_WRITE), 0);
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * UFFDIO_MOVE a PMD swap entry from src to a registered dst. Exercises
+ * move_pages_huge_pmd() handling of pmd_is_swap_entry: the whole PMD swap
+ * entry must move to dst without splitting, and the destination must
+ * read back the original pattern after a swap-in fault.
+ */
+TEST_F(pmd_swap, uffdio_move)
+{
+ unsigned int seed = self->seed;
+ struct uffdio_register reg = {};
+ struct uffdio_move move = {};
+ struct uffdio_api api = {};
+ char *src, *dst;
+ int uffd;
+
+ dst = mmap_pmd_aligned(self->pmd_size);
+ if (dst == MAP_FAILED)
+ SKIP(return, "Could not mmap aligned dst\n");
+
+ src = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (src == MAP_FAILED) {
+ munmap(dst, self->pmd_size);
+ SKIP(return, "Could not create swapped THP\n");
+ }
+ if ((uintptr_t)src & (self->pmd_size - 1)) {
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "src not PMD-aligned\n");
+ }
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "userfaultfd unavailable\n");
+ }
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_MOVE;
+ if (ioctl(uffd, UFFDIO_API, &api) ||
+ !(api.features & UFFD_FEATURE_MOVE)) {
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "UFFD_FEATURE_MOVE unsupported\n");
+ }
+
+ reg.range.start = (unsigned long)dst;
+ reg.range.len = self->pmd_size;
+ reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ SKIP(return, "UFFDIO_REGISTER failed\n");
+ }
+
+ move.dst = (unsigned long)dst;
+ move.src = (unsigned long)src;
+ move.len = self->pmd_size;
+ if (ioctl(uffd, UFFDIO_MOVE, &move)) {
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+ ASSERT_EQ(errno, 0);
+ }
+ ASSERT_EQ(move.move, self->pmd_size);
+
+ /*
+ * dst inherits the PMD swap entry; reading it must fault the THP
+ * back in via do_huge_pmd_swap_page() and yield the original data.
+ */
+ ASSERT_TRUE(check_swapped(self->pagemap_fd, dst, self->pmd_size));
+ ASSERT_TRUE(verify_pattern(dst, self->pmd_size, seed));
+ /* The whole-PMD path must reinstate a THP, not 512 PTE folios. */
+ ASSERT_TRUE(check_huge_anon(dst, 1, self->pmd_size));
+
+ close(uffd);
+ munmap(src, self->pmd_size);
+ munmap(dst, self->pmd_size);
+}
+
+/*
+ * Move a swapped PMD entry to a new address, fault in, verify data.
+ * Exercises move_huge_pmd() and move_soft_dirty_pmd().
+ */
+TEST_F(pmd_swap, mremap)
+{
+ unsigned int seed = self->seed;
+ char *mem, *new_mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ new_mem = mremap(mem, self->pmd_size, self->pmd_size, MREMAP_MAYMOVE);
+ if (new_mem == MAP_FAILED) {
+ munmap(mem, self->pmd_size);
+ ASSERT_NE(new_mem, MAP_FAILED);
+ }
+
+ ASSERT_TRUE(verify_pattern(new_mem, self->pmd_size, seed));
+
+ munmap(new_mem, self->pmd_size);
+}
+
+/*
+ * Read /proc/self/pagemap on a PMD swap entry. Exercises the pagemap
+ * PMD walker which must handle PMD swap entries without trying to
+ * convert them to a page via softleaf_to_page().
+ */
+TEST_F(pmd_swap, pagemap)
+{
+ char *mem;
+ uint64_t entry;
+ unsigned long off;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ for (off = 0; off < self->pmd_size; off += getpagesize()) {
+ entry = pagemap_get_entry(self->pagemap_fd, mem + off);
+ /* Bit 62 = swapped */
+ ASSERT_TRUE(entry & (1ULL << 62));
+ }
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * MADV_FREE on a swapped-out PMD must free the swap slots and clear the
+ * entry. After the call, pagemap must no longer report the pages as
+ * swapped, and accessing the region must yield zero pages.
+ */
+TEST_F(pmd_swap, madvise_free)
+{
+ char *mem;
+ unsigned long i;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+ ASSERT_EQ(madvise(mem, self->pmd_size, MADV_FREE), 0);
+ ASSERT_FALSE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+
+ for (i = 0; i < self->pmd_size; i += getpagesize())
+ ASSERT_EQ(mem[i], 0);
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * MADV_WILLNEED on a swapped-out PMD-mapped THP must not split the
+ * mapping. After WILLNEED + a first-touch fault, the region must come
+ * back as a single PMD-sized THP with the original data intact.
+ */
+TEST_F(pmd_swap, madvise_willneed)
+{
+ unsigned long swpin_before, swpin_after;
+ volatile char c;
+ char *mem;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ if (!read_pmd_mthp_stat(self->pmd_size, "swpin", &swpin_before)) {
+ munmap(mem, self->pmd_size);
+ SKIP(return, "Cannot read PMD-sized THP swpin stat\n");
+ }
+
+ ASSERT_EQ(madvise(mem, self->pmd_size, MADV_WILLNEED), 0);
+ ASSERT_TRUE(read_pmd_mthp_stat(self->pmd_size, "swpin",
+ &swpin_after)) {
+ munmap(mem, self->pmd_size);
+ }
+ ASSERT_GT(swpin_after, swpin_before) {
+ munmap(mem, self->pmd_size);
+ }
+
+ /* First touch faults the THP back in via do_huge_pmd_swap_page(). */
+ c = mem[0];
+ (void)c;
+
+ ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size));
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+ munmap(mem, self->pmd_size);
+}
+
+/*
+ * swapoff requires a dedicated swap device path. Use a separate fixture
+ * that picks the device up from the PMD_SWAP_DEVICE environment variable
+ * and skips when unset.
+ */
+FIXTURE(pmd_swap_swapoff)
+{
+ unsigned long pmd_size;
+ int pagemap_fd;
+ const char *swap_dev;
+ unsigned int seed;
+};
+
+FIXTURE_SETUP(pmd_swap_swapoff)
+{
+ self->pagemap_fd = -1;
+ self->swap_dev = getenv("PMD_SWAP_DEVICE");
+ if (!self->swap_dev)
+ SKIP(return, "PMD_SWAP_DEVICE env var not set\n");
+
+ self->pmd_size = read_pmd_pagesize();
+ if (!self->pmd_size)
+ SKIP(return, "Cannot determine PMD size\n");
+
+ self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (self->pagemap_fd < 0)
+ SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+ if (!swap_available(self->pagemap_fd))
+ SKIP(return, "Swap not available or not working\n");
+
+ self->seed = random_seed();
+}
+
+FIXTURE_TEARDOWN(pmd_swap_swapoff)
+{
+ if (self->pagemap_fd >= 0)
+ close(self->pagemap_fd);
+}
+
+/*
+ * Swap out a THP, then turn off swap. The kernel must fault the entire
+ * THP back in via unuse_pmd(), preserving the huge mapping. Verify data
+ * is intact and the THP mapping is preserved.
+ */
+TEST_F(pmd_swap_swapoff, basic)
+{
+ unsigned int seed = self->seed;
+ char *mem;
+ int ret, err;
+
+ mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+ if (mem == MAP_FAILED)
+ SKIP(return, "Could not create swapped THP\n");
+
+ ret = swapoff(self->swap_dev);
+ err = errno;
+ ASSERT_EQ(ret, 0) {
+ TH_LOG("swapoff(%s) failed: %s", self->swap_dev, strerror(err));
+ munmap(mem, self->pmd_size);
+ }
+
+ ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed)) {
+ swapon(self->swap_dev, 0);
+ munmap(mem, self->pmd_size);
+ }
+
+ ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size)) {
+ swapon(self->swap_dev, 0);
+ munmap(mem, self->pmd_size);
+ }
+
+ ret = swapon(self->swap_dev, 0);
+ err = errno;
+ ASSERT_EQ(ret, 0) {
+ TH_LOG("swapon(%s) failed: %s", self->swap_dev, strerror(err));
+ munmap(mem, self->pmd_size);
+ }
+
+ munmap(mem, self->pmd_size);
+}
+
+TEST_HARNESS_MAIN
--
2.52.0
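For readers following the pagemap-based checks in the tests above (the
"Bit 62 = swapped" assertion in the pagemap test and the check_swapped()
calls), here is a minimal standalone sketch of how such a check can be
written against /proc/self/pagemap. It is not the selftest's actual helper:
the names pagemap_read_entry(), pagemap_entry_swapped(),
pagemap_entry_present() and range_is_swapped() are hypothetical and chosen
for illustration only. The bit positions (63 = present, 62 = swapped) and the
layout of one 64-bit entry per virtual page are the ones documented for the
pagemap interface.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Documented pagemap entry bits. */
#define PM_SWAPPED (1ULL << 62)
#define PM_PRESENT (1ULL << 63)

/* Hypothetical helper: true if the entry describes a swapped-out page. */
static inline bool pagemap_entry_swapped(uint64_t entry)
{
	return (entry & PM_SWAPPED) != 0;
}

/* Hypothetical helper: true if the entry describes a page present in RAM. */
static inline bool pagemap_entry_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}

/*
 * Hypothetical helper: read the pagemap entry covering addr.
 * fd must be an open descriptor for /proc/self/pagemap.
 * Returns 0 on success, -1 on a short or failed read.
 */
static inline int pagemap_read_entry(int fd, const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t off;

	if (page_size <= 0)
		return -1;

	/* One 64-bit entry per virtual page, indexed by virtual page number. */
	off = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
	      (off_t)sizeof(*entry);

	return pread(fd, entry, sizeof(*entry), off) == (ssize_t)sizeof(*entry) ?
	       0 : -1;
}

/*
 * Hypothetical helper: true only if every base page in [start, start + len)
 * reports the swapped bit.
 */
static inline bool range_is_swapped(int fd, const char *start, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	size_t off;

	if (page_size <= 0)
		return false;

	for (off = 0; off < len; off += (size_t)page_size) {
		if (pagemap_read_entry(fd, start + off, &entry))
			return false;
		if (!pagemap_entry_swapped(entry))
			return false;
	}
	return true;
}
```

A caller would open /proc/self/pagemap with O_RDONLY and pass the descriptor
together with the mapping's start address and the PMD size. The pagemap test
above asserts bit 62 for every base page of the swapped-out 2 MB range, and
the sketch performs the same per-page walk.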
* Re: [v2 00/16] mm: PMD-level swap entries for anonymous THPs
2026-06-02 14:24 [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
` (15 preceding siblings ...)
2026-06-02 14:24 ` [v2 16/16] selftests/mm: add PMD swap entry tests Usama Arif
@ 2026-06-09 14:29 ` Usama Arif
2026-06-10 12:24 ` David Hildenbrand (Arm)
16 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2026-06-09 14:29 UTC (permalink / raw)
To: Andrew Morton, david, chrisl, kasong, ljs, ziy,
Linux Memory Management List
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team
On 02/06/2026 15:24, Usama Arif wrote:
> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> unmap.
>
> This series introduces a PMD-level swap entry. The huge mapping is
> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> resolves the entire 2 MB region in a single fault on swap-in,
> no khugepaged involvement is needed. swap_map metadata is identical
> either way (512 single-slot counts), so the PTE split buys nothing
> on the swap side, it is purely a page-table representation change.
>
Hello!
Just following up if there were any reviews/comments on this series!
I know it's a large series but was just checking if there was any
feedback?
Thanks!
Usama
* Re: [v2 00/16] mm: PMD-level swap entries for anonymous THPs
2026-06-09 14:29 ` [v2 00/16] mm: PMD-level swap entries for anonymous THPs Usama Arif
@ 2026-06-10 12:24 ` David Hildenbrand (Arm)
2026-06-10 13:01 ` Lance Yang
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-10 12:24 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, chrisl, kasong, ljs, ziy,
Linux Memory Management List
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Liam R. Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
linux-kernel, nphamcs, shikemeng, kernel-team
On 6/9/26 16:29, Usama Arif wrote:
>
>
> On 02/06/2026 15:24, Usama Arif wrote:
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in,
>> no khugepaged involvement is needed. swap_map metadata is identical
>> either way (512 single-slot counts), so the PTE split buys nothing
>> on the swap side, it is purely a page-table representation change.
>>
>
> Hello!
>
> Just following up if there were any reviews/comments on this series!
>
> I know it's a large series but was just checking if there was any
> feedback?
It shall be reviewed. We just finished the mTHP khugepaged review to get it into
7.2, so we've all been rather busy.
(I mean, just take a look at the THP-related flood of patches we are fighting
with on a daily basis, it's not funny anymore)
This is clearly going to be 7.3 material, so there is plenty of time given that
the merge window is about to open soon.
--
Cheers,
David
* Re: [v2 00/16] mm: PMD-level swap entries for anonymous THPs
2026-06-10 12:24 ` David Hildenbrand (Arm)
@ 2026-06-10 13:01 ` Lance Yang
2026-06-10 13:48 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 22+ messages in thread
From: Lance Yang @ 2026-06-10 13:01 UTC (permalink / raw)
To: David Hildenbrand (Arm), Usama Arif
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel, ljs,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Linux Memory Management List, Andrew Morton, Liam R. Howlett,
ryan.roberts, chrisl, Vlastimil Babka, linux-kernel, nphamcs,
shikemeng, kernel-team, kasong, ziy
On 2026/6/10 20:24, David Hildenbrand (Arm) wrote:
> On 6/9/26 16:29, Usama Arif wrote:
>>
>>
>> On 02/06/2026 15:24, Usama Arif wrote:
>>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>>> unmap.
>>>
>>> This series introduces a PMD-level swap entry. The huge mapping is
>>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>>> resolves the entire 2 MB region in a single fault on swap-in,
>>> no khugepaged involvement is needed. swap_map metadata is identical
>>> either way (512 single-slot counts), so the PTE split buys nothing
>>> on the swap side, it is purely a page-table representation change.
>>>
>>
>> Hello!
>>
>> Just following up if there were any reviews/comments on this series!
>>
>> I know it's a large series but was just checking if there was any
>> feedback?
>
> It shall be reviewed. We just finished the mTHP khugepaged review to get it into
> 7.2, so we've all been rather busy.
Right, mTHP khugepaged was a rough one. Glad we got it over the line,
but yeah, there's just been a lot of THP work lately. pretty nonstop ...
> (I mean, just take a look at the THP-related flood of patches we are fighting
> with on a daily basis, it's not funny anymore)
>
> This is clearly going to be 7.3 material, so there is plenty of time given that
> the merge window is about to open soon.
Usama, I'll try to make this one a priority too. Looks interesting :P
Cheers, Lance
* Re: [v2 00/16] mm: PMD-level swap entries for anonymous THPs
2026-06-10 13:01 ` Lance Yang
@ 2026-06-10 13:48 ` David Hildenbrand (Arm)
2026-06-10 14:44 ` Usama Arif
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-10 13:48 UTC (permalink / raw)
To: Lance Yang, Usama Arif
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel, ljs,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Linux Memory Management List, Andrew Morton, Liam R. Howlett,
ryan.roberts, chrisl, Vlastimil Babka, linux-kernel, nphamcs,
shikemeng, kernel-team, kasong, ziy
On 6/10/26 15:01, Lance Yang wrote:
>
>
> On 2026/6/10 20:24, David Hildenbrand (Arm) wrote:
>> On 6/9/26 16:29, Usama Arif wrote:
>>>
>>>
>>>
>>> Hello!
>>>
>>> Just following up if there were any reviews/comments on this series!
>>>
>>> I know it's a large series but was just checking if there was any
>>> feedback?
>>
>> It shall be reviewed. We just finished the mTHP khugepaged review to get it into
>> 7.2, so we've all been rather busy.
>
> Right, mTHP khugepaged was a rough one. Glad we got it over the line,
> but yeah, there's just been a lot of THP work lately. pretty nonstop ...
>
>> (I mean, just take a look at the THP-related flood of patches we are fighting
>> with on a daily basis, it's not funny anymore)
>>
>> This is clearly going to be 7.3 material, so there is plenty of time given that
>> the merge window is about to open soon.
>
> Usama, I'll try to make this one a priority too. Looks interesting :P
I have two other bigger series to review, but I should soon get to this as well.
--
Cheers,
David
* Re: [v2 00/16] mm: PMD-level swap entries for anonymous THPs
2026-06-10 13:48 ` David Hildenbrand (Arm)
@ 2026-06-10 14:44 ` Usama Arif
0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-06-10 14:44 UTC (permalink / raw)
To: David Hildenbrand (Arm), Lance Yang
Cc: ying.huang, Baoquan He, willy, youngjun.park, hannes, riel, ljs,
shakeel.butt, alex, kas, baohua, dev.jain, baolin.wang, npache,
Linux Memory Management List, Andrew Morton, Liam R. Howlett,
ryan.roberts, chrisl, Vlastimil Babka, linux-kernel, nphamcs,
shikemeng, kernel-team, kasong, ziy
On 10/06/2026 14:48, David Hildenbrand (Arm) wrote:
> On 6/10/26 15:01, Lance Yang wrote:
>>
>>
>> On 2026/6/10 20:24, David Hildenbrand (Arm) wrote:
>>> On 6/9/26 16:29, Usama Arif wrote:
>>>>
>>>>
>>>>
>>>> Hello!
>>>>
>>>> Just following up if there were any reviews/comments on this series!
>>>>
>>>> I know it's a large series but was just checking if there was any
>>>> feedback?
>>>
>>> It shall be reviewed. We just finished the mTHP khugepaged review to get it into
>>> 7.2, so we've all been rather busy.
>>
>> Right, mTHP khugepaged was a rough one. Glad we got it over the line,
>> but yeah, there's just been a lot of THP work lately. pretty nonstop ...
>>
Yeah, it's definitely a lot. I have set myself a target of leaving review comments on
at least 2 mm patches per day, but even that can sometimes be
difficult! I will try to help out more with reviews.
>>> (I mean, just take a look at the THP-related flood of patches we are fighting
>>> with on a daily basis, it's not funny anymore)
>>>
>>> This is clearly going to be 7.3 material, so there is plenty of time given that
>>> the merge window is about to open soon.
>>
>> Usama, I'll try to make this one a priority too. Looks interesting :P
Thanks Lance!
>
> I have two other bigger series to review, but I should soon get to this as well.
>
No worries at all! Thanks for the reviews! And yeah, definitely 7.3.
I will send this out again (rebased) when 7.3-rc1 opens, so that the reviews won't be on
outdated code, which could cause some confusion.