From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
    ljs@kernel.org, ziy@nvidia.com
Cc: ying.huang@linux.alibaba.com, Baoquan He, willy@infradead.org,
    youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
    shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
    dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com,
    Liam R. Howlett, ryan.roberts@arm.com, Vlastimil Babka,
    lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
    shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [v2 00/16] mm: PMD-level swap entries for anonymous THPs
Date: Tue, 2 Jun 2026 07:24:08 -0700
Message-ID: <20260602142537.198755-1-usama.arif@linux.dev>

When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is split
into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before unmap. This
series introduces a PMD-level swap entry. The huge mapping is preserved
across the swap round-trip, and do_huge_pmd_swap_page() resolves the entire
2 MB region in a single fault on swap-in; no khugepaged involvement is
needed. swap_map metadata is identical either way (512 single-slot counts),
so the PTE split buys nothing on the swap side; it is purely a page-table
representation change.
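The claim that swap_map metadata is identical either way can be illustrated
with a small user-space model. This is only a sketch under stated
assumptions: it assumes 4 KB base pages and 2 MB PMDs (512 PTEs per PMD), and
struct toy_swp_entry, struct toy_swap_map and every toy_*() function are
hypothetical names invented here. It does not reproduce the kernel's
swp_entry_t encoding or its swap_map implementation.

```c
#include <assert.h>
#include <string.h>

/* Assumption: 4 KB base pages and 2 MB PMDs, so one PMD spans 512 PTEs. */
#define TOY_PMD_NR    512UL
#define TOY_NR_SLOTS  2048UL

/* Hypothetical swap entry: swap device type plus starting slot offset. */
struct toy_swp_entry {
	unsigned int type;
	unsigned long offset;
};

/* Toy per-slot reference counts for a single swap device. */
struct toy_swap_map {
	unsigned char count[TOY_NR_SLOTS];
};

/* Take one reference on each of nr consecutive slots starting at entry. */
void toy_swap_dup_range(struct toy_swap_map *map, struct toy_swp_entry entry,
			unsigned long nr)
{
	assert(entry.offset + nr <= TOY_NR_SLOTS);
	for (unsigned long i = 0; i < nr; i++)
		map->count[entry.offset + i]++;
}

/* PTE representation: 512 entries, one single-slot reference per entry. */
void toy_account_as_ptes(struct toy_swap_map *map, struct toy_swp_entry first)
{
	for (unsigned long i = 0; i < TOY_PMD_NR; i++) {
		struct toy_swp_entry pte = { first.type, first.offset + i };

		toy_swap_dup_range(map, pte, 1);
	}
}

/* PMD representation: one entry that still covers the same 512 slots. */
void toy_account_as_pmd(struct toy_swap_map *map, struct toy_swp_entry first)
{
	toy_swap_dup_range(map, first, TOY_PMD_NR);
}

/* Returns 1 when both representations leave identical per-slot counts. */
int toy_maps_equal(const struct toy_swap_map *a, const struct toy_swap_map *b)
{
	return memcmp(a->count, b->count, sizeof(a->count)) == 0;
}
```

In this model, accounting the same starting entry through
toy_account_as_ptes() and toy_account_as_pmd() on two zeroed maps leaves
toy_maps_equal() true: only the number of page-table entries differs, not the
per-slot counts.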

This work was brought about after Hugh reported that one of the major
blockers for having lazy page table deposit is the lack of PMD swap
entries [1]. However, this series has benefits of its own:

- The huge mapping is restored on swap-in. Today even when the folio is
  still in swap cache as a single 2 MB folio, the swap-in path installs
  512 PTE mappings -- the PMD mapping is gone, the freshly-materialised
  PTE table sticks around, and only khugepaged can later collapse the
  range back into a THP. do_huge_pmd_swap_page() reinstalls the PMD
  mapping directly in one fault, no khugepaged involvement.

- Memory saved per swapped-out THP *once lazy page table deposit is
  merged* [2]. With lazy page table deposit [2], splitting a PMD into 512
  PTE swap entries forces allocation of a 4 KB PTE table page. The new
  path leaves the pgtable hierarchy at PMD level and avoids that
  allocation entirely. This will save memory when swapping, which is
  likely when there is memory pressure and exactly when allocations are
  most likely to fail.

- Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp) visit one
  PMD entry instead of 512 PTEs, reducing traversal time and lock-hold
  windows.

The swap entry value is identical to 512 PTE swap entries (same type, same
starting offset), so swap_map refcounting is unchanged. Only the page-table
representation differs; the swap slot allocator, swap I/O, and swap cache
are untouched.

The new path falls back to the existing PTE-split path whenever a PMD-order
resource is unavailable: zswap enabled, non-contiguous swap allocation
(THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in or
fork, racing folio split, or rmap-driven split on a swapcache folio.

Walkers that previously assumed every non-present PMD encodes a PFN
(migration / device_private) are taught to recognise PMD swap entries.
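
The walker change described above can be pictured with a toy classifier: a
non-present PMD may now be a swap entry that carries no PFN, so the folio
lookup must happen only after the entry type has been checked. This is a
hedged sketch, not the series' code: enum toy_leaf_kind, struct toy_softleaf
and the toy_*() functions are hypothetical, and they only loosely mirror the
role that helpers such as softleaf_has_pfn() and softleaf_to_folio() play in
the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kinds of non-present PMD entries in this model. */
enum toy_leaf_kind {
	TOY_LEAF_MIGRATION,
	TOY_LEAF_DEVICE_PRIVATE,
	TOY_LEAF_SWAP,
};

/* Stand-in for a folio; only its identity matters here. */
struct toy_folio {
	int id;
};

/* Toy non-present entry: folio is meaningful only for PFN-bearing kinds. */
struct toy_softleaf {
	enum toy_leaf_kind kind;
	struct toy_folio *folio;   /* valid for migration / device-private */
	unsigned long swp_offset;  /* valid for TOY_LEAF_SWAP */
};

/* Only migration and device-private entries encode a PFN in this model. */
bool toy_leaf_has_pfn(struct toy_softleaf leaf)
{
	return leaf.kind == TOY_LEAF_MIGRATION ||
	       leaf.kind == TOY_LEAF_DEVICE_PRIVATE;
}

/*
 * Walker sketch: check the entry kind first, and touch the folio field
 * only when the entry is known to carry a PFN. A swap entry yields NULL
 * instead of a bogus folio pointer.
 */
struct toy_folio *toy_walk_nonpresent_pmd(struct toy_softleaf leaf)
{
	if (!toy_leaf_has_pfn(leaf))
		return NULL;
	return leaf.folio;
}
```

The design point is the ordering: the type check guards the lookup, so adding
a new PFN-less entry kind cannot turn into a stray pointer dereference in
code written for migration or device-private entries.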

Patch breakdown:

The series is ordered to preserve git bisectability: every consumer of a
PMD swap entry (split, fork, swapoff, walkers, MADV_WILLNEED, UFFDIO_MOVE,
swap-in fault) lands before the producer. The swap-out path that actually
installs PMD swap entries is the very last functional patch (15), so no
intermediate commit can leave the kernel handling a PMD swap entry it does
not yet understand.

The first 6 patches are preparatory patches. Some of them (like the
softleaf_to_pmd() change in patch 1) are not strictly needed, but they are
done to hopefully improve code quality and so that the PMD swap entry
changes look well integrated with the rest of mm.

Prep patches:

1. mm: add softleaf_to_pmd() and convert existing callers
   PMD counterpart to softleaf_to_pte(); needed to construct a PMD from a
   swap entry in later patches.

2. mm: extract mm_prepare_for_swap_entries() helper
   Hoists the "register mm with swapoff" double-checked-locking pattern out
   of try_to_unmap_one() / copy_nonpresent_pte() so the PMD swap-out and
   PMD fork paths can reuse it without a third open-coded copy.

3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
   pagemap_pmd_range_thp() today calls softleaf_to_page() unconditionally;
   a PMD swap entry has no PFN and would crash it.

4. mm/huge_memory: move softleaf_to_folio() inside migration branch
   change_non_present_huge_pmd() today calls softleaf_to_folio() before
   branching on entry type, so a PMD swap entry would produce a bogus folio
   pointer that the migration-only code below would then dereference.

5. mm/migrate_device: move softleaf_to_folio() inside device-private branch
   migrate_vma_collect_pmd() has the same pre-check ordering issue as
   patch 4 in the migrate-device PMD walker; move the folio lookup inside
   the device-private check.

6. mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF
   The config gates the entire PMD softleaf machinery (migration,
   device-private, and now swap), not just migration; rename to match.
   Pure rename, no behavioural change.

Core patches:

7. PMD swap entry detection (pmd_is_swap_entry,
   softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive helpers
   (x86/arm64/s390/riscv/loongarch/powerpc).

8. __split_huge_pmd_locked() learns to split a PMD swap entry into 512 PTE
   swap entries, used as the fallback when a PMD-order resource is
   unavailable.

9. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry in one
   swap_dup_entries_direct(HPAGE_PMD_NR) call, with GFP_KERNEL retry on
   per-cluster table-allocation failure mirroring copy_pte_range().

10. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls the PMD;
    falls back to PTE-split + unuse_pte_range() on error.

11. Walker updates: zap_huge_pmd, change_huge_pmd,
    change_non_present_huge_pmd, move_soft_dirty_pmd, clear_soft_dirty_pmd,
    make_uffd_wp_pmd, smaps_pmd_entry, queue_folios_pmd (mempolicy),
    check_pmd_state (khugepaged), mincore_pte_range, and the
    madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd VM_BUG_ON
    extensions.

12. MADV_WILLNEED: swapin_walk_pmd_entry() reads the whole 2 MB folio in at
    PMD order via swapin_sync(BIT(HPAGE_PMD_ORDER)), so the subsequent
    fault hits do_huge_pmd_swap_page() and restores the THP mapping. A
    naive order-0 read-ahead would force the fault to split.

13. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap entry
    whole via a new move_swap_pmd() helper modeled on move_swap_pte().

14. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in one shot.
    Handles racing splits, SWP_STABLE_WRITES read-only mapping, immediate
    COW for write faults; falls back to PTE-split on any PMD-order resource
    shortfall.

15. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
    swapcache folios (when zswap is disabled), and try_to_unmap_one()
    installs one PMD swap entry via set_pmd_swap_entry() instead of
    splitting.

Testing:

16. selftests/mm: 13 tests covering swap-out/in, fork, fork+COW, repeated
    cycles, write fault, munmap, mprotect, mremap, pagemap, MADV_FREE,
    MADV_WILLNEED, UFFDIO_MOVE, swapoff.

Making PMD swap entries work with zswap is a project of its own and should
be done in a separate follow-up series.

The patches are on top of mm-new from 31 May
(415489ef1cdfe586b4992662bee65286d50232e6).

[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
[2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/

v1 -> v2: https://lore.kernel.org/all/20260427100553.2754667-1-usama.arif@linux.dev/
- Patch 1: convert two additional softleaf_to_pmd() callers that landed in
  mm-unstable since v1 (mm/debug_vm_pgtable.c, mm/migrate_device.c) (Dev)
- Patch 2: rename helper ensure_on_mmlist() to
  mm_prepare_for_swap_entries() to better describe its purpose (David)
- Patch 3: drop VM_WARN_ON_ONCE(!pmd_is_migration_entry) as Dev posted it
  as a separate patch.
- Patch 5 (new): move softleaf_to_folio() inside the device-private branch
  in migrate_vma_collect_pmd(); same class of fix as patch 4 but for the
  migrate-device PMD walker.
- Patch 6 (new): rename CONFIG_ARCH_ENABLE_THP_MIGRATION to
  CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF so the gate that now drives swap-entry
  support too is named for what it actually controls (PMD softleaf
  entries), not just migration. (Dev)
- Patch 7: add the missing pmd_swp_exclusive / mkexclusive /
  clear_exclusive helpers for powerpc.
- Patches 10 and 14: use upstream swapin_sync() (bundles
  swap_cache_alloc_folio + swap_read_folio + the -EEXIST race retry)
  instead of the bespoke swapin_alloc_pmd_folio() helper from v1;
  do_swap_page and shmem_swapin_folio use the same helper (Kairui)
- Patch 10: construct a stack vm_fault for the swapoff swap-in so the
  allocator can resolve a mempolicy, mirroring how the PTE swapoff path
  (unuse_pte_range) already does it.
- Patch 11: extend coverage to check_pmd_state() in khugepaged so a
  swapped-out PMD-mapped THP is treated as SCAN_PMD_MAPPED (matches the
  existing migration-entry handling). Route the pmd_trans_huge_lock()
  branch of mincore_pte_range() through mincore_swap() so a swapped-out
  PMD-mapped THP isn't reported as resident.
- Patch 12 (new): handle PMD swap entries in MADV_WILLNEED via
  swapin_sync(BIT(HPAGE_PMD_ORDER)); a naive order-0 read-ahead would force
  the subsequent fault to split.
- Patch 13: refuse UFFDIO_MOVE with -EBUSY if the swap-cache folio was
  split between swap-out and the move, matching move_pages_pte()'s
  rejection of large folios; otherwise only one of the 512 anon-rmaps
  would be re-anchored to dst_vma.
- Patch 16: alloc_fill_swap_thp() now uses the existing mmap_pmd_aligned()
  helper so tests don't flake/skip based on VA placement; new
  MADV_WILLNEED test that watches the PMD-order mTHP swpin counter;
  swapoff test restructured to use the kselftest_harness ASSERT cleanup
  blocks (no double swapoff, no verify-after-munmap).
- Collected Acks and Reviews.

Usama Arif (16):
  mm: add softleaf_to_pmd() and convert existing callers
  mm: extract mm_prepare_for_swap_entries() helper
  fs/proc: use softleaf_has_pfn() in pagemap PMD walker
  mm/huge_memory: move softleaf_to_folio() inside migration branch
  mm/migrate_device: move softleaf_to_folio() inside device-private
    branch
  mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF
  mm: add PMD swap entry detection support
  mm: add PMD swap entry splitting support
  mm: handle PMD swap entries in fork path
  mm: swap in PMD swap entries as whole THPs during swapoff
  mm: handle PMD swap entries in non-present PMD walkers
  mm: handle PMD swap entries in MADV_WILLNEED
  mm: handle PMD swap entries in UFFDIO_MOVE
  mm: handle PMD swap entry faults on swap-in
  mm: install PMD swap entries on swap-out
  selftests/mm: add PMD swap entry tests

 arch/arm64/Kconfig                           |   2 +-
 arch/arm64/include/asm/pgtable.h             |   8 +-
 arch/loongarch/Kconfig                       |   2 +-
 arch/loongarch/include/asm/pgtable.h         |  17 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  17 +-
 arch/powerpc/platforms/Kconfig.cputype       |   2 +-
 arch/riscv/Kconfig                           |   2 +-
 arch/riscv/include/asm/pgtable.h             |  23 +-
 arch/s390/Kconfig                            |   2 +-
 arch/s390/include/asm/pgtable.h              |  17 +-
 arch/x86/Kconfig                             |   2 +-
 arch/x86/include/asm/pgtable.h               |  17 +-
 fs/proc/task_mmu.c                           |  46 +-
 include/linux/huge_mm.h                      |  13 +-
 include/linux/leafops.h                      |  52 +-
 include/linux/pgtable.h                      |   2 +-
 include/linux/swap.h                         |   4 +-
 include/linux/swapops.h                      |   6 +-
 include/linux/vm_event_item.h                |   1 +
 mm/Kconfig                                   |   2 +-
 mm/debug_vm_pgtable.c                        |  12 +-
 mm/hmm.c                                     |   7 +-
 mm/huge_memory.c                             | 543 ++++++++++++++-
 mm/internal.h                                |  49 ++
 mm/khugepaged.c                              |   6 +
 mm/madvise.c                                 |  45 +-
 mm/memory.c                                  |  51 +-
 mm/mempolicy.c                               |   2 +
 mm/migrate.c                                 |   4 +-
 mm/migrate_device.c                          |  19 +-
 mm/mincore.c                                 |  14 +-
 mm/rmap.c                                    |  29 +-
 mm/swapfile.c                                | 152 ++++-
 mm/vmscan.c                                  |  14 +-
 mm/vmstat.c                                  |   1 +
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/pmd_swap.c        | 672 +++++++++++++++++++
 37 files changed, 1702 insertions(+), 156 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

-- 
2.52.0