From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
    ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
    hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
    alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
    baolin.wang@linux.alibaba.com, npache@redhat.com,
    Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka,
    lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
    shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
Date: Mon, 27 Apr 2026 03:01:49 -0700
Message-ID: <20260427100553.2754667-1-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is split
into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before unmap.

This series introduces a PMD-level swap entry. The huge mapping is
preserved across the swap round-trip, and do_huge_pmd_swap_page() resolves
the entire 2 MB region in a single fault on swap-in; no khugepaged
involvement is needed.

swap_map metadata is identical either way (512 single-slot counts), so the
PTE split buys nothing on the swap side; it is purely a page-table
representation change.
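To make the "purely a page-table representation change" point concrete,
here is a minimal userspace sketch. It is not kernel code. It assumes
4 KB base pages and 2 MB PMDs, and it assumes the 512 PTE-level entries
occupy consecutive slots starting at the PMD entry's offset. Every name
with a toy_/TOY_ prefix is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 4 KB base pages, 2 MB PMD mappings. */
#define TOY_PAGE_SIZE    4096UL
#define TOY_PMD_SIZE     (2UL * 1024 * 1024)
#define TOY_PTRS_PER_PMD (TOY_PMD_SIZE / TOY_PAGE_SIZE) /* 512 */

/* Hypothetical stand-in for a swap entry: device type plus slot offset. */
struct toy_swp_entry {
	unsigned int type;
	uint64_t offset;
};

/*
 * Model of the split: one PMD-level entry becomes 512 PTE-level entries
 * with the same type and consecutive offsets.
 */
static void toy_split_pmd_entry(struct toy_swp_entry pmd,
				struct toy_swp_entry ptes[TOY_PTRS_PER_PMD])
{
	for (size_t i = 0; i < TOY_PTRS_PER_PMD; i++) {
		ptes[i].type = pmd.type;
		ptes[i].offset = pmd.offset + i;
	}
}

/* True if the 512 PTE-level entries name exactly the slots of @pmd. */
static bool toy_same_slots(struct toy_swp_entry pmd,
			   const struct toy_swp_entry *ptes)
{
	for (size_t i = 0; i < TOY_PTRS_PER_PMD; i++) {
		if (ptes[i].type != pmd.type ||
		    ptes[i].offset != pmd.offset + i)
			return false;
	}
	return true;
}
```

In this model, splitting an entry and then calling toy_same_slots() on
the result always returns true: both forms name the same 512 swap slots,
and only the number of page-table entries holding them differs.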
This work was prompted by Hugh's report that one of the major blockers for
lazy page table deposit is the lack of PMD swap entries [1]. However, this
series has benefits of its own:

- The huge mapping is restored on swap-in. Today, even when the folio is
  still in the swap cache as a single 2 MB folio, the swap-in path installs
  512 PTE mappings: the PMD mapping is gone, the freshly materialised PTE
  table sticks around, and only khugepaged can later collapse the range
  back into a THP. do_huge_pmd_swap_page() reinstalls the PMD mapping
  directly in one fault, with no khugepaged involvement.

- Memory is saved per swapped-out THP *once lazy page table deposit is
  merged* [2]. With lazy page table deposit, splitting a PMD into 512 PTE
  swap entries forces allocation of a 4 KB PTE table page. The new path
  leaves the pgtable hierarchy at PMD level and avoids that allocation
  entirely. This saves memory during swap-out, which typically happens
  under memory pressure, exactly when allocations are most likely to fail.

- Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp) visit one
  PMD entry instead of 512 PTEs, reducing traversal time and lock-hold
  windows.

The swap entry value is identical to 512 PTE swap entries (same type, same
starting offset), so swap_map refcounting is unchanged. Only the page-table
representation differs; the swap slot allocator, swap I/O, and swap cache
are untouched.

The new path falls back to the existing PTE-split path whenever a PMD-order
resource is unavailable: zswap enabled, non-contiguous swap allocation
(THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in or
fork, racing folio split, or rmap-driven split on a swapcache folio.

Walkers that previously assumed every non-present PMD encodes a PFN
(migration / device_private) are taught to recognise PMD swap entries.
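The fallback conditions listed above can be summarised as a single
predicate. The sketch below is a userspace model, not code from the
series. In the kernel these checks happen at different points (swap-out,
swap-in, fork), and collapsing them into one struct is a simplification
made here for illustration. All toy_/TOY_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Which page-table representation ends up being used. */
enum toy_swap_repr {
	TOY_REPR_PMD,       /* one PMD swap entry */
	TOY_REPR_PTE_SPLIT, /* existing path: 512 PTE swap entries */
};

/* Hypothetical snapshot of the conditions named in the cover letter. */
struct toy_swap_state {
	bool zswap_enabled;         /* zswap is not supported by this series */
	bool swap_slots_contiguous; /* false models THP_SWPOUT_FALLBACK */
	bool pmd_folio_available;   /* PMD-order folio allocation succeeded */
	bool folio_split_raced;     /* a concurrent folio split was observed */
	bool rmap_split_swapcache;  /* rmap-driven split on a swapcache folio */
};

/* Any missing PMD-order resource sends us down the PTE-split path. */
static enum toy_swap_repr toy_choose_repr(const struct toy_swap_state *s)
{
	if (s->zswap_enabled)
		return TOY_REPR_PTE_SPLIT;
	if (!s->swap_slots_contiguous)
		return TOY_REPR_PTE_SPLIT;
	if (!s->pmd_folio_available)
		return TOY_REPR_PTE_SPLIT;
	if (s->folio_split_raced || s->rmap_split_swapcache)
		return TOY_REPR_PTE_SPLIT;
	return TOY_REPR_PMD;
}
```

The point of the model is that the PTE-split path stays the safe default:
the PMD representation is chosen only when every condition holds.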
Patch breakdown:

The series is ordered to preserve git bisectability: every consumer of a
PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE, swap-in fault)
lands before the producer. The swap-out path that actually installs PMD
swap entries is the very last functional patch (12), so no intermediate
commit can leave the kernel handling a PMD swap entry it does not yet
understand.

The first 4 patches are preparatory. Some of them (like the
softleaf_to_pmd() change in patch 1) are not strictly needed, but are
included to hopefully improve code quality and so that the PMD swap entry
changes look well integrated with the rest of mm.

Prep patches:

1. mm: add softleaf_to_pmd() and convert existing callers
   PMD counterpart to softleaf_to_pte(); needed to construct a PMD from a
   swap entry in later patches.

2. mm: extract ensure_on_mmlist() helper
   Hoists the "register mm with swapoff" double-checked-locking pattern out
   of try_to_unmap_one() / copy_nonpresent_pte() so the PMD swap-out and
   PMD fork paths can reuse it without a third open-coded copy.

3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
   pagemap_pmd_range_thp() today calls softleaf_to_page() unconditionally;
   a PMD swap entry has no PFN and would crash it.

4. mm/huge_memory: move softleaf_to_folio() inside migration branch
   change_non_present_huge_pmd() today calls softleaf_to_folio() before
   branching on entry type, so a PMD swap entry would produce a bogus folio
   pointer that the migration-only code below would then dereference.

Core patches:

5. PMD swap entry detection (pmd_is_swap_entry,
   softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive helpers
   (x86/arm64/s390/riscv/loongarch).

6. __split_huge_pmd_locked() learns to split a PMD swap entry into 512 PTE
   swap entries, used as the fallback when a PMD-order resource is
   unavailable.

7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry in one
   folio_dup_swap() call, with GFP_KERNEL retry mirroring copy_pte_range().

8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls the PMD;
   falls back to PTE-split + unuse_pte_range() on error.

9. Walker updates: zap_huge_pmd, change_huge_pmd,
   change_non_present_huge_pmd, move_soft_dirty_pmd, clear_soft_dirty_pmd,
   make_uffd_wp_pmd, smaps_pmd_entry, queue_folios_pmd (mempolicy),
   check_pmd_state (khugepaged), and the
   madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd VM_BUG_ON
   extensions.

10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap entry
    whole via a new move_swap_pmd() helper modeled on move_swap_pte().

11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in one shot.
    Handles racing splits, SWP_STABLE_WRITES read-only mapping, immediate
    COW for write faults; falls back to PTE-split on any PMD-order resource
    shortfall.

12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
    swapcache folios (when zswap is disabled), and try_to_unmap_one()
    installs one PMD swap entry via set_pmd_swap_entry() instead of
    splitting.

Testing:

13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW, repeated
    cycles, write fault, munmap, mprotect, mremap, pagemap, MADV_FREE,
    UFFDIO_MOVE, swapoff.

Making PMD swap entries work with zswap is a project of its own and should
be handled in a separate follow-up series.

The patches are on top of mm-unstable from 23 April
(2bcc13c29c711381d815c1ba5d5b25737400c71a).
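For readers who want to poke at the swap-out/in case from userspace, a
rough sketch follows. It is not taken from the series' selftest, and
whether that selftest uses the same calls is not stated in this cover
letter. It assumes a 2 MB PMD size and 4 KB base pages, and treats
madvise(MADV_HUGEPAGE) and madvise(MADV_PAGEOUT) as best-effort hints:
on a system without THP or swap the function still passes, it just never
exercises the PMD swap path. The toy_/TOY_ names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Fallback values from the Linux UAPI headers, for older libc headers. */
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

#define TOY_PMD_SIZE  (2UL * 1024 * 1024) /* assumed PMD size */
#define TOY_PAGE_SIZE 4096UL              /* assumed base page size */

/*
 * Fill a 2 MB-aligned region, ask the kernel to reclaim it, then read
 * it back. Returns 0 if the data survived, -1 on setup failure or
 * data mismatch.
 */
static int toy_thp_swap_roundtrip(void)
{
	size_t len = 2 * TOY_PMD_SIZE; /* over-allocate so we can align */
	unsigned char *raw, *buf;
	int ret = 0;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	/* Round up to the next 2 MB boundary inside the mapping. */
	buf = (unsigned char *)(((uintptr_t)raw + TOY_PMD_SIZE - 1) &
				~(uintptr_t)(TOY_PMD_SIZE - 1));

	/* Hint only: the kernel may or may not back this with a THP. */
	(void)madvise(buf, TOY_PMD_SIZE, MADV_HUGEPAGE);
	memset(buf, 0xA5, TOY_PMD_SIZE); /* fault the region in */

	/* Hint only: may be a no-op without swap configured. */
	(void)madvise(buf, TOY_PMD_SIZE, MADV_PAGEOUT);

	/* Touch every base page; this faults swapped-out data back in. */
	for (size_t i = 0; i < TOY_PMD_SIZE; i += TOY_PAGE_SIZE) {
		if (buf[i] != 0xA5)
			ret = -1;
	}

	munmap(raw, len);
	return ret;
}
```

To see which path was taken on a given kernel, one would compare THP and
swap counters in /proc/vmstat before and after the call; the sketch itself
only checks data integrity.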
[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
[2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/

Usama Arif (13):
  mm: add softleaf_to_pmd() and convert existing callers
  mm: extract ensure_on_mmlist() helper
  fs/proc: use softleaf_has_pfn() in pagemap PMD walker
  mm/huge_memory: move softleaf_to_folio() inside migration branch
  mm: add PMD swap entry detection support
  mm: add PMD swap entry splitting support
  mm: handle PMD swap entries in fork path
  mm: swap in PMD swap entries as whole THPs during swapoff
  mm: handle PMD swap entries in non-present PMD walkers
  mm: handle PMD swap entries in UFFDIO_MOVE
  mm: handle PMD swap entry faults on swap-in
  mm: install PMD swap entries on swap-out
  selftests/mm: add PMD swap entry tests

 arch/arm64/include/asm/pgtable.h      |   4 +
 arch/loongarch/include/asm/pgtable.h  |  17 +
 arch/riscv/include/asm/pgtable.h      |  15 +
 arch/s390/include/asm/pgtable.h       |  15 +
 arch/x86/include/asm/pgtable.h        |  15 +
 fs/proc/task_mmu.c                    |  47 +-
 include/linux/huge_mm.h               |  11 +
 include/linux/leafops.h               |  44 +-
 include/linux/swap.h                  |   4 +-
 include/linux/vm_event_item.h         |   1 +
 mm/hmm.c                              |   3 +-
 mm/huge_memory.c                      | 540 +++++++++++++++++++++--
 mm/internal.h                         |  49 +++
 mm/khugepaged.c                       |   6 +
 mm/madvise.c                          |   5 +-
 mm/memory.c                           |  51 +--
 mm/mempolicy.c                        |   2 +
 mm/rmap.c                             |  27 +-
 mm/swap.h                             |   7 +
 mm/swap_state.c                       |  35 ++
 mm/swapfile.c                         | 144 +++++-
 mm/vmscan.c                           |  14 +-
 mm/vmstat.c                           |   1 +
 tools/testing/selftests/mm/Makefile   |   1 +
 tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
 25 files changed, 1554 insertions(+), 111 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

--
2.52.0