From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org,
    kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com,
    linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He, willy@infradead.org,
    youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
    shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org,
    baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com,
    npache@redhat.com, Liam R. Howlett, ryan.roberts@arm.com,
    Vlastimil Babka, lance.yang@linux.dev, linux-kernel@vger.kernel.org,
    nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com,
    Usama Arif
Subject: [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs
Date: Fri, 3 Jul 2026 10:38:17 -0700
Message-ID: <20260703173903.3789516-1-usama.arif@linux.dev>

This is the PMD swap entry core series. The preparatory PMD softleaf [1]
cleanup series has been merged in akpm/mm-new, so this series is now
based directly on akpm/mm-new.

When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
split into HPAGE_PMD_NR PTE-level swap entries via TTU_SPLIT_HUGE_PMD
before unmap. This series introduces a PMD-level swap entry so the huge
mapping can survive the swap round-trip and do_huge_pmd_swap_page() can
restore the PMD mapping directly on swap-in, without waiting for
khugepaged to collapse the range later.
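For a sense of scale, the number of PTE-level entries produced by a
split follows from the page-table geometry. The snippet below is only an
illustrative sketch, assuming 4 KiB base pages and 2 MiB PMD mappings
(the usual x86-64 configuration). The EX_-prefixed constants and the
ex_*() helpers are hypothetical names defined locally for the example;
they are not the kernel's definitions.

```c
/* Assumed geometry: 4 KiB base pages, 2 MiB PMD mappings. */
#define EX_PAGE_SHIFT 12
#define EX_PMD_SHIFT  21

/* Order of a PMD-sized folio: log2 of the base pages it spans. */
#define EX_HPAGE_PMD_ORDER (EX_PMD_SHIFT - EX_PAGE_SHIFT)

/* Base pages, and therefore swap slots, covered by one PMD. */
#define EX_HPAGE_PMD_NR (1UL << EX_HPAGE_PMD_ORDER)

/* Page-table entries written when every swapped-out THP is split. */
static unsigned long ex_entries_when_split(unsigned long nr_thps)
{
	return nr_thps * EX_HPAGE_PMD_NR;	/* one PTE swap entry per slot */
}

/* Page-table entries written when each THP keeps one PMD swap entry. */
static unsigned long ex_entries_with_pmd_swap(unsigned long nr_thps)
{
	return nr_thps;				/* one PMD swap entry per THP */
}
```

Under these assumptions EX_HPAGE_PMD_NR evaluates to 512, so four
swapped-out THPs need 2048 PTE swap entries when split and 4 PMD swap
entries otherwise. The number of swap slots consumed is the same in both
cases.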

The PMD swap entry is a compact page-table encoding for HPAGE_PMD_NR
consecutive swap slots. swap_map accounting remains per-slot and is
unchanged.

Importantly, a PMD swap entry does not promise that the swap cache
always contains one PMD-sized folio. While the cache is empty or
contains one PMD-sized folio, PMD-level handling can proceed. Once the
cache has split/per-slot state, users either inspect the individual
slots directly (mincore) or split the PMD swap entry and retry through
the PTE path (fault, swapoff, MADV_WILLNEED, UFFDIO_MOVE). Likewise, if
any slot is still backed by zswap's per-page store, PMD-order swap-in
consumers split and let the PTE path load the range page by page; an
all-on-disk range can still be read back as one PMD-sized folio.

The series is ordered so every consumer can handle PMD swap entries
before the swap-out producer starts installing them. The swap-out patch
is the last functional change.

Patch breakdown:

1. mm: add PMD swap entry detection support

    Add pmd_is_swap_entry(), teach the softleaf layer that PMD swap
    entries are valid PMD softleaf entries, and add per-arch
    pmd_swp_*exclusive helpers.

2. mm: add PMD swap entry splitting support

    Teach __split_huge_pmd_locked() to split a PMD swap entry into
    HPAGE_PMD_NR PTE swap entries. This is the common fallback path
    whenever PMD-level handling is unsafe or unavailable.

3. mm: handle PMD swap entries in fork path

    Copy a PMD swap entry in fork via one
    swap_dup_entries_direct(HPAGE_PMD_NR) operation, with the same
    swapoff/mmlist preparation rules used by PTE entries.

4. mm: zswap: add range lookup for large-folio swapin

    Teach zswap_load() to let all-on-disk large folio reads proceed to
    the backing swap device, and add zswap_range_has_entry() so PMD
    swap-entry consumers can split only when a specific range has
    per-page zswap state.

5. mm: swap in PMD swap entries as whole THPs during swapoff

    Add swap_pmd_cache_lookup() and use it from swapoff.
    Empty cache and PMD-sized cache state can be handled at PMD order;
    split cache state, zswap-backed slots, allocation/read failure, or
    non-uptodate folios split the PMD and fall back to
    unuse_pte_range().

6. mm: handle PMD swap entries in non-present PMD walkers

    Teach zap, mprotect, soft-dirty, uffd-wp, smaps, mincore, mempolicy,
    khugepaged, HMM, and madvise walkers about PMD swap entries. mincore
    reports PMD-sized cache state directly and checks per-page slots
    after the cache has split.

7. mm: handle PMD swap entries in MADV_WILLNEED

    Let MADV_WILLNEED prefetch a PMD swap entry at PMD order when safe,
    treat an already cached PMD-sized folio as complete, and split/retry
    through PTEs for split cache state, zswap-backed slots, or races
    with per-slot cache population.

8. mm: handle PMD swap entries in UFFDIO_MOVE

    Move PMD swap entries whole when the covered range is empty or
    backed by one PMD-sized folio. Split/per-slot cache state returns
    -EAGAIN after splitting so retry can use the PTE move path and
    update per-page rmap metadata.

9. mm: handle PMD swap entry faults on swap-in

    Add do_huge_pmd_swap_page(). It maps a PMD-sized cached folio
    directly, or allocates/reads at PMD order when the cache is empty
    and the range has no zswap entries. Split cache state, zswap-backed
    slots, and PMD-order resource failures fall back to the PTE path.

10. mm: install PMD swap entries on swap-out

    Stop forcing TTU_SPLIT_HUGE_PMD for PMD-mappable swapcache folios
    and install one PMD swap entry instead. Zswap still stores the THP
    as per-page compressed entries; PMD-order swap-in consumers preserve
    a PMD-sized cached folio or read an all-on-disk range as a whole
    THP, and split before reading per-page zswap state.

11. selftests/mm: add PMD swap entry tests

    Add pmd_swap selftests covering swap-out/in, fork, fork+COW,
    repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
    mincore, MADV_FREE, MADV_WILLNEED, UFFDIO_MOVE, and swapoff.
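The swap-in rule that recurs through the breakdown above (map a cached
PMD-sized folio, read at PMD order when the cache is empty and no slot
is in zswap, otherwise split and retry through PTEs) can be modelled in
a few lines of userspace C. This is only a sketch of the decision as
described in this cover letter, not the code in the patches: the enum
names, the ex_*() helpers, and the per-slot "folio order" array are all
hypothetical, and resource failures are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_HPAGE_PMD_ORDER 9			/* assumes 4 KiB pages, 2 MiB PMDs */
#define EX_HPAGE_PMD_NR (1u << EX_HPAGE_PMD_ORDER)

/* Hypothetical summary of the swap cache over one PMD-sized range. */
enum ex_cache_state { EX_CACHE_EMPTY, EX_CACHE_PMD_FOLIO, EX_CACHE_SPLIT };

/* Hypothetical outcomes for a PMD swap entry fault. */
enum ex_action { EX_MAP_CACHED_PMD, EX_READ_PMD_ORDER, EX_SPLIT_AND_RETRY };

/*
 * slot_order[i] is -1 when slot i has no cached folio, otherwise the
 * order of the folio that caches it.
 */
static enum ex_cache_state ex_classify(const int *slot_order, size_t nr)
{
	size_t empty = 0, pmd = 0;

	for (size_t i = 0; i < nr; i++) {
		if (slot_order[i] < 0)
			empty++;
		else if (slot_order[i] == EX_HPAGE_PMD_ORDER)
			pmd++;
	}
	if (empty == nr)
		return EX_CACHE_EMPTY;
	if (pmd == nr)
		return EX_CACHE_PMD_FOLIO;
	return EX_CACHE_SPLIT;		/* mixed or smaller folios */
}

static enum ex_action ex_fault_action(const int *slot_order,
				      bool range_has_zswap)
{
	switch (ex_classify(slot_order, EX_HPAGE_PMD_NR)) {
	case EX_CACHE_PMD_FOLIO:
		return EX_MAP_CACHED_PMD;	/* folio is already in memory */
	case EX_CACHE_EMPTY:
		/* A PMD-order read is only modelled for all-on-disk ranges. */
		return range_has_zswap ? EX_SPLIT_AND_RETRY : EX_READ_PMD_ORDER;
	default:
		return EX_SPLIT_AND_RETRY;	/* per-slot state: use the PTE path */
	}
}
```

In this model a single slot cached at order 0 is enough to turn the
whole range into EX_CACHE_SPLIT, which mirrors the statement that a PMD
swap entry is an encoding for HPAGE_PMD_NR slots rather than a promise
about the swap cache.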

Notes on zswap:

Native PMD-order zswap load/store is intentionally left for a follow-up.
Alexandre Ghiti is currently working on this. This series can still
preserve PMD swap entries while zswap is enabled: zswap stores the THP
as order-0 entries, and PMD-order swap-in consumers split any range that
has zswap entries before reading it. If zswap has written the whole
range back to disk, or the swap cache still contains one PMD-sized
folio, PMD-level handling can proceed.

v2 -> v3: https://lore.kernel.org/all/20260602142537.198755-1-usama.arif@linux.dev/
- Clarified the PMD swap entry rule: it is a compact encoding for
  HPAGE_PMD_NR swap slots, not a guarantee that swap cache always has
  one PMD-sized folio. (Lance Yang)
- Swapoff, fault, MADV_WILLNEED, and UFFDIO_MOVE now classify the whole
  PMD swap-cache range and split/retry through the PTE path for
  split/per-slot cache state. (Lance Yang)
- mincore handles PMD swap entries without assuming one lookup covers a
  split swap-cache range. (Lance Yang)
- UFFDIO_MOVE rechecks all HPAGE_PMD_NR slots before moving an empty PMD
  swap-cache range, avoiding stale rmap metadata for per-slot cached
  folios.
- Added a standalone zswap prerequisite patch from Alexandre that
  distinguishes all-on-disk large-folio ranges from ranges with per-page
  zswap entries.
- Replaced the global zswap-ever-enabled policy with per-range zswap
  checks: PMD swap entries can still be installed while zswap is
  enabled, and PMD-order swap-in consumers split when the range has
  per-page zswap state.
- Added a mincore selftest and updated MADV_WILLNEED coverage so the
  test checks that the PMD swap entry remains in place until first
  touch. Total pmd_swap coverage is now 14 tests.
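To illustrate the per-range zswap check described in the notes and
changelog above, here is a toy userspace model: one bit per swap offset
records whether zswap holds a compressed copy, and a range query decides
whether a PMD-order consumer has to split. It is a sketch under stated
assumptions only. struct ex_zswap and the ex_zswap_*() helpers are
hypothetical, and the signature and implementation of the series'
zswap_range_has_entry() are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_SLOTS 2048		/* size of the toy swap device, in slots */

/* Toy stand-in for zswap's per-page store: one bit per swap offset. */
struct ex_zswap {
	uint64_t present[EX_SLOTS / 64];
};

/* Record that zswap holds an order-0 entry for this offset. */
static void ex_zswap_store(struct ex_zswap *z, size_t off)
{
	z->present[off / 64] |= UINT64_C(1) << (off % 64);
}

/* Model writeback: the page now lives only on the backing device. */
static void ex_zswap_writeback(struct ex_zswap *z, size_t off)
{
	z->present[off / 64] &= ~(UINT64_C(1) << (off % 64));
}

/* True if any slot in [start, start + nr) still has a zswap entry. */
static bool ex_zswap_range_has_entry(const struct ex_zswap *z,
				     size_t start, size_t nr)
{
	for (size_t off = start; off < start + nr && off < EX_SLOTS; off++) {
		if ((z->present[off / 64] >> (off % 64)) & 1)
			return true;
	}
	return false;
}
```

A THP stored through zswap would set all 512 bits of its range, so the
query returns true and consumers split. Once every slot in the range has
been written back, the query returns false and the range can be read as
one PMD-sized folio.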

v1 -> v2: https://lore.kernel.org/all/20260427100553.2754667-1-usama.arif@linux.dev/
- Patch 1: convert two additional softleaf_to_pmd() callers that landed
  in mm-unstable since v1 (mm/debug_vm_pgtable.c, mm/migrate_device.c)
  (Dev)
- Patch 2: rename helper ensure_on_mmlist() to
  mm_prepare_for_swap_entries() to better describe its purpose (David)
- Patch 3: drop VM_WARN_ON_ONCE(!pmd_is_migration_entry) as Dev posted
  it as a separate patch.
- Patch 5 (new): move softleaf_to_folio() inside the device-private
  branch in migrate_vma_collect_pmd(); same class of fix as patch 4 but
  for the migrate-device PMD walker.
- Patch 6 (new): rename CONFIG_ARCH_ENABLE_THP_MIGRATION to
  CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF so the gate that now drives
  swap-entry support too is named for what it actually controls (PMD
  softleaf entries), not just migration. (Dev)
- Patch 7: add the missing pmd_swp_exclusive / mkexclusive /
  clear_exclusive helpers for powerpc.
- Patches 10 and 14: use upstream swapin_sync() (bundles
  swap_cache_alloc_folio + swap_read_folio + the -EEXIST race retry)
  instead of the bespoke swapin_alloc_pmd_folio() helper from v1;
  do_swap_page and shmem_swapin_folio use the same helper (Kairui)
- Patch 10: construct a stack vm_fault for the swapoff swap-in so the
  allocator can resolve a mempolicy, mirroring how the PTE swapoff path
  (unuse_pte_range) already does it.
- Patch 11: extend coverage to check_pmd_state() in khugepaged so a
  swapped-out PMD-mapped THP is treated as SCAN_PMD_MAPPED (matches the
  existing migration-entry handling). Route the pmd_trans_huge_lock()
  branch of mincore_pte_range() through mincore_swap() so a swapped-out
  PMD-mapped THP isn't reported as resident.
- Patch 12 (new): handle PMD swap entries in MADV_WILLNEED via
  swapin_sync(BIT(HPAGE_PMD_ORDER)); a naive order-0 read-ahead would
  force the subsequent fault to split.
- Patch 13: refuse UFFDIO_MOVE with -EBUSY if the swap-cache folio was
  split between swap-out and the move, matching move_pages_pte()'s
  rejection of large folios; otherwise only one of the 512 anon-rmaps
  would be re-anchored to dst_vma.
- Patch 16: alloc_fill_swap_thp() now uses the existing
  mmap_pmd_aligned() helper so tests don't flake/skip based on VA
  placement; new MADV_WILLNEED test that watches the PMD-order mTHP
  swpin counter; swapoff test restructured to use the kselftest_harness
  ASSERT cleanup blocks (no double swapoff, no verify-after-munmap).
- Collected Acks and Reviews.

[1] https://lore.kernel.org/all/20260630164143.1595669-1-usama.arif@linux.dev/

Alexandre Ghiti (1):
  mm: zswap: add range lookup for large-folio swapin

Usama Arif (10):
  mm: add PMD swap entry detection support
  mm: add PMD swap entry splitting support
  mm: handle PMD swap entries in fork path
  mm: swap in PMD swap entries as whole THPs during swapoff
  mm: handle PMD swap entries in non-present PMD walkers
  mm: handle PMD swap entries in MADV_WILLNEED
  mm: handle PMD swap entries in UFFDIO_MOVE
  mm: handle PMD swap entry faults on swap-in
  mm: install PMD swap entries on swap-out
  selftests/mm: add PMD swap entry tests

 arch/arm64/include/asm/pgtable.h             |   4 +
 arch/loongarch/include/asm/pgtable.h         |  17 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  15 +
 arch/riscv/include/asm/pgtable.h             |  15 +
 arch/s390/include/asm/pgtable.h              |  15 +
 arch/x86/include/asm/pgtable.h               |  15 +
 fs/proc/task_mmu.c                           |  43 +-
 include/linux/huge_mm.h                      |  11 +
 include/linux/leafops.h                      |  24 +-
 include/linux/pgtable.h                      |  17 +
 include/linux/swap.h                         |   4 +-
 include/linux/vm_event_item.h                |   1 +
 include/linux/zswap.h                        |   7 +
 mm/hmm.c                                     |   3 +-
 mm/huge_memory.c                             | 565 ++++++++++++++-
 mm/internal.h                                |  36 +
 mm/khugepaged.c                              |   6 +
 mm/madvise.c                                 |  89 ++-
 mm/memory.c                                  |  42 +-
 mm/mempolicy.c                               |   2 +
 mm/mincore.c                                 |  45 +-
 mm/rmap.c                                    |  20 +
 mm/swap.h                                    |  17 +
 mm/swap_state.c                              |  44 ++
 mm/swapfile.c                                | 161 ++++-
 mm/vmscan.c                                  |   9 +-
 mm/vmstat.c                                  |   1 +
 mm/zswap.c                                   |  46 +-
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/pmd_swap.c        | 702 +++++++++++++++++++
 30 files changed, 1878 insertions(+), 99 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

-- 
2.53.0-Meta