From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
ljs@kernel.org, ziy@nvidia.com, linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He <baoquan.he@linux.dev>,
willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
Liam R. Howlett <liam@infradead.org>,
ryan.roberts@arm.com, Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH v3 10/11] mm: install PMD swap entries on swap-out
Date: Fri, 3 Jul 2026 10:38:27 -0700
Message-ID: <20260703173903.3789516-11-usama.arif@linux.dev>
In-Reply-To: <20260703173903.3789516-1-usama.arif@linux.dev>

Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap
entries before unmap, losing the huge mapping across the swap
round-trip and forcing khugepaged to rebuild it later. The contiguous
swap range was already secured when the folio was added to the swap
cache (a non-contiguous allocation would have split the folio earlier),
so the PMD can be replaced by a single PMD-level swap entry instead.

This patch mirrors the existing PTE swap-out path at PMD granularity:

- shrink_folio_list() no longer sets TTU_SPLIT_HUGE_PMD for
  PMD-mappable swapcache folios when CONFIG_THP_SWAP is enabled.
  zswap is handled by the PMD swap-in users: if any covered slot
  currently has a zswap entry, they split the PMD swap entry and fall
  back to the per-PTE path.

- try_to_unmap_one() now has a PMD branch that calls
  set_pmd_swap_entry() and moves HPAGE_PMD_NR pages from MM_ANONPAGES
  to MM_SWAPENTS before jumping to walk_done. If the installer fails,
  the walk is aborted. Callers that pass TTU_SPLIT_HUGE_PMD still take
  the split path.

- set_pmd_swap_entry() is the installer. Mirroring the PTE swap-out
  sequence at PMD granularity, it clears the present mapping (keeping
  the original value for rollback), bumps the swap_map refcount for
  the folio's 512 slots, moves the anon-exclusive state into the swap
  entry, propagates the PMD dirty bit to the folio so that modified
  data still gets written back, and installs a swap PMD that preserves
  the original soft-dirty / uffd-wp / exclusive bits. If the refcount
  or exclusive step fails, the present mapping is restored.
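
To make the installer's ordering and rollback easier to follow, here is
a small user-space model of the sequence described above. It is only a
sketch: every model_* name, the flag bits, and the `pinned` /
`swap_full` failure triggers are invented for illustration and do not
exist in the kernel. It also assumes that a successful share step
clears the page's exclusive marker, which then travels in the swap PMD.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_SLOTS 512	/* HPAGE_PMD_NR with 4K pages and 2M PMDs */

/* Hypothetical flag bits; real encodings are architecture-specific. */
#define MODEL_PRESENT		(1u << 0)
#define MODEL_DIRTY		(1u << 1)
#define MODEL_SOFT_DIRTY	(1u << 2)
#define MODEL_UFFD_WP		(1u << 3)
#define MODEL_SWAP		(1u << 4)
#define MODEL_EXCLUSIVE		(1u << 5)

struct model_folio {
	bool anon_exclusive;	/* stands in for PageAnonExclusive() */
	bool pinned;		/* makes the "share" step fail */
	bool swap_full;		/* makes the refcount bump fail */
	bool dirty;
	unsigned int swap_count[MODEL_NR_SLOTS];
};

/* Bump every slot's count, or fail without touching any of them. */
static int model_dup_swap(struct model_folio *f)
{
	if (f->swap_full)
		return -ENOMEM;
	for (int i = 0; i < MODEL_NR_SLOTS; i++)
		f->swap_count[i]++;
	return 0;
}

/* Undo model_dup_swap(). */
static void model_put_swap(struct model_folio *f)
{
	for (int i = 0; i < MODEL_NR_SLOTS; i++)
		f->swap_count[i]--;
}

/* Returns 0 and rewrites *pmd on success; restores *pmd on failure. */
static int model_set_pmd_swap_entry(uint32_t *pmd, struct model_folio *f)
{
	uint32_t old = *pmd;	/* kept for rollback */
	uint32_t swp = MODEL_SWAP;
	bool exclusive;
	int ret;

	assert(old & MODEL_PRESENT);
	*pmd = 0;			/* 1: clear the present mapping */

	ret = model_dup_swap(f);	/* 2: one reference per slot */
	if (ret < 0) {
		*pmd = old;
		return ret;
	}

	exclusive = f->anon_exclusive;	/* 3: move exclusive state */
	if (exclusive && f->pinned) {
		model_put_swap(f);
		*pmd = old;
		return -EBUSY;
	}
	f->anon_exclusive = false;

	if (old & MODEL_DIRTY)		/* 4: do not lose dirty data */
		f->dirty = true;

	/* 5: install the swap PMD, preserving the software bits */
	swp |= old & (MODEL_SOFT_DIRTY | MODEL_UFFD_WP);
	if (exclusive)
		swp |= MODEL_EXCLUSIVE;
	*pmd = swp;
	return 0;
}
```

In this model, a present, dirty, uffd-wp PMD of an exclusive folio ends
up as a swap PMD carrying the uffd-wp and exclusive bits, the folio is
marked dirty, and each slot has a count of 1. With `pinned` set, the
call returns -EBUSY and both the PMD and the slot counts are back at
their original values.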

The swap entry value matches what 512 PTE swap entries would encode, so
swap_map refcounting is unchanged: each of the 512 slots carries a
count of 1, released individually after a later split or together on
swap-in.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
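The refcounting claim in the changelog can be checked against a toy
model. The sketch below assumes a flat array with one counter per swap
slot, and that the PMD-level entry names the first of 512 consecutive
slots so that a split yields offsets base, base + 1, ..., base + 511.
The model_* helpers are hypothetical and only mirror the arithmetic,
not the kernel's swap code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PMD_NR 512	/* slots covered by one PMD-level entry */

/* Offset of the i-th PTE-level entry after splitting the entry at base. */
static unsigned long model_split_offset(unsigned long base, unsigned int i)
{
	assert(i < MODEL_PMD_NR);
	return base + i;
}

/* Installing the PMD-level entry takes one reference on every slot. */
static void model_install(unsigned char *swap_map, unsigned long base)
{
	for (unsigned int i = 0; i < MODEL_PMD_NR; i++)
		swap_map[model_split_offset(base, i)]++;
}

/* Release one slot, as a PTE-level entry would after a split. */
static void model_put_one(unsigned char *swap_map, unsigned long offset)
{
	assert(swap_map[offset] > 0);
	swap_map[offset]--;
}

/* Release the whole range at once, as a PMD-level swap-in would. */
static void model_put_all(unsigned char *swap_map, unsigned long base)
{
	for (unsigned int i = 0; i < MODEL_PMD_NR; i++)
		model_put_one(swap_map, model_split_offset(base, i));
}

/* Number of slots that still hold at least one reference. */
static size_t model_in_use(const unsigned char *swap_map, size_t nr)
{
	size_t used = 0;

	for (size_t i = 0; i < nr; i++)
		if (swap_map[i])
			used++;
	return used;
}
```

Either release path ends with the same per-slot counts, which is why
the model needs no separate bookkeeping for the PMD-level entry.
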
 include/linux/huge_mm.h       |  2 +
 include/linux/vm_event_item.h |  1 +
 mm/huge_memory.c              | 78 +++++++++++++++++++++++++++++++++++
 mm/rmap.c                     | 20 +++++++++
 mm/vmscan.c                   |  9 +++-
 mm/vmstat.c                   |  1 +
 6 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9ec475ccfc91..b746f8c8db69 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -533,6 +533,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
 #ifdef CONFIG_THP_SWAP
 vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio);
 #else
 static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_SWPOUT_PMD,
 #endif
 #ifdef CONFIG_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa60324a2f0..7ec81a9c4bc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5450,3 +5450,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ *        pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = folio_page(folio, 0);
+	bool anon_exclusive;
+	pmd_t pmdval;
+	swp_entry_t entry;
+	pmd_t pmdswp;
+
+	if (!(pvmw->pmd && !pvmw->pte))
+		return 0;
+
+	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+	if (unlikely(folio_test_swapbacked(folio) !=
+		     folio_test_swapcache(folio))) {
+		WARN_ON_ONCE(1);
+		return -EBUSY;
+	}
+
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	if (folio_dup_swap(folio, NULL) < 0) {
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -ENOMEM;
+	}
+
+	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+	anon_exclusive = PageAnonExclusive(page);
+	if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+		folio_put_swap(folio, NULL);
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -EBUSY;
+	}
+
+	if (pmd_dirty(pmdval))
+		folio_mark_dirty(folio);
+
+	entry = folio->swap;
+	pmdswp = softleaf_to_pmd(entry);
+	if (pmd_soft_dirty(pmdval))
+		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+	if (pmd_uffd_wp(pmdval))
+		pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+	if (anon_exclusive)
+		pmdswp = pmd_swp_mkexclusive(pmdswp);
+	set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+	folio_remove_rmap_pmd(folio, page, vma);
+	folio_put(folio);
+
+	count_vm_event(THP_SWPOUT_PMD);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 0fb7a1b82cf3..ffc7aa62a29e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2079,6 +2079,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto walk_abort;
 			}

+#ifdef CONFIG_THP_SWAP
+			/*
+			 * If the folio is in the swap cache and we're not
+			 * asked to split, install a PMD-level swap entry.
+			 */
+			if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+			    folio_test_anon(folio) &&
+			    folio_test_swapcache(folio)) {
+				if (set_pmd_swap_entry(&pvmw, folio))
+					goto walk_abort;
+
+				mm_prepare_for_swap_entries(mm);
+				add_mm_counter(mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_SWAPENTS,
+					       HPAGE_PMD_NR);
+				goto walk_done;
+			}
+#endif
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
 				 * We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56fe5393f30f..3d7999c3f1ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1321,7 +1321,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);

-			if (folio_test_pmd_mappable(folio))
+			/*
+			 * With THP_SWAP, PMD-mappable folios already in the
+			 * swap cache can be unmapped with a PMD-level swap
+			 * entry, avoiding the cost of splitting the PMD.
+			 */
+			if (folio_test_pmd_mappable(folio) &&
+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
+			      folio_test_swapcache(folio)))
 				flags |= TTU_SPLIT_HUGE_PMD;
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7b93fbf9af09..629055399987 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
 	[I(THP_ZERO_PAGE_ALLOC_FAILED)]		= "thp_zero_page_alloc_failed",
 	[I(THP_SWPOUT)]				= "thp_swpout",
 	[I(THP_SWPOUT_FALLBACK)]		= "thp_swpout_fallback",
+	[I(THP_SWPOUT_PMD)]			= "thp_swpout_pmd",
 #endif
 #ifdef CONFIG_BALLOON
 	[I(BALLOON_INFLATE)]			= "balloon_inflate",
--
2.53.0-Meta
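
With the vmstat_text[] entry above, the new event should show up in
/proc/vmstat as "thp_swpout_pmd" on kernels that build in the THP
counters. A minimal user-space sketch for reading it follows;
vmstat_lookup() is a hypothetical helper, not an existing API, and it
assumes the usual "name value" line format of /proc/vmstat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan a /proc/vmstat style stream for an exact counter name.
 * Returns the counter value, or -1 if the name is not present.
 */
static long long vmstat_lookup(FILE *fp, const char *name)
{
	char key[128];
	unsigned long long val;

	assert(fp && name);
	while (fscanf(fp, "%127s %llu", key, &val) == 2) {
		/* strcmp() keeps "thp_swpout" from matching "thp_swpout_pmd" */
		if (strcmp(key, name) == 0)
			return (long long)val;
	}
	return -1;
}

/* Convenience wrapper: open /proc/vmstat, look up one counter. */
static long long vmstat_read(const char *name)
{
	FILE *fp = fopen("/proc/vmstat", "r");
	long long val;

	if (!fp)
		return -1;
	val = vmstat_lookup(fp, name);
	fclose(fp);
	return val;
}
```

Comparing vmstat_read("thp_swpout_pmd") with vmstat_read("thp_swpout")
before and after a reclaim-heavy run gives a rough idea of how many
swapped-out THPs kept their PMD-level entry; a return value of -1 means
the running kernel does not export the counter.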
Thread overview: 14+ messages
2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
2026-07-03 17:38 ` [PATCH v3 02/11] mm: add PMD swap entry splitting support Usama Arif
2026-07-03 17:38 ` [PATCH v3 03/11] mm: handle PMD swap entries in fork path Usama Arif
2026-07-03 17:38 ` [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin Usama Arif
2026-07-03 17:38 ` [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-07-03 17:38 ` [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-07-03 17:38 ` [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED Usama Arif
2026-07-03 17:38 ` [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-07-03 17:38 ` [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-07-03 17:38 ` Usama Arif [this message]
2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
2026-07-04 6:27 ` kernel test robot
2026-07-04 8:30 ` kernel test robot