From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
dev.jain@arm.com, baolin.wang@linux.alibaba.com,
npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH 12/13] mm: install PMD swap entries on swap-out
Date: Mon, 27 Apr 2026 03:02:01 -0700 [thread overview]
Message-ID: <20260427100553.2754667-13-usama.arif@linux.dev> (raw)
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap
entries before unmap, losing the huge mapping across the swap
round-trip and forcing khugepaged to rebuild it later. The
contiguous swap range was already secured when the folio was added
to the swap cache (a non-contiguous allocation would have split the
folio earlier), so the PMD can be replaced by a single PMD-level
swap entry instead.
This patch mirrors the existing PTE swap-out path at PMD
granularity:
- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
swapcache folios, gated on zswap_never_enabled() since zswap
cannot reconstruct a 2 MB folio from per-page blobs (Best
to handle zswap case separately).
- try_to_unmap_one() now has a PMD branch that calls
set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by
HPAGE_PMD_NR before walk_done. TTU_SPLIT_HUGE_PMD remains the
fallback.
- set_pmd_swap_entry() is the installer. Mirroring the PTE
swap-out sequence at PMD granularity, it clears the present
mapping (keeping the original for rollback), bumps the swap_map
refcount for the folio's 512 slots, drops the exclusive mark if
the page was anon-exclusive, propagates the dirty bit to the
folio so writeback is not lost, and installs a swap PMD that
preserves the original soft-dirty / uffd-wp / exclusive bits.
Any failing step rolls back the present mapping.
The swap entry value matches what 512 PTE swap entries would
encode, so swap_map refcounting is unchanged: each of the 512 slots
carries a count of 1, released individually on later split or
together on swap-in.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 78 +++++++++++++++++++++++++++++++++++
mm/rmap.c | 20 +++++++++
mm/vmscan.c | 14 ++++++-
mm/vmstat.c | 1 +
6 files changed, 115 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93ee6c36d6ea..cbfac4720fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -524,6 +524,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
#ifdef CONFIG_THP_SWAP
vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio);
#else
static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
+ THP_SWPOUT_PMD,
#endif
#ifdef CONFIG_BALLOON
BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 141ab45adee4..47ff7fb9ee9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5497,3 +5497,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
trace_remove_migration_pmd(address, pmd_val(pmde));
}
#endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ * pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *page = folio_page(folio, 0);
+ bool anon_exclusive;
+ pmd_t pmdval;
+ swp_entry_t entry;
+ pmd_t pmdswp;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return 0;
+
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ WARN_ON_ONCE(1);
+ return -EBUSY;
+ }
+
+ flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+ pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (folio_dup_swap(folio, NULL) < 0) {
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -ENOMEM;
+ }
+
+ /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+ anon_exclusive = PageAnonExclusive(page);
+ if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+ folio_put_swap(folio, NULL);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -EBUSY;
+ }
+
+ if (pmd_dirty(pmdval))
+ folio_mark_dirty(folio);
+
+ entry = folio->swap;
+ pmdswp = softleaf_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ if (pmd_uffd_wp(pmdval))
+ pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ if (anon_exclusive)
+ pmdswp = pmd_swp_mkexclusive(pmdswp);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
+
+ count_vm_event(THP_SWPOUT_PMD);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 057e18cb80b0..b188213648c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2077,6 +2077,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_abort;
}
+#ifdef CONFIG_THP_SWAP
+ /*
+ * If the folio is in the swap cache and we're not
+ * asked to split, install a PMD-level swap entry.
+ */
+ if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+ folio_test_anon(folio) &&
+ folio_test_swapcache(folio)) {
+ if (set_pmd_swap_entry(&pvmw, folio))
+ goto walk_abort;
+
+ ensure_on_mmlist(mm);
+ add_mm_counter(mm, MM_ANONPAGES,
+ -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS,
+ HPAGE_PMD_NR);
+ goto walk_done;
+ }
+#endif
+
if (flags & TTU_SPLIT_HUGE_PMD) {
/*
* We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..e895aaade8f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
#include <linux/swapops.h>
#include <linux/sched/sysctl.h>
+#include <linux/zswap.h>
#include "internal.h"
#include "swap.h"
@@ -1330,7 +1331,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum ttu_flags flags = TTU_BATCH_FLUSH;
bool was_swapbacked = folio_test_swapbacked(folio);
- if (folio_test_pmd_mappable(folio))
+ /*
+ * With THP_SWAP, PMD-mappable folios already in the
+ * swap cache can be unmapped with a PMD-level swap
+ * entry, avoiding the cost of splitting the PMD.
+ * Skip this when zswap has been enabled because
+ * zswap stores pages individually and cannot
+ * reconstruct a large folio on swap-in.
+ */
+ if (folio_test_pmd_mappable(folio) &&
+ !(IS_ENABLED(CONFIG_THP_SWAP) &&
+ folio_test_swapcache(folio) &&
+ zswap_never_enabled()))
flags |= TTU_SPLIT_HUGE_PMD;
/*
* Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..9b4963a7eb04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
[I(THP_SWPOUT)] = "thp_swpout",
[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
+ [I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
#endif
#ifdef CONFIG_BALLOON
[I(BALLOON_INFLATE)] = "balloon_inflate",
--
2.52.0
next prev parent reply other threads:[~2026-04-27 10:07 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-04-27 10:02 ` Usama Arif [this message]
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12 ` Usama Arif
2026-04-28 19:54 ` David Hildenbrand (Arm)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260427100553.2754667-13-usama.arif@linux.dev \
--to=usama.arif@linux.dev \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alex@ghiti.fr \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=hannes@cmpxchg.org \
--cc=kas@kernel.org \
--cc=kasong@tencent.com \
--cc=kernel-team@meta.com \
--cc=lance.yang@linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=ljs@kernel.org \
--cc=npache@redhat.com \
--cc=nphamcs@gmail.com \
--cc=riel@surriel.com \
--cc=ryan.roberts@arm.com \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=vbabka@kernel.org \
--cc=willy@infradead.org \
--cc=youngjun.park@lge.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox