From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
dev.jain@arm.com, baolin.wang@linux.alibaba.com,
npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH 12/13] mm: install PMD swap entries on swap-out
Date: Mon, 27 Apr 2026 03:02:01 -0700 [thread overview]
Message-ID: <20260427100553.2754667-13-usama.arif@linux.dev> (raw)
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap
entries before unmap, losing the huge mapping across the swap
round-trip and forcing khugepaged to rebuild it later. The
contiguous swap range was already secured when the folio was added
to the swap cache (a non-contiguous allocation would have split the
folio earlier), so the PMD can be replaced by a single PMD-level
swap entry instead.
This patch mirrors the existing PTE swap-out path at PMD
granularity:
- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
swapcache folios, gated on zswap_never_enabled() since zswap
cannot reconstruct a 2 MB folio from per-page blobs (Best
to handle zswap case separately).
- try_to_unmap_one() now has a PMD branch that calls
set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by
HPAGE_PMD_NR before walk_done. TTU_SPLIT_HUGE_PMD remains the
fallback.
- set_pmd_swap_entry() is the installer. Mirroring the PTE
swap-out sequence at PMD granularity, it clears the present
mapping (keeping the original for rollback), bumps the swap_map
refcount for the folio's 512 slots, drops the exclusive mark if
the page was anon-exclusive, propagates the dirty bit to the
folio so writeback is not lost, and installs a swap PMD that
preserves the original soft-dirty / uffd-wp / exclusive bits.
Any failing step rolls back the present mapping.
The swap entry value matches what 512 PTE swap entries would
encode, so swap_map refcounting is unchanged: each of the 512 slots
carries a count of 1, released individually on later split or
together on swap-in.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 78 +++++++++++++++++++++++++++++++++++
mm/rmap.c | 20 +++++++++
mm/vmscan.c | 14 ++++++-
mm/vmstat.c | 1 +
6 files changed, 115 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93ee6c36d6ea..cbfac4720fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -524,6 +524,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
#ifdef CONFIG_THP_SWAP
vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio);
#else
static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
+ THP_SWPOUT_PMD,
#endif
#ifdef CONFIG_BALLOON
BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 141ab45adee4..47ff7fb9ee9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5497,3 +5497,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
trace_remove_migration_pmd(address, pmd_val(pmde));
}
#endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ * pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *page = folio_page(folio, 0);
+ bool anon_exclusive;
+ pmd_t pmdval;
+ swp_entry_t entry;
+ pmd_t pmdswp;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return 0;
+
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ WARN_ON_ONCE(1);
+ return -EBUSY;
+ }
+
+ flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+ pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (folio_dup_swap(folio, NULL) < 0) {
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -ENOMEM;
+ }
+
+ /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+ anon_exclusive = PageAnonExclusive(page);
+ if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+ folio_put_swap(folio, NULL);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+ return -EBUSY;
+ }
+
+ if (pmd_dirty(pmdval))
+ folio_mark_dirty(folio);
+
+ entry = folio->swap;
+ pmdswp = softleaf_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ if (pmd_uffd_wp(pmdval))
+ pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ if (anon_exclusive)
+ pmdswp = pmd_swp_mkexclusive(pmdswp);
+ set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
+
+ count_vm_event(THP_SWPOUT_PMD);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 057e18cb80b0..b188213648c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2077,6 +2077,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_abort;
}
+#ifdef CONFIG_THP_SWAP
+ /*
+ * If the folio is in the swap cache and we're not
+ * asked to split, install a PMD-level swap entry.
+ */
+ if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+ folio_test_anon(folio) &&
+ folio_test_swapcache(folio)) {
+ if (set_pmd_swap_entry(&pvmw, folio))
+ goto walk_abort;
+
+ ensure_on_mmlist(mm);
+ add_mm_counter(mm, MM_ANONPAGES,
+ -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS,
+ HPAGE_PMD_NR);
+ goto walk_done;
+ }
+#endif
+
if (flags & TTU_SPLIT_HUGE_PMD) {
/*
* We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..e895aaade8f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
#include <linux/swapops.h>
#include <linux/sched/sysctl.h>
+#include <linux/zswap.h>
#include "internal.h"
#include "swap.h"
@@ -1330,7 +1331,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum ttu_flags flags = TTU_BATCH_FLUSH;
bool was_swapbacked = folio_test_swapbacked(folio);
- if (folio_test_pmd_mappable(folio))
+ /*
+ * With THP_SWAP, PMD-mappable folios already in the
+ * swap cache can be unmapped with a PMD-level swap
+ * entry, avoiding the cost of splitting the PMD.
+ * Skip this when zswap has been enabled because
+ * zswap stores pages individually and cannot
+ * reconstruct a large folio on swap-in.
+ */
+ if (folio_test_pmd_mappable(folio) &&
+ !(IS_ENABLED(CONFIG_THP_SWAP) &&
+ folio_test_swapcache(folio) &&
+ zswap_never_enabled()))
flags |= TTU_SPLIT_HUGE_PMD;
/*
* Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..9b4963a7eb04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
[I(THP_SWPOUT)] = "thp_swpout",
[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
+ [I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
#endif
#ifdef CONFIG_BALLOON
[I(BALLOON_INFLATE)] = "balloon_inflate",
--
2.52.0
next prev parent reply other threads:[~2026-04-27 10:07 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-05-13 19:24 ` David Hildenbrand (Arm)
2026-05-29 7:20 ` Dev Jain
2026-05-29 14:47 ` Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
2026-05-13 13:32 ` David Hildenbrand (Arm)
2026-05-13 17:21 ` Usama Arif
2026-05-13 19:22 ` David Hildenbrand (Arm)
2026-05-29 7:42 ` Dev Jain
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
2026-05-13 13:35 ` David Hildenbrand (Arm)
2026-05-29 9:34 ` Dev Jain
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
2026-05-13 19:25 ` David Hildenbrand (Arm)
2026-05-29 11:31 ` Dev Jain
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
2026-05-30 8:06 ` Dev Jain
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
2026-05-30 10:52 ` Dev Jain
2026-06-02 12:59 ` Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-05-26 19:44 ` Alexandre Ghiti
2026-05-29 14:49 ` Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-04-27 10:02 ` Usama Arif [this message]
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12 ` Usama Arif
2026-04-29 12:57 ` Zi Yan
2026-04-28 19:54 ` David Hildenbrand (Arm)
2026-04-29 9:39 ` Usama Arif
2026-04-29 12:52 ` Lorenzo Stoakes
2026-04-29 10:44 ` Kairui Song
2026-04-30 10:38 ` Usama Arif
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260427100553.2754667-13-usama.arif@linux.dev \
--to=usama.arif@linux.dev \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alex@ghiti.fr \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=hannes@cmpxchg.org \
--cc=kas@kernel.org \
--cc=kasong@tencent.com \
--cc=kernel-team@meta.com \
--cc=lance.yang@linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=ljs@kernel.org \
--cc=npache@redhat.com \
--cc=nphamcs@gmail.com \
--cc=riel@surriel.com \
--cc=ryan.roberts@arm.com \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=vbabka@kernel.org \
--cc=willy@infradead.org \
--cc=youngjun.park@lge.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.