From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
ljs@kernel.org, ziy@nvidia.com, linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He <baoquan.he@linux.dev>,
willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
Liam R. Howlett <liam@infradead.org>,
ryan.roberts@arm.com, Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff
Date: Fri, 3 Jul 2026 10:38:22 -0700
Message-ID: <20260703173903.3789516-6-usama.arif@linux.dev>
In-Reply-To: <20260703173903.3789516-1-usama.arif@linux.dev>
Add swap_pmd_cache_lookup() to classify the swap cache behind a PMD
swap entry as empty, backed by one PMD-sized folio, or requiring
per-page handling because at least one covered slot has a smaller folio
in the swap cache. PMD swap entries are handled at PMD granularity only
while the covered cache range is empty or backed by a PMD-sized folio;
a split cache forces the entry to be split and retried through the PTE
path.

Add unuse_pmd() and unuse_pmd_entry(), and call the latter from
unuse_pmd_range() to swap in PMD-level swap entries as whole THPs
during swapoff. This mirrors the existing unuse_pte_range() but
operates at PMD granularity.

If the PMD-order folio cannot be allocated or read, zswap holds an
entry in the covered range, the swap cache already contains per-page
folios in the covered range (e.g. split in the swap cache by
deferred_split_scan() or memory_failure() while the PMD swap entry was
installed), or the folio is not uptodate, the PMD swap entry is split
into PTE-level entries via __split_huge_pmd() and a non-zero error is
returned so unuse_pmd_range() falls through to unuse_pte_range(), which
handles the individual entries at order-0.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
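The three-way classification described in the commit message can be
modelled in user space roughly as follows. This is only a sketch for
reviewers, not kernel code: the model_* names, the folio_pages[] array
(pages in the folio cached at each slot, 0 if none) and the value 512
for HPAGE_PMD_NR (4K base pages with 2M PMDs) are assumptions made for
illustration.

```c
#include <stddef.h>

/* Assumption: 4K base pages and 2M PMDs, so HPAGE_PMD_NR would be 512. */
#define MODEL_PMD_NR 512u

enum model_pmd_cache {
	MODEL_PMD_CACHE_EMPTY,	/* no slot in the range has a cached folio */
	MODEL_PMD_CACHE_HUGE,	/* one PMD-sized folio covers the range */
	MODEL_PMD_CACHE_SPLIT,	/* at least one smaller folio is cached */
};

/*
 * Hypothetical stand-in for the swap cache: folio_pages[i] is the number
 * of pages in the folio cached at slot i, or 0 if nothing is cached.
 */
static enum model_pmd_cache
model_pmd_cache_lookup(const unsigned int *folio_pages, size_t first)
{
	size_t i;

	/* The first slot decides between HUGE and SPLIT when it is cached. */
	if (folio_pages[first])
		return folio_pages[first] == MODEL_PMD_NR ?
			MODEL_PMD_CACHE_HUGE : MODEL_PMD_CACHE_SPLIT;

	/* Any folio behind a later slot means the range was split. */
	for (i = 1; i < MODEL_PMD_NR; i++) {
		if (folio_pages[first + i])
			return MODEL_PMD_CACHE_SPLIT;
	}

	return MODEL_PMD_CACHE_EMPTY;
}
```

As in the patch, only the EMPTY and HUGE results allow the entry to be
handled at PMD granularity; SPLIT sends it down the PTE path.
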
 mm/swap.h       |  17 ++++++
 mm/swap_state.c |  44 ++++++++++++++
 mm/swapfile.c   | 154 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 215 insertions(+)
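
The split-and-fall-through behaviour described in the commit message
can be pictured with the simplified user-space model below. It is a
sketch under stated assumptions, not the kernel implementation: the
model_* names and the struct fields are hypothetical, HPAGE_PMD_NR is
assumed to be 512, and the -EAGAIN cases in which unuse_pmd() returns
without splitting are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: 4K base pages and 2M PMDs, so HPAGE_PMD_NR would be 512. */
#define MODEL_PMD_NR 512u

/* Hypothetical state for one PMD-sized range being unused by swapoff. */
struct model_pmd {
	bool pmd_swap_entry;		/* one PMD-level swap entry installed */
	bool cache_split;		/* swap cache holds smaller folios */
	bool alloc_ok;			/* PMD-order allocation/read succeeds */
	bool uptodate;			/* folio contents are valid */
	unsigned int pte_entries;	/* PTE-level entries after a split */
	unsigned int order0_swapins;	/* pages swapped in one at a time */
	unsigned int mapped;		/* pages mapped so far */
};

/* Stand-in for __split_huge_pmd() on a PMD swap entry. */
static void model_split_pmd(struct model_pmd *p)
{
	p->pmd_swap_entry = false;
	p->pte_entries = MODEL_PMD_NR;
}

/* Returns 0 if the whole THP was mapped, else splits and returns -errno. */
static int model_unuse_pmd_entry(struct model_pmd *p)
{
	int err = 0;

	if (p->cache_split)
		err = -EAGAIN;
	else if (!p->alloc_ok)
		err = -ENOMEM;
	else if (!p->uptodate)
		err = -EIO;

	if (err) {
		model_split_pmd(p);
		return err;
	}

	p->pmd_swap_entry = false;
	p->mapped += MODEL_PMD_NR;
	return 0;
}

/* Order-0 path: swap in whatever PTE-level entries exist. */
static void model_unuse_pte_range(struct model_pmd *p)
{
	p->order0_swapins += p->pte_entries;
	p->mapped += p->pte_entries;
	p->pte_entries = 0;
}

/* Try the PMD path first; any failure falls through to the PTE path. */
static void model_unuse_pmd_range(struct model_pmd *p)
{
	if (p->pmd_swap_entry && !model_unuse_pmd_entry(p))
		return;
	model_unuse_pte_range(p);
}
```

In this model every failure of the PMD path still ends with all 512
pages mapped; the only difference is that they arrive through the
order-0 path instead of as one THP.
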
diff --git a/mm/swap.h b/mm/swap.h
index 44ab8e1e595b..17c2c57e0da4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -303,6 +303,23 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 bool swap_cache_has_folio(swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
+enum swap_pmd_cache {
+	SWAP_PMD_CACHE_EMPTY,
+	SWAP_PMD_CACHE_HUGE,
+	SWAP_PMD_CACHE_SPLIT,
+};
+
+#ifdef CONFIG_THP_SWAP
+enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+					  struct folio **foliop);
+#else
+static inline enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+							struct folio **foliop)
+{
+	*foliop = NULL;
+	return SWAP_PMD_CACHE_EMPTY;
+}
+#endif
 void swap_cache_del_folio(struct folio *folio);
 struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
 				     unsigned long orders, struct vm_fault *vmf,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6fd6e3415b71..9b9ca82ace4b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -118,6 +118,50 @@ bool swap_cache_has_folio(swp_entry_t entry)
 	return swp_tb_is_folio(swp_tb);
 }
 
+#ifdef CONFIG_THP_SWAP
+/**
+ * swap_pmd_cache_lookup - classify the swap cache behind a PMD swap entry
+ * @entry: first swap slot encoded by the PMD swap entry
+ * @foliop: returned PMD-sized folio, with a reference, if present
+ *
+ * A PMD swap entry is a compact page-table encoding for HPAGE_PMD_NR
+ * consecutive swap slots. The swap cache behind those slots can be empty,
+ * one PMD-sized folio, or per-slot folios after the original folio was split.
+ *
+ * Context: Caller must keep @entry valid using the usual swap cache rules.
+ * Return: SWAP_PMD_CACHE_EMPTY if no slot in the PMD range has a cached folio,
+ * SWAP_PMD_CACHE_HUGE if one PMD-sized folio covers the range, or
+ * SWAP_PMD_CACHE_SPLIT if the range needs per-page handling.
+ */
+enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+					  struct folio **foliop)
+{
+	unsigned int type = swp_type(entry);
+	pgoff_t offset = swp_offset(entry);
+	struct folio *folio;
+	int i;
+
+	*foliop = NULL;
+
+	folio = swap_cache_get_folio(entry);
+	if (folio) {
+		if (folio_nr_pages(folio) == HPAGE_PMD_NR) {
+			*foliop = folio;
+			return SWAP_PMD_CACHE_HUGE;
+		}
+		folio_put(folio);
+		return SWAP_PMD_CACHE_SPLIT;
+	}
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		if (swap_cache_has_folio(swp_entry(type, offset + i)))
+			return SWAP_PMD_CACHE_SPLIT;
+	}
+
+	return SWAP_PMD_CACHE_EMPTY;
+}
+#endif
+
 /**
  * swap_cache_get_shadow - Looks up a shadow in the swap cache.
  * @entry: swap entry used for the lookup.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0695dbd1a8b1..664956da60c8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
 #include <linux/suspend.h>
 #include <linux/zswap.h>
 #include <linux/plist.h>
+#include <linux/huge_mm.h>
 
 #include <asm/tlbflush.h>
 #include <linux/leafops.h>
@@ -2641,6 +2642,147 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return 0;
 }
 
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio. Returns 0 on success
+ * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse will
+ * rescan). Returns -EIO if the folio is not uptodate; in that case the
+ * PMD is split so unuse_pte_range() can handle individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, softleaf_t entry,
+		     struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pmd_t new_pmd, old_pmd;
+	spinlock_t *ptl;
+	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive;
+
+	if (unlikely(!folio_matches_swap_entry(folio, entry)))
+		return -EAGAIN;
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		__split_huge_pmd(vma, pmd, addr, false);
+		return -EIO;
+	}
+
+	page = folio_page(folio, 0);
+
+	ptl = pmd_lock(mm, pmd);
+	old_pmd = pmdp_get(pmd);
+
+	if (!pmd_is_swap_entry(old_pmd) ||
+	    softleaf_from_pmd(old_pmd).val != entry.val) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	exclusive = pmd_swp_exclusive(old_pmd);
+
+	/*
+	 * Some architectures may have to restore extra metadata to the folio
+	 * when reading from swap. This metadata may be indexed by swap entry
+	 * so this must be called before folio_put_swap().
+	 */
+	arch_swap_restore(folio_swap(entry, folio), folio);
+
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	new_pmd = pmd_mkold(new_pmd);
+	if (pmd_swp_soft_dirty(old_pmd))
+		new_pmd = pmd_mksoft_dirty(new_pmd);
+	if (pmd_swp_uffd_wp(old_pmd))
+		new_pmd = pmd_mkuffd_wp(new_pmd);
+
+	if (exclusive)
+		rmap_flags |= RMAP_EXCLUSIVE;
+
+	folio_get(folio);
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+	set_pmd_at(mm, addr, pmd, new_pmd);
+	folio_put_swap(folio, NULL);
+
+	spin_unlock(ptl);
+
+	folio_free_swap(folio);
+	return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * If the swap cache no longer has one PMD-sized folio, zswap may require
+ * per-page loading, or a PMD-order allocation/read fails, split the PMD so
+ * the caller can fall back to unuse_pte_range(). Otherwise propagates the
+ * error from unuse_pmd().
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+			   unsigned long addr, softleaf_t entry)
+{
+	struct folio *folio;
+	enum swap_pmd_cache cache_state;
+	int ret;
+
+	cache_state = swap_pmd_cache_lookup(entry, &folio);
+	if (cache_state == SWAP_PMD_CACHE_SPLIT) {
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	if (!folio) {
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
+		if (zswap_range_has_entry(entry, HPAGE_PMD_NR)) {
+			ret = -EAGAIN;
+			goto split_fallback;
+		}
+
+		folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+				    BIT(HPAGE_PMD_ORDER), &vmf, NULL, 0);
+		if (IS_ERR_OR_NULL(folio)) {
+			ret = folio ? PTR_ERR(folio) : -ENOMEM;
+			goto split_fallback;
+		}
+	}
+
+	folio_lock(folio);
+	folio_wait_writeback(folio);
+	/*
+	 * If the cached folio is no longer PMD-sized (e.g. split in the
+	 * swap cache by deferred_split_scan() or memory_failure() while
+	 * the PMD swap entry was installed), the PMD swap entry no longer
+	 * maps a single contiguous folio. Split the PMD swap entry so
+	 * unuse_pte_range() can swap the per-slot folios in individually.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	ret = unuse_pmd(vma, pmd, addr, entry, folio);
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, pmd, addr, false);
+	return ret;
+}
+
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 				  unsigned long addr, unsigned long end,
 				  unsigned int type)
@@ -2653,6 +2795,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+
+		pmd_t pmdval = pmdp_get(pmd);
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t sl = softleaf_from_pmd(pmdval);
+
+			if (swp_type(sl) == type) {
+				if (!unuse_pmd_entry(vma, pmd, addr, sl))
+					continue;
+			}
+		}
+
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
 			return ret;
--
2.53.0-Meta