From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
dev.jain@arm.com, baolin.wang@linux.alibaba.com,
npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH 11/13] mm: handle PMD swap entry faults on swap-in
Date: Mon, 27 Apr 2026 03:02:00 -0700
Message-ID: <20260427100553.2754667-12-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>

Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault()
when vmf->orig_pmd encodes a swap entry. The handler resolves the
entire 2 MB mapping in one shot, mirroring do_swap_page() (PTE path)
at PMD granularity:
- Look up the folio in the swap cache; on a miss, allocate a
  PMD-order folio and read it from swap (shared with unuse_pmd_entry()
  via swapin_alloc_pmd_folio() in mm/swap_state.c).
- After locking, re-validate that the folio still corresponds to our
  entry and is still PMD-sized. Between the unlocked cache lookup
  and the lock, a racing swap-in on the same entry may have removed
  it from the cache via folio_free_swap(), or reclaim / memory_failure
  / deferred split may have split the folio into smaller folios.
- Restore soft-dirty and uffd-wp from the swap PMD. Map writable
  only when the entry was exclusive, the VMA permits writes, and
  uffd-wp is not armed. Drop the exclusive marker when the cached
  folio is under writeback to an SWP_STABLE_WRITES backend (zram,
  encrypted) so the PMD is mapped read-only; a later write then
  COWs into a fresh folio instead of corrupting the in-flight
  writeback. This mirrors do_swap_page(); see the sketch after
  this list.
- When the resulting PMD is read-only but the fault was a write,
  update vmf->orig_pmd and call wp_huge_pmd() from the same handler
  to COW immediately rather than forcing a second fault. Mask
  VM_FAULT_FALLBACK from its return: a PMD COW that splits to PTE
  level is a normal outcome, but the bit is part of VM_FAULT_ERROR,
  and arch fault handlers BUG() on an error that carries none of
  OOM/SIGBUS/HWPOISON/SIGSEGV. This requires exposing wp_huge_pmd()
  via mm/internal.h.
- Free the swap slot via should_try_to_free_swap() (hoisted from
  mm/memory.c into mm/internal.h so the PTE- and PMD-level swap-in
  paths share the heuristic).
When PMD-order resources are unavailable (folio allocation fails,
the cached folio was split, memcg charge fails, or swapin_folio()
races), split the PMD swap entry into 512 PTE swap entries via
__split_huge_pmd() and return 0. The fault then retries and
do_swap_page() takes over per-PTE. This avoids returning
VM_FAULT_OOM for transient PMD-order allocation failures.
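
The fallback itself reduces to (sketch of the handler's
split_fallback label below):

	__split_huge_pmd(vma, vmf->pmd, haddr, false);
	put_swap_device(si);
	return 0;	/* refault; do_swap_page() resolves each PTE */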
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 9 ++
mm/huge_memory.c | 197 ++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 36 ++++++++
mm/memory.c | 40 +-------
mm/swap_state.c | 2 +-
5 files changed, 247 insertions(+), 37 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..93ee6c36d6ea 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -522,6 +522,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+#ifdef CONFIG_THP_SWAP
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+#else
+static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+ return 0;
+}
+#endif
+
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bfcc9b274be7..141ab45adee4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2375,6 +2375,203 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+#ifdef CONFIG_THP_SWAP
+/**
+ * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry.
+ * @vmf: Fault context. vmf->orig_pmd contains the swap PMD.
+ *
+ * Looks up the folio in the swap cache, and if it is a PMD-sized folio,
+ * maps it directly at the PMD level. If the folio is not in the swap
+ * cache, allocates a PMD-sized folio and reads from swap. On allocation
+ * failure, splits the PMD swap entry into PTE-level entries and retries
+ * at PTE granularity.
+ *
+ * Return: VM_FAULT_* flags.
+ */
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio;
+ struct page *page;
+ struct swap_info_struct *si;
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ softleaf_t entry;
+ swp_entry_t swp_entry;
+ pmd_t pmd;
+ vm_fault_t ret = 0;
+ bool exclusive;
+ rmap_t rmap_flags = RMAP_NONE;
+
+ entry = softleaf_from_pmd(vmf->orig_pmd);
+ if (unlikely(!softleaf_is_swap(entry)))
+ return 0;
+
+ swp_entry = entry;
+
+ /* Prevent swapoff from happening to us. */
+ si = get_swap_device(swp_entry);
+ if (unlikely(!si))
+ return 0;
+
+ folio = swap_cache_get_folio(swp_entry);
+ if (!folio) {
+ folio = swapin_alloc_pmd_folio(swp_entry, mm);
+ if (!folio)
+ goto split_fallback;
+
+ /* Had to read from swap area: Major fault */
+ ret = VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(mm, PGMAJFAULT);
+ }
+
+ ret |= folio_lock_or_retry(folio, vmf);
+ if (ret & VM_FAULT_RETRY)
+ goto out_release;
+
+ /* Verify the folio is still in swap cache and matches our entry */
+ if (unlikely(!folio_matches_swap_entry(folio, swp_entry)))
+ goto out_page;
+
+ /*
+ * Folio should be PMD-sized; if not (e.g. split in swap cache),
+ * split the PMD swap entry and retry at PTE level.
+ */
+ if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+ folio_unlock(folio);
+ folio_put(folio);
+ goto split_fallback;
+ }
+
+ if (unlikely(!folio_test_uptodate(folio))) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_page;
+ }
+
+ page = folio_page(folio, 0);
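+ /* Restore arch-specific metadata (e.g. arm64 MTE tags) saved at swap-out. */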
+ arch_swap_restore(folio_swap(swp_entry, folio), folio);
+
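+ /*
+ * A just swapped-in folio may still sit in the per-CPU LRU-add batch,
+ * which holds an extra folio reference; drain it so the refcount-based
+ * exclusivity check below is not defeated by that transient reference.
+ */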
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio))
+ lru_add_drain();
+
+ folio_throttle_swaprate(folio, GFP_KERNEL);
+
+ /* Lock the PMD and verify it hasn't changed */
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) {
+ spin_unlock(vmf->ptl);
+ goto out_page;
+ }
+
+ exclusive = pmd_swp_exclusive(vmf->orig_pmd);
+
+ /*
+ * Some swap backends (e.g. zram) don't support concurrent page
+ * modifications while under writeback. If we map exclusive on such
+ * a backend while the folio is still under writeback, the writeback
+ * may see partial modifications and corrupt the swap slot. Drop the
+ * exclusive marker and only map R/O for that case; further GUP
+ * references can't appear once the page is fully unmapped, so this
+ * is safe.
+ */
+ if (exclusive && folio_test_writeback(folio) &&
+ data_race(si->flags & SWP_STABLE_WRITES))
+ exclusive = false;
+
+ /*
+ * Set up the PMD mapping. Similar to do_swap_page() but at PMD level.
+ */
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+ pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ pmd = pmd_mkyoung(pmd);
+
+ if (pmd_swp_soft_dirty(vmf->orig_pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(vmf->orig_pmd))
+ pmd = pmd_mkuffd_wp(pmd);
+
+ /*
+ * Check exclusivity to determine if we can map writable.
+ */
+ if (exclusive || folio_ref_count(folio) == 1) {
+ if ((vma->vm_flags & VM_WRITE) &&
+ !userfaultfd_huge_pmd_wp(vma, pmd) &&
+ !pmd_needs_soft_dirty_wp(vma, pmd)) {
+ pmd = pmd_mkwrite(pmd, vma);
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ pmd = pmd_mkdirty(pmd);
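+ /* The write fault is satisfied here; skip the wp_huge_pmd() call below. */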
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+ }
+ }
+ rmap_flags |= RMAP_EXCLUSIVE;
+ }
+
+ flush_icache_pages(vma, page, HPAGE_PMD_NR);
+
+ if (!folio_test_anon(folio))
+ folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags);
+ else
+ folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags);
+
+ folio_put_swap(folio, NULL);
+
+ set_pmd_at(mm, haddr, vmf->pmd, pmd);
+ update_mmu_cache_pmd(vma, haddr, vmf->pmd);
+
+ /* Update orig_pmd for any follow-up wp_huge_pmd() below. */
+ vmf->orig_pmd = pmd;
+
+ /*
+ * Conditionally try to free up the swap cache. Do it after mapping,
+ * so raced page faults will likely see the folio in swap cache and
+ * wait on the folio lock.
+ */
+ if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+ folio_free_swap(folio);
+
+ spin_unlock(vmf->ptl);
+
+ folio_unlock(folio);
+ put_swap_device(si);
+
+ /*
+ * If the write fault wasn't satisfied above (folio is shared without
+ * exclusivity), fall through to wp_huge_pmd to handle COW or
+ * userfaultfd-wp without forcing a second fault.
+ *
+ * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the
+ * PMD; that's a normal outcome — the natural PTE-level refault will
+ * complete the COW. Mask it so callers (and the arch fault handler)
+ * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR.
+ */
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ vm_fault_t wp_ret = wp_huge_pmd(vmf);
+
+ wp_ret &= ~VM_FAULT_FALLBACK;
+ ret |= wp_ret;
+ if (ret & VM_FAULT_ERROR)
+ ret &= VM_FAULT_ERROR;
+ }
+
+ return ret;
+
+out_page:
+ folio_unlock(folio);
+out_release:
+ folio_put(folio);
+ put_swap_device(si);
+ return ret;
+
+split_fallback:
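+ /* Split into HPAGE_PMD_NR PTE swap entries; the refault handles them. */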
+ __split_huge_pmd(vma, vmf->pmd, haddr, false);
+ put_swap_device(si);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
pgtable_t pgtable;
diff --git a/mm/internal.h b/mm/internal.h
index 7de489689f54..c522bff72688 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -508,6 +508,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
}
vm_fault_t do_swap_page(struct vm_fault *vmf);
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf);
+
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if swap
+ * count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+ struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned int extra_refs,
+ unsigned int fault_flags)
+{
+ if (!folio_test_swapcache(folio))
+ return false;
+ /*
+ * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
+ * cache can help save some IO or memory overhead, but these devices
+ * are fast, and meanwhile, swap cache pinning the slot deferring the
+ * release of metadata or fragmentation is a more critical issue.
+ */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ return true;
+ if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
+ folio_test_mlocked(folio))
+ return true;
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * user. Try freeing the swapcache to get rid of the swapcache
+ * reference only in case it's likely that we'll be the exclusive user.
+ */
+ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
+ folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+}
+
void folio_rotate_reclaimable(struct folio *folio);
bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
diff --git a/mm/memory.c b/mm/memory.c
index 8aa90afd601a..3006e1bc2bd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4481,40 +4481,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
-/*
- * Check if we should call folio_free_swap to free the swap cache.
- * folio_free_swap only frees the swap cache to release the slot if swap
- * count is zero, so we don't need to check the swap count here.
- */
-static inline bool should_try_to_free_swap(struct swap_info_struct *si,
- struct folio *folio,
- struct vm_area_struct *vma,
- unsigned int extra_refs,
- unsigned int fault_flags)
-{
- if (!folio_test_swapcache(folio))
- return false;
- /*
- * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
- * cache can help save some IO or memory overhead, but these devices
- * are fast, and meanwhile, swap cache pinning the slot deferring the
- * release of metadata or fragmentation is a more critical issue.
- */
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
- return true;
- if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
- folio_test_mlocked(folio))
- return true;
- /*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * user. Try freeing the swapcache to get rid of the swapcache
- * reference only in case it's likely that we'll be the exclusive user.
- */
- return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
-}
-
static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -6233,8 +6199,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
return VM_FAULT_FALLBACK;
}
-/* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
@@ -6518,6 +6483,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (pmd_is_migration_entry(vmf.orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
+ else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+ pmd_is_swap_entry(vmf.orig_pmd))
+ return do_huge_pmd_swap_page(&vmf);
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c2e8c76658f5..19c6759006bb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -592,7 +592,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
*
* Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, and
* issue the swap-in via swapin_folio(). Used by callers that need to map a
- * PMD swap entry as a whole THP (PMD swapoff).
+ * PMD swap entry as a whole THP (PMD swap-in fault and swapoff).
*
* Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in
* which case the caller should fall back to splitting the PMD).
--
2.52.0

Thread overview: 18+ messages
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif [this message]
2026-04-27 10:02 ` [PATCH 12/13] mm: install PMD swap entries on swap-out Usama Arif
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12 ` Usama Arif
2026-04-28 19:54 ` David Hildenbrand (Arm)