From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
dev.jain@arm.com, baolin.wang@linux.alibaba.com,
npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE
Date: Mon, 27 Apr 2026 03:01:59 -0700
Message-ID: <20260427100553.2754667-11-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
move_pages_huge_pmd() returned -ENOENT for any non-trans_huge,
non-migration PMD, which fails aligned UFFDIO_MOVE on a swapped-out
THP -- the PMD swap entry is a perfectly valid mapping that should
move whole. Splitting via the move_pages_pte() fallback isn't a
substitute either: __split_huge_pmd_locked() splits a PMD swap entry
into HPAGE_PMD_NR PTE swap entries pointing at the same swap-cache
folio, but move_swap_pte() refuses any swap-cache folio that is still
large and returns -EBUSY.
Add move_swap_pmd(), modeled on move_swap_pte(), that moves the PMD
swap entry whole and re-anchors the swap-cache folio's anon rmap to
the destination VMA. Reject !pmd_swp_exclusive() entries with -EBUSY
to preserve UFFDIO_MOVE's single-owner semantics, propagate
soft-dirty, and carry the deposited page table across with the
entry.
The dispatcher in move_pages_huge_pmd() now waits for migration on a
PMD migration entry (matching the PTE path) and routes PMD swap
entries through move_swap_pmd() after pinning the swap device,
fetching and locking any cached folio, and arming an mmu_notifier
range so secondary MMUs see the move.
If the swap-cache folio was split (e.g. by deferred_split_scan or
memory_failure) between swap-out and UFFDIO_MOVE, src_folio is no
longer PMD-sized but the PMD swap entry still covers all 512 slots.
Moving the entry whole would only re-anchor one folio's anon rmap,
leaving the other 511 with a stale anon_vma. Return -EBUSY in this
case, matching move_pages_pte()'s rejection of large folios, so the
caller falls back to PTE-level moves.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 112 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 109e4dc4a167..bfcc9b274be7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2871,6 +2871,62 @@ int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
#endif
#ifdef CONFIG_USERFAULTFD
+/*
+ * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are
+ * acquired here; src_folio (if present) must already be locked. The deposited
+ * page table backing the source THP is moved across with the entry.
+ */
+static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ pmd_t orig_dst_pmd, pmd_t orig_src_pmd,
+ spinlock_t *dst_ptl, spinlock_t *src_ptl,
+ struct folio *src_folio, swp_entry_t entry)
+{
+ pgtable_t src_pgtable;
+ pmd_t moved_pmd;
+
+ /*
+ * The folio may have been freed and reused for a different swap entry
+ * while it was unlocked. Re-verify the association.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
+ double_pt_lock(dst_ptl, src_ptl);
+
+ if (!pmd_same(*src_pmd, orig_src_pmd) ||
+ !pmd_same(*dst_pmd, orig_dst_pmd)) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ /*
+ * If the folio is in the swap cache, re-anchor its anon rmap to the
+ * destination VMA so a future swap-in fault at dst_addr finds it.
+ * Otherwise, re-check that no folio was newly inserted under us.
+ */
+ if (src_folio) {
+ folio_move_anon_rmap(src_folio, dst_vma);
+ src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else if (swap_cache_has_folio(entry)) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ moved_pmd = pmdp_huge_get_and_clear(mm, src_addr, src_pmd);
+ if (pgtable_supports_soft_dirty())
+ moved_pmd = pmd_swp_mksoft_dirty(moved_pmd);
+ set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd);
+
+ src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+ pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+
+ double_pt_unlock(dst_ptl, src_ptl);
+ return 0;
+}
+
/*
* The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
* the caller, but it must return after releasing the page_table_lock.
@@ -2905,11 +2961,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
}
if (!pmd_trans_huge(src_pmdval)) {
- spin_unlock(src_ptl);
if (pmd_is_migration_entry(src_pmdval)) {
+ spin_unlock(src_ptl);
pmd_migration_entry_wait(mm, &src_pmdval);
return -EAGAIN;
}
+ if (pmd_is_swap_entry(src_pmdval)) {
+ swp_entry_t entry;
+ struct swap_info_struct *si;
+
+ /*
+ * UFFDIO_MOVE on anon mappings requires single-owner
+ * semantics; refuse to move a shared swap entry.
+ */
+ if (!pmd_swp_exclusive(src_pmdval)) {
+ spin_unlock(src_ptl);
+ return -EBUSY;
+ }
+
+ entry = softleaf_from_pmd(src_pmdval);
+ spin_unlock(src_ptl);
+
+ /* Pin the swap device against a racing swapoff. */
+ si = get_swap_device(entry);
+ if (unlikely(!si))
+ return -EAGAIN;
+
+ src_folio = swap_cache_get_folio(entry);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+ mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ if (src_folio) {
+ folio_lock(src_folio);
+ if (folio_nr_pages(src_folio) != HPAGE_PMD_NR) {
+ err = -EBUSY;
+ folio_unlock(src_folio);
+ folio_put(src_folio);
+ mmu_notifier_invalidate_range_end(&range);
+ put_swap_device(si);
+ return err;
+ }
+ }
+
+ dst_ptl = pmd_lockptr(mm, dst_pmd);
+ err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
+ dst_pmd, src_pmd, dst_pmdval,
+ src_pmdval, dst_ptl, src_ptl,
+ src_folio, entry);
+
+ mmu_notifier_invalidate_range_end(&range);
+ if (src_folio) {
+ folio_unlock(src_folio);
+ folio_put(src_folio);
+ }
+ put_swap_device(si);
+ return err;
+ }
+ spin_unlock(src_ptl);
return -ENOENT;
}
--
2.52.0
Thread overview: 18+ messages
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-04-27 10:01 ` Usama Arif [this message]
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-04-27 10:02 ` [PATCH 12/13] mm: install PMD swap entries on swap-out Usama Arif
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12 ` Usama Arif
2026-04-28 19:54 ` David Hildenbrand (Arm)