From: Lance Yang <lance.yang@linux.dev>
To: usama.arif@linux.dev
Cc: akpm@linux-foundation.org, david@kernel.org, chrisl@kernel.org,
kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com,
ying.huang@linux.alibaba.com, baoquan.he@linux.dev,
willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
liam@infradead.org, ryan.roberts@arm.com, vbabka@kernel.org,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
kernel-team@meta.com, linux-mm@kvack.org
Subject: Re: [v2 13/16] mm: handle PMD swap entries in UFFDIO_MOVE
Date: Fri, 12 Jun 2026 16:50:27 +0800 [thread overview]
Message-ID: <20260612085027.5401-1-lance.yang@linux.dev> (raw)
In-Reply-To: <20260602142537.198755-14-usama.arif@linux.dev>
+Cc linux-mm
On Tue, Jun 02, 2026 at 07:24:21AM -0700, Usama Arif wrote:
[...]
>@@ -2846,11 +2902,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> }
>
> if (!pmd_trans_huge(src_pmdval)) {
>- spin_unlock(src_ptl);
> if (pmd_is_migration_entry(src_pmdval)) {
>+ spin_unlock(src_ptl);
> pmd_migration_entry_wait(mm, &src_pmdval);
> return -EAGAIN;
> }
>+ if (pmd_is_swap_entry(src_pmdval)) {
Looks buggy ... unless I missed something ...
>+ swp_entry_t entry;
>+ struct swap_info_struct *si;
>+
>+ /*
>+ * UFFDIO_MOVE on anon mappings requires single-owner
>+ * semantics; refuse to move a shared swap entry.
>+ */
>+ if (!pmd_swp_exclusive(src_pmdval)) {
>+ spin_unlock(src_ptl);
>+ return -EBUSY;
>+ }
>+
>+ entry = softleaf_from_pmd(src_pmdval);
>+ spin_unlock(src_ptl);
>+
>+ /* Pin the swap device against a racing swapoff. */
>+ si = get_swap_device(entry);
>+ if (unlikely(!si))
>+ return -EAGAIN;
>+
>+ src_folio = swap_cache_get_folio(entry);
We only check the first swap slot. Imagine we have something like this
after the PMD-sized swapcache folio was split while the PMD swap entry
was installed:
page table:
src PMD -> swap entry S
swap cache:
S + 0 -> no folio
S + 1 -> order-0 folio in the swap cache
S + 2 -> no folio
S + 3 -> order-0 folio in the swap cache
...
>+
>+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
>+ mm, src_addr,
>+ src_addr + HPAGE_PMD_SIZE);
>+ mmu_notifier_invalidate_range_start(&range);
>+
>+ if (src_folio) {
>+ folio_lock(src_folio);
>+ if (folio_nr_pages(src_folio) != HPAGE_PMD_NR) {
If S has a non-PMD-sized folio, this returns -EBUSY.
>+ err = -EBUSY;
>+ folio_unlock(src_folio);
>+ folio_put(src_folio);
>+ mmu_notifier_invalidate_range_end(&range);
>+ put_swap_device(si);
>+ return err;
>+ }
>+ }
>+
>+ dst_ptl = pmd_lockptr(mm, dst_pmd);
But if S has no folio, the initial lookup passes src_folio == NULL to
move_swap_pmd(), , which only rechecks S:
if (src_folio) {
[...]
} else if (swap_cache_has_folio(entry)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
So if S is empty, the move can still go ahead even if S + 1 ... S + 511
contain folios in the swap cache.
>+ err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
>+ dst_pmd, src_pmd, dst_pmdval,
>+ src_pmdval, dst_ptl, src_ptl,
>+ src_folio, entry);
>+
In that case, checking only S misses the order-0 folios in later slots.
move_swap_pmd() can then move the PMD swap entry whole without calling
folio_move_anon_rmap() or updating folio->index for those later folios.
Note that move_swap_pte() already does this for PTE-mapped swap entries,
because a folio in the swap cache needs its index and mapping updated to
align with dst_vma.
If those folios are later faulted in at dst, their rmap metadata still
points at the old anon_vma/index. Later rmap users derive the virtual
address from folio->mapping and folio->index, so they can look at the
wrong VMA/address ...
Should check the whole PMD swap range before deciding there is no folio
in the swap cache to update?
Am I reading that code right?
>+ mmu_notifier_invalidate_range_end(&range);
>+ if (src_folio) {
>+ folio_unlock(src_folio);
>+ folio_put(src_folio);
>+ }
>+ put_swap_device(si);
>+ return err;
>+ }
>+ spin_unlock(src_ptl);
> return -ENOENT;
> }
>
>--
>2.52.0
>
Cheers, Lance
parent reply other threads:[~2026-06-12 8:51 UTC|newest]
Thread overview: expand[flat|nested] mbox.gz Atom feed
[parent not found: <20260602142537.198755-14-usama.arif@linux.dev>]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260612085027.5401-1-lance.yang@linux.dev \
--to=lance.yang@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=alex@ghiti.fr \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=baoquan.he@linux.dev \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=hannes@cmpxchg.org \
--cc=kas@kernel.org \
--cc=kasong@tencent.com \
--cc=kernel-team@meta.com \
--cc=liam@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=npache@redhat.com \
--cc=nphamcs@gmail.com \
--cc=riel@surriel.com \
--cc=ryan.roberts@arm.com \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=usama.arif@linux.dev \
--cc=vbabka@kernel.org \
--cc=willy@infradead.org \
--cc=ying.huang@linux.alibaba.com \
--cc=youngjun.park@lge.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox