From: Lance Yang <lance.yang@linux.dev>
To: richard.weiyang@gmail.com
Cc: david@kernel.org, akpm@linux-foundation.org, riel@surriel.com,
liam@infradead.org, vbabka@kernel.org, harry@kernel.org,
jannh@google.com, balbirs@nvidia.com, ziy@nvidia.com,
sj@kernel.org, linux-mm@kvack.org, stable@vger.kernel.org,
ljs@kernel.org, Lance Yang <lance.yang@linux.dev>
Subject: Re: [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd
Date: Mon, 22 Jun 2026 19:21:47 +0800 [thread overview]
Message-ID: <20260622112147.66777-1-lance.yang@linux.dev> (raw)
In-Reply-To: <20260620021353.nn7xp2ldqachq7gp@master>
On Sat, Jun 20, 2026 at 02:13:53AM +0000, Wei Yang wrote:
>On Fri, Jun 19, 2026 at 12:48:26PM +0200, David Hildenbrand (Arm) wrote:
>>On 6/19/26 12:44, Lorenzo Stoakes wrote:
>>> -cc wrong email
>>>
>>> On Tue, Jun 16, 2026 at 06:34:36AM +0000, Wei Yang wrote:
>>>> For pmd_trans_huge() and pmd_is_migration_entry(), we does following
>>>> before return the pmd entry:
>>>>
>>>> * re-validate pmd entry after PTL
>>>> * check PVMW_MIGRATION
>>>> * check_pmd()
>>>> * handle on pte level if split under us
>>>>
>>>> But for device-private pmd, we just return after pmd_lock().
>>>>
>>>> This may return improper entry, e.g. if we are looking for a migration
>>>> entry, device-private entry could still be returned, which leads to data
>>>> corruption.
>>>
>>> I don't thik this is quite clear?
>>>
>>> How about:
>>>
>>> If a softleaf entry is present, the existing code simply acquires the
>>> PMD lock and returns success even if PVMW_MIGRATION is set (indicating a
>>> migration entry is sought), meaning that the caller can incorrectly
>>> interpret the entry as something it is not, causing data corruption.
>>>
>>>>
>>>> This patch fixes commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration
>>>> support device-private entries") by following the same pattern as
>>>> pmd_trans_huge() and pmd_is_migration_entry() for device private entry.
>>>>
>>>> While at it, it cleanups the pmd entry handling in page_vma_mapped_walk().
>>>>
>>>> * Instead of handling trans huge/migration entry/device private entry
>>>> in a mixed manner, we put each case into its own if condition and
>>>> handle with the same pattern.
>>>> * Also we grab PTL and make sure pmd is not changed under us after
>>>> above check instead of do the check with PTL hold.
>>>> * restart the process if pmd is changed under us
>>>
>>> You're doing quite a bit for a fix and you're putting it all in one place.
>>>
>>> How about do the fix as 1 patch, and then cleanups as other ones? It helps with
>>> review too :)
>>>
>>> It's a general rule of thumb that if you do more than one of moving, refactoring
>>> or changing code, to do them as separate patches so a reviewer/somebody
>>> bisecting can clearly separate each.
>>>
>>> Also PLEASE do not add new functionality (this lock recheck) in a fixes
>>> patch. We'll end up backporting new logic that way.
>>>
>>> Make the fixes bit _minimal_.
>>
>>To be fair, I asked for this
>>
>>https://lore.kernel.org/all/2d48ef0d-1110-4a9d-adcb-f701a1ce2cfa@kernel.org/
>>
>>But given that Wei mostly used my quick draft without properly checking the
>>implications, yeah, let's fix it first separately.
>
>Sorry, if I misunderstand your point.
>
>>
>>I can then follow up with a proper cleanup.
FYI, spent a few days chasing the history here. Dropping my notes in case
they save someone else some time reading or refactoring this code :P
TL;DR below.
For THP PMDs, the lockless pmd_trans_huge() check was only a candidate
filter from day one.
ace71a19cec5 [1] ("mm: introduce page_vma_mapped_walk()") already had the
lock-and-recheck rule: the lockless check only got us into the branch,
then the code took pmd_lock() before making any PMD-level decision:
if (pmd_trans_huge(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else {
...
}
616b8371539a [2] ("mm: thp: enable thp migration in generic path") later
added PMD migration entries to that same locked branch:
if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else {
...
}
Current code spells that helper as pmd_is_migration_entry(); that came
from the later 0ac881efe164 softleaf cleanup, with no functional change
intended.
3306d3119cea [3] ("mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd")
later made the post-lock PMD value explicit as pmde. Same rule, just less
repeated *pvmw->pmd dereferencing under PTL:
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
...
}
One extra detail matters here. In ace71a19cec5 [1], the locked THP PMD
branch rejected a non-present PMD before the present-THP check:
if (pmd_trans_huge(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (!pmd_present(*pvmw->pmd))
return not_found(pvmw);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
...
} else {
...
}
}
616b8371539a [2] could not keep that order after adding PMD migration
entries, because a PMD migration entry is non-present. So the locked
branch was reshaped to check the still-present THP PMD case first, and
only then handle the non-present PMD as the PMD migration-entry case:
if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
...
} else if (!pmd_present(*pvmw->pmd)) {
...
} else {
...
}
}
Inside that non-present branch, 616b8371539a [2] made it a
PMD-migration-only case: first require THP migration support, then
require PVMW_MIGRATION. Otherwise it is not a match:
if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
...
} else if (!pmd_present(*pvmw->pmd)) {
if (thp_migration_supported()) {
if (!(pvmw->flags & PVMW_MIGRATION))
return not_found(pvmw);
...
} else
WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
return not_found(pvmw);
} else {
...
}
}
Note that PMD-level swap entries for anonymous THPs were not supported
then, and still aren't supported today.
After those gates, the entry still had to be a migration entry for the
target THP. Functionally, that check goes back to 616b8371539a [2]:
if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
...
} else if (!pmd_present(*pvmw->pmd)) {
if (thp_migration_supported()) {
if (!(pvmw->flags & PVMW_MIGRATION))
return not_found(pvmw);
if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
if (migration_entry_to_page(entry) != page)
return not_found(pvmw);
return true;
}
} else
WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
return not_found(pvmw);
} else {
...
}
}
The later commits mostly changed how the same rule is written.
e2e1d4076c77 rewrote it into the negative-check style; 2aff7a4755be
changed the page comparison to check_pmd() when pvmw moved to
pfn/nr_pages; 0d206b5d2e0d switched the PFN extraction to
swp_offset_pfn(); and 0ac881efe164 moved the PMD swap-entry helpers over
to softleaf helpers.
e2e1d4076c77: nested positive style -> negative-check style
2aff7a4755be: page comparison -> check_pmd(PFN range)
0d206b5d2e0d: swp_offset() -> swp_offset_pfn()
0ac881efe164: pmd_to_swp_entry/is_migration_entry -> softleaf helpers
For the PMD-mapped THP case, the positive decision was also made under
pmd_lock() from day one. In ace71a19cec5 [1], after taking pmd_lock(),
the PMD had to still be a THP PMD, the walk had to be a
non-PVMW_MIGRATION walk, and the PMD had to map the target THP:
if (pmd_trans_huge(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (pmd_page(*pvmw->pmd) != page)
return not_found(pvmw);
return true;
} else {
...
}
} else {
...
}
2aff7a4755be later changed the target check from pmd_page(...) == page to
check_pmd(pmd_pfn(pmde), pvmw), when pvmw moved to pfn/nr_pages. Same
rule, but range-based now.
The split fallback was there from ace71a19cec5 [1] as well. After taking
pmd_lock(), if the locked PMD was no longer a THP PMD, the walker did not
make a PMD-level true/not_found decision. It dropped pmd_lock(), cleared
pvmw->ptl, and continued at PTE level:
if (pmd_trans_huge(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
if (likely(pmd_trans_huge(*pvmw->pmd))) {
...
} else {
/* THP pmd was split under us: handle on pte level */
spin_unlock(pvmw->ptl);
pvmw->ptl = NULL;
}
} else {
...
}
if (!map_pte(pvmw))
goto next_pte;
616b8371539a [2] kept that fallback after adding PMD migration entries.
e2e1d4076c77 only rearranged the !pmd_present() block; the split fallback
still drops pmd_lock() and falls through to map_pte().
Now the outer non-present PMD case. In ace71a19cec5 [1], this was not an
explicit else-if branch yet. It was hidden in check_pmd():
static inline bool check_pmd(struct page_vma_mapped_walk *pvmw)
{
pmd_t pmde;
/*
* Make sure we don't re-load pmd between present and !trans_huge check.
* We need a consistent view.
*/
pmde = READ_ONCE(*pvmw->pmd);
return pmd_present(pmde) && !pmd_trans_huge(pmde);
}
and the caller was:
if (pmd_trans_huge(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else {
if (!check_pmd(pvmw))
return false;
}
So outside the THP PMD branch, a non-present PMD failed check_pmd() and
ended the walk.
a7b100953aa3 [4] ("mm: page_vma_mapped: ensure pmd is loaded with
READ_ONCE outside of lock") then made that lockless PMD value explicit in
page_vma_mapped_walk():
So this branch means: the lockless PMD read did not look like a THP PMD,
did not look like a PMD migration entry, and was non-present.
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else if (!pmd_present(pmde)) {
return false;
}
The point of that commit was to use the same READ_ONCE() PMD snapshot for
the initial lockless checks, rather than letting the compiler reload or
reuse a stale PMD value around check_pmd().
732ed55823fc [5] ("mm/thp: try_to_unmap() use TTU_SYNC for safe
splitting") then added the PVMW_SYNC wait to this same outer non-present
PMD branch:
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else if (!pmd_present(pmde)) {
/*
* If PVMW_SYNC, take and drop THP pmd lock so that we
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
if ((pvmw->flags & PVMW_SYNC) &&
PageTransCompound(pvmw->page)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
spin_unlock(ptl);
}
return false;
}
A lockless non-present PMD was still not a mapping. PVMW_SYNC only
changed whether the walker could return immediately: under PVMW_SYNC, it
had to take and drop pmd_lock() first, so it would not return while
zap_huge_pmd() had cleared *pmd but not yet decremented
compound_mapcount().
a9a7504d9bea [6] ("mm/thp: fix page_vma_mapped_walk() if THP mapped by
ptes") then changed the end of this branch from return false to
step_forward(PMD_SIZE) + continue:
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else if (!pmd_present(pmde)) {
/*
* If PVMW_SYNC, take and drop THP pmd lock so that we
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
if ((pvmw->flags & PVMW_SYNC) &&
PageTransCompound(page)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
spin_unlock(ptl);
}
step_forward(pvmw, PMD_SIZE);
continue;
}
if (!map_pte(pvmw))
goto next_pte;
That was about the walk range, not about making a non-present PMD a match.
step_forward(pvmw, PMD_SIZE) advances pvmw->address to the next PMD
boundary, and continue restarts the walk from there. So this branch only
rules out the current PMD-sized slot.
pmde = pmdp_get_lockless(pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
(pmd_present(pmde) && pmd_devmap(pmde))) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else if (!pmd_present(pmde)) {
/*
* If PVMW_SYNC, take and drop THP pmd lock so that we
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
if ((pvmw->flags & PVMW_SYNC) &&
thp_vma_suitable_order(vma, pvmw->address,
PMD_ORDER) &&
(pvmw->nr_pages >= HPAGE_PMD_NR)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
spin_unlock(ptl);
}
step_forward(pvmw, PMD_SIZE);
continue;
}
2aff7a4755be later converted page_vma_mapped_walk() to pfn/nr_pages, so
the PVMW_SYNC condition became VMA/range based.
c453d8c7d138 tightened the VMA side to transhuge_vma_suitable(vma,
pvmw->address), and 3485b88390b0 later made that helper order-aware,
giving the current thp_vma_suitable_order(vma, pvmw->address, PMD_ORDER)
spelling.
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
(pmd_present(pmde) && pmd_devmap(pmde))) {
There was also a temporary devmap wrinkle. 6472f6d2f7d9 added
pmd_devmap() here because huge devmap PMDs were not covered by the old
pmd_trans_huge() test.
8a6a984c2e0e later removed the explicit pmd_devmap() checks because DAX
no longer created pmd_devmap entries, and pXd_trans_huge() covered those
mappings. This did not change the outer non-present PMD rule above.
65edfda6f3f2 [7] ("mm/rmap: extend rmap and migration support
device-private entries") later added the device-private PMD case into
this same outer non-present branch:
pmde = pmdp_get_lockless(pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
...
} else if (!pmd_present(pmde)) {
/*
* If PVMW_SYNC, take and drop THP pmd lock so that we
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
swp_entry_t entry = pmd_to_swp_entry(pmde);
if (is_device_private_entry(entry)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
return true;
}
...
}
That changed the old rule for this branch: a lockless non-present PMD was
no longer always skipped. If the lockless PMD decoded as device-private,
the walker took pmd_lock() and returned true.
0ac881efe164 later converted this code to softleaf helpers, with no
functional change intended. So the names changed, but the control flow
stayed the same:
pmde = pmdp_get_lockless(pvmw->pmd);
if (pmd_trans_huge(pmde) || pmd_is_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
...
} else if (!pmd_present(pmde)) {
const softleaf_t entry = softleaf_from_pmd(pmde);
if (softleaf_is_device_private(entry)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
return true;
}
if ((pvmw->flags & PVMW_SYNC) &&
thp_vma_suitable_order(vma, pvmw->address,
PMD_ORDER) &&
(pvmw->nr_pages >= HPAGE_PMD_NR))
sync_with_folio_pmd_zap(mm, pvmw->pmd);
step_forward(pvmw, PMD_SIZE);
continue;
}
514c2fe9927e later moved the PVMW_SYNC take/drop pmd_lock() helper into
sync_with_folio_pmd_zap().
Note that PMD device-private handling still did not line up with the PTE
side, as I pointed out earlier[8].
[1] https://lore.kernel.org/all/20170129173858.45174-3-kirill.shutemov@linux.intel.com/T/#u
[2] https://lore.kernel.org/all/20170717193955.20207-6-zi.yan@sent.com/
[3] https://lore.kernel.org/all/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com/
[4] https://lore.kernel.org/all/1507222630-5839-1-git-send-email-will.deacon@arm.com/
[5] https://lore.kernel.org/linux-mm/20210616012353._aWHpXxeZ%25akpm@linux-foundation.org/
[6] https://lore.kernel.org/all/fedb8632-1798-de42-f39e-873551d5bc81@google.com/
[7] https://lore.kernel.org/all/20251001065707.920170-5-balbirs@nvidia.com/
[8] https://lore.kernel.org/all/20260619121909.90510-1-lance.yang@linux.dev/
[...]
Cheers, Lance
next prev parent reply other threads:[~2026-06-22 11:22 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260616063436.20455-1-richard.weiyang@gmail.com>
2026-06-16 12:30 ` [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd Lance Yang
2026-06-16 23:50 ` Wei Yang
2026-06-17 2:32 ` Lance Yang
2026-06-17 8:18 ` Wei Yang
2026-06-19 2:30 ` Wei Yang
2026-06-19 12:19 ` Lance Yang
2026-06-20 2:02 ` Wei Yang
[not found] ` <666dc40b-e37a-46eb-af55-7a81bc1643f1@kernel.org>
2026-06-17 2:11 ` Balbir Singh
2026-06-17 3:14 ` Lance Yang
2026-06-19 10:44 ` Lorenzo Stoakes
2026-06-19 10:48 ` David Hildenbrand (Arm)
2026-06-19 11:04 ` Lorenzo Stoakes
2026-06-20 2:13 ` Wei Yang
2026-06-22 11:21 ` Lance Yang [this message]
2026-06-22 11:33 ` David Hildenbrand (Arm)
2026-06-22 12:52 ` Wei Yang
2026-06-20 2:11 ` Wei Yang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260622112147.66777-1-lance.yang@linux.dev \
--to=lance.yang@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=balbirs@nvidia.com \
--cc=david@kernel.org \
--cc=harry@kernel.org \
--cc=jannh@google.com \
--cc=liam@infradead.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=richard.weiyang@gmail.com \
--cc=riel@surriel.com \
--cc=sj@kernel.org \
--cc=stable@vger.kernel.org \
--cc=vbabka@kernel.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox