All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lance Yang <lance.yang@linux.dev>
To: richard.weiyang@gmail.com
Cc: david@kernel.org, akpm@linux-foundation.org, riel@surriel.com,
	liam@infradead.org, vbabka@kernel.org, harry@kernel.org,
	jannh@google.com, balbirs@nvidia.com, ziy@nvidia.com,
	sj@kernel.org, linux-mm@kvack.org, stable@vger.kernel.org,
	ljs@kernel.org, Lance Yang <lance.yang@linux.dev>
Subject: Re: [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd
Date: Mon, 22 Jun 2026 19:21:47 +0800	[thread overview]
Message-ID: <20260622112147.66777-1-lance.yang@linux.dev> (raw)
In-Reply-To: <20260620021353.nn7xp2ldqachq7gp@master>


On Sat, Jun 20, 2026 at 02:13:53AM +0000, Wei Yang wrote:
>On Fri, Jun 19, 2026 at 12:48:26PM +0200, David Hildenbrand (Arm) wrote:
>>On 6/19/26 12:44, Lorenzo Stoakes wrote:
>>> -cc wrong email
>>> 
>>> On Tue, Jun 16, 2026 at 06:34:36AM +0000, Wei Yang wrote:
>>>> For pmd_trans_huge() and pmd_is_migration_entry(), we does following
>>>> before return the pmd entry:
>>>>
>>>>   * re-validate pmd entry after PTL
>>>>   * check PVMW_MIGRATION
>>>>   * check_pmd()
>>>>   * handle on pte level if split under us
>>>>
>>>> But for device-private pmd, we just return after pmd_lock().
>>>>
>>>> This may return improper entry, e.g. if we are looking for a migration
>>>> entry, device-private entry could still be returned, which leads to data
>>>> corruption.
>>> 
>>> I don't thik this is quite clear?
>>> 
>>> How about:
>>> 
>>> 	If a softleaf entry is present, the existing code simply acquires the
>>> 	PMD lock and returns success even if PVMW_MIGRATION is set (indicating a
>>> 	migration entry is sought), meaning that the caller can incorrectly
>>> 	interpret the entry as something it is not, causing data corruption.
>>> 
>>>>
>>>> This patch fixes commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration
>>>> support device-private entries") by following the same pattern as
>>>> pmd_trans_huge() and pmd_is_migration_entry() for device private entry.
>>>>
>>>> While at it, it cleanups the pmd entry handling in page_vma_mapped_walk().
>>>>
>>>>   * Instead of handling trans huge/migration entry/device private entry
>>>>     in a mixed manner, we put each case into its own if condition and
>>>>     handle with the same pattern.
>>>>   * Also we grab PTL and make sure pmd is not changed under us after
>>>>     above check instead of do the check with PTL hold.
>>>>   * restart the process if pmd is changed under us
>>> 
>>> You're doing quite a bit for a fix and you're putting it all in one place.
>>> 
>>> How about do the fix as 1 patch, and then cleanups as other ones? It helps with
>>> review too :)
>>> 
>>> It's a general rule of thumb that if you do more than one of moving, refactoring
>>> or changing code, to do them as separate patches so a reviewer/somebody
>>> bisecting can clearly separate each.
>>> 
>>> Also PLEASE do not add new functionality (this lock recheck) in a fixes
>>> patch. We'll end up backporting new logic that way.
>>> 
>>> Make the fixes bit _minimal_.
>>
>>To be fair, I asked for this
>>
>>https://lore.kernel.org/all/2d48ef0d-1110-4a9d-adcb-f701a1ce2cfa@kernel.org/
>>
>>But given that Wei mostly used my quick draft without properly checking the
>>implications, yeah, let's fix it first separately.
>
>Sorry, if I misunderstand your point.
>
>>
>>I can then follow up with a proper cleanup.

FYI, spent a few days chasing the history here. Dropping my notes in case
they save someone else some time reading or refactoring this code :P

TL;DR below.

For THP PMDs, the lockless pmd_trans_huge() check was only a candidate
filter from day one.

ace71a19cec5 [1] ("mm: introduce page_vma_mapped_walk()") already had the
lock-and-recheck rule: the lockless check only got us into the branch,
then the code took pmd_lock() before making any PMD-level decision:

	if (pmd_trans_huge(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
	} else {
		...
	}

616b8371539a [2] ("mm: thp: enable thp migration in generic path") later
added PMD migration entries to that same locked branch:

	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
	} else {
		...
	}

Current code spells that helper as pmd_is_migration_entry(); that came
from the later 0ac881efe164 softleaf cleanup, with no functional change
intended.

3306d3119cea [3] ("mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd")
later made the post-lock PMD value explicit as pmde. Same rule, just less
repeated *pvmw->pmd dereferencing under PTL:

	pmde = READ_ONCE(*pvmw->pmd);
	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		pmde = *pvmw->pmd;
		...
	}

One extra detail matters here. In ace71a19cec5 [1], the locked THP PMD
branch rejected a non-present PMD before the present-THP check:

	if (pmd_trans_huge(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		if (!pmd_present(*pvmw->pmd))
			return not_found(pvmw);
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			...
		} else {
			...
		}
	}

616b8371539a [2] could not keep that order after adding PMD migration
entries, because a PMD migration entry is non-present. So the locked
branch was reshaped to check the still-present THP PMD case first, and
only then handle the non-present PMD as the PMD migration-entry case:

	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			...
		} else if (!pmd_present(*pvmw->pmd)) {
			...
		} else {
			...
		}
	}

Inside that non-present branch, 616b8371539a [2] made it a
PMD-migration-only case: first require THP migration support, then
require PVMW_MIGRATION. Otherwise it is not a match:

	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			...
		} else if (!pmd_present(*pvmw->pmd)) {
			if (thp_migration_supported()) {
				if (!(pvmw->flags & PVMW_MIGRATION))
					return not_found(pvmw);
				...
			} else
				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
			return not_found(pvmw);
		} else {
			...
		}
	}

Note that PMD-level swap entries for anonymous THPs were not supported
then, and still aren't supported today.

After those gates, the entry still had to be a migration entry for the
target THP. Functionally, that check goes back to 616b8371539a [2]:

	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			...
		} else if (!pmd_present(*pvmw->pmd)) {
			if (thp_migration_supported()) {
				if (!(pvmw->flags & PVMW_MIGRATION))
					return not_found(pvmw);
				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);

					if (migration_entry_to_page(entry) != page)
						return not_found(pvmw);
					return true;
				}
			} else
				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
			return not_found(pvmw);
		} else {
			...
		}
	}

The later commits mostly changed how the same rule is written.
e2e1d4076c77 rewrote it into the negative-check style; 2aff7a4755be
changed the page comparison to check_pmd() when pvmw moved to
pfn/nr_pages; 0d206b5d2e0d switched the PFN extraction to
swp_offset_pfn(); and 0ac881efe164 moved the PMD swap-entry helpers over
to softleaf helpers.

e2e1d4076c77: nested positive style -> negative-check style
2aff7a4755be: page comparison -> check_pmd(PFN range)
0d206b5d2e0d: swp_offset() -> swp_offset_pfn()
0ac881efe164: pmd_to_swp_entry/is_migration_entry -> softleaf helpers

For the PMD-mapped THP case, the positive decision was also made under
pmd_lock() from day one. In ace71a19cec5 [1], after taking pmd_lock(),
the PMD had to still be a THP PMD, the walk had to be a
non-PVMW_MIGRATION walk, and the PMD had to map the target THP:

	if (pmd_trans_huge(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			if (pvmw->flags & PVMW_MIGRATION)
				return not_found(pvmw);
			if (pmd_page(*pvmw->pmd) != page)
				return not_found(pvmw);
			return true;
		} else {
			...
		}
	} else {
		...
	}

2aff7a4755be later changed the target check from pmd_page(...) == page to
check_pmd(pmd_pfn(pmde), pvmw), when pvmw moved to pfn/nr_pages. Same
rule, but range-based now.

The split fallback was there from ace71a19cec5 [1] as well. After taking
pmd_lock(), if the locked PMD was no longer a THP PMD, the walker did not
make a PMD-level true/not_found decision. It dropped pmd_lock(), cleared
pvmw->ptl, and continued at PTE level:

	if (pmd_trans_huge(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			...
		} else {
			/* THP pmd was split under us: handle on pte level */
			spin_unlock(pvmw->ptl);
			pvmw->ptl = NULL;
		}
	} else {
		...
	}
	if (!map_pte(pvmw))
		goto next_pte;

616b8371539a [2] kept that fallback after adding PMD migration entries.
e2e1d4076c77 only rearranged the !pmd_present() block; the split fallback
still drops pmd_lock() and falls through to map_pte().

Now the outer non-present PMD case. In ace71a19cec5 [1], this was not an
explicit else-if branch yet. It was hidden in check_pmd():

static inline bool check_pmd(struct page_vma_mapped_walk *pvmw)
{
	pmd_t pmde;
	/*
	 * Make sure we don't re-load pmd between present and !trans_huge check.
	 * We need a consistent view.
	 */
	pmde = READ_ONCE(*pvmw->pmd);
	return pmd_present(pmde) && !pmd_trans_huge(pmde);
}

and the caller was:

	if (pmd_trans_huge(*pvmw->pmd)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
	} else {
		if (!check_pmd(pvmw))
			return false;
	}

So outside the THP PMD branch, a non-present PMD failed check_pmd() and
ended the walk.

a7b100953aa3 [4] ("mm: page_vma_mapped: ensure pmd is loaded with
READ_ONCE outside of lock") then made that lockless PMD value explicit in
page_vma_mapped_walk():

So this branch means: the lockless PMD read did not look like a THP PMD,
did not look like a PMD migration entry, and was non-present.

	pmde = READ_ONCE(*pvmw->pmd);
	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
	} else if (!pmd_present(pmde)) {
		return false;
	}

The point of that commit was to use the same READ_ONCE() PMD snapshot for
the initial lockless checks, rather than letting the compiler reload or
reuse a stale PMD value around check_pmd().

732ed55823fc [5] ("mm/thp: try_to_unmap() use TTU_SYNC for safe
splitting") then added the PVMW_SYNC wait to this same outer non-present
PMD branch:

	pmde = READ_ONCE(*pvmw->pmd);
	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		...
	} else if (!pmd_present(pmde)) {
		/*
		 * If PVMW_SYNC, take and drop THP pmd lock so that we
		 * cannot return prematurely, while zap_huge_pmd() has
		 * cleared *pmd but not decremented compound_mapcount().
		 */
		if ((pvmw->flags & PVMW_SYNC) &&
		    PageTransCompound(pvmw->page)) {
			spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);

			spin_unlock(ptl);
		}
		return false;
	}

A lockless non-present PMD was still not a mapping. PVMW_SYNC only
changed whether the walker could return immediately: under PVMW_SYNC, it
had to take and drop pmd_lock() first, so it would not return while
zap_huge_pmd() had cleared *pmd but not yet decremented
compound_mapcount().

a9a7504d9bea [6] ("mm/thp: fix page_vma_mapped_walk() if THP mapped by
ptes") then changed the end of this branch from return false to
step_forward(PMD_SIZE) + continue:

		pmde = READ_ONCE(*pvmw->pmd);

		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
			...
		} else if (!pmd_present(pmde)) {
			/*
			 * If PVMW_SYNC, take and drop THP pmd lock so that we
			 * cannot return prematurely, while zap_huge_pmd() has
			 * cleared *pmd but not decremented compound_mapcount().
			 */
			if ((pvmw->flags & PVMW_SYNC) &&
			    PageTransCompound(page)) {
				spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);

				spin_unlock(ptl);
			}
			step_forward(pvmw, PMD_SIZE);
			continue;
		}
		if (!map_pte(pvmw))
			goto next_pte;

That was about the walk range, not about making a non-present PMD a match.
step_forward(pvmw, PMD_SIZE) advances pvmw->address to the next PMD
boundary, and continue restarts the walk from there. So this branch only
rules out the current PMD-sized slot.

		pmde = pmdp_get_lockless(pvmw->pmd);

		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
		    (pmd_present(pmde) && pmd_devmap(pmde))) {
			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
			...
		} else if (!pmd_present(pmde)) {
			/*
			 * If PVMW_SYNC, take and drop THP pmd lock so that we
			 * cannot return prematurely, while zap_huge_pmd() has
			 * cleared *pmd but not decremented compound_mapcount().
			 */
			if ((pvmw->flags & PVMW_SYNC) &&
			    thp_vma_suitable_order(vma, pvmw->address,
						   PMD_ORDER) &&
			    (pvmw->nr_pages >= HPAGE_PMD_NR)) {
				spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);

				spin_unlock(ptl);
			}
			step_forward(pvmw, PMD_SIZE);
			continue;
		}

2aff7a4755be later converted page_vma_mapped_walk() to pfn/nr_pages, so
the PVMW_SYNC condition became VMA/range based.

c453d8c7d138 tightened the VMA side to transhuge_vma_suitable(vma,
pvmw->address), and 3485b88390b0 later made that helper order-aware,
giving the current thp_vma_suitable_order(vma, pvmw->address, PMD_ORDER)
spelling.

		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
		    (pmd_present(pmde) && pmd_devmap(pmde))) {

There was also a temporary devmap wrinkle. 6472f6d2f7d9 added
pmd_devmap() here because huge devmap PMDs were not covered by the old
pmd_trans_huge() test.

8a6a984c2e0e later removed the explicit pmd_devmap() checks because DAX
no longer created pmd_devmap entries, and pXd_trans_huge() covered those
mappings. This did not change the outer non-present PMD rule above.

65edfda6f3f2 [7] ("mm/rmap: extend rmap and migration support
device-private entries") later added the device-private PMD case into
this same outer non-present branch:

		pmde = pmdp_get_lockless(pvmw->pmd);

		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
			pmde = *pvmw->pmd;
			...
		} else if (!pmd_present(pmde)) {
			/*
			 * If PVMW_SYNC, take and drop THP pmd lock so that we
			 * cannot return prematurely, while zap_huge_pmd() has
			 * cleared *pmd but not decremented compound_mapcount().
			 */
			swp_entry_t entry = pmd_to_swp_entry(pmde);

			if (is_device_private_entry(entry)) {
				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
				return true;
			}

			...
		}

That changed the old rule for this branch: a lockless non-present PMD was
no longer always skipped. If the lockless PMD decoded as device-private,
the walker took pmd_lock() and returned true.

0ac881efe164 later converted this code to softleaf helpers, with no
functional change intended. So the names changed, but the control flow
stayed the same:

		pmde = pmdp_get_lockless(pvmw->pmd);

		if (pmd_trans_huge(pmde) || pmd_is_migration_entry(pmde)) {
			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
			...
		} else if (!pmd_present(pmde)) {
			const softleaf_t entry = softleaf_from_pmd(pmde);

			if (softleaf_is_device_private(entry)) {
				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
				return true;
			}

			if ((pvmw->flags & PVMW_SYNC) &&
			    thp_vma_suitable_order(vma, pvmw->address,
						   PMD_ORDER) &&
			    (pvmw->nr_pages >= HPAGE_PMD_NR))
				sync_with_folio_pmd_zap(mm, pvmw->pmd);

			step_forward(pvmw, PMD_SIZE);
			continue;
		}

514c2fe9927e later moved the PVMW_SYNC take/drop pmd_lock() helper into
sync_with_folio_pmd_zap().

Note that PMD device-private handling still did not line up with the PTE
side, as I pointed out earlier[8].

[1] https://lore.kernel.org/all/20170129173858.45174-3-kirill.shutemov@linux.intel.com/T/#u
[2] https://lore.kernel.org/all/20170717193955.20207-6-zi.yan@sent.com/
[3] https://lore.kernel.org/all/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com/
[4] https://lore.kernel.org/all/1507222630-5839-1-git-send-email-will.deacon@arm.com/
[5] https://lore.kernel.org/linux-mm/20210616012353._aWHpXxeZ%25akpm@linux-foundation.org/
[6] https://lore.kernel.org/all/fedb8632-1798-de42-f39e-873551d5bc81@google.com/
[7] https://lore.kernel.org/all/20251001065707.920170-5-balbirs@nvidia.com/
[8] https://lore.kernel.org/all/20260619121909.90510-1-lance.yang@linux.dev/

[...]

Cheers, Lance


  reply	other threads:[~2026-06-22 11:22 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260616063436.20455-1-richard.weiyang@gmail.com>
2026-06-16 12:30 ` [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd Lance Yang
2026-06-16 23:50   ` Wei Yang
2026-06-17  2:32     ` Lance Yang
2026-06-17  8:18       ` Wei Yang
2026-06-19  2:30         ` Wei Yang
2026-06-19 12:19           ` Lance Yang
2026-06-20  2:02             ` Wei Yang
     [not found]   ` <666dc40b-e37a-46eb-af55-7a81bc1643f1@kernel.org>
2026-06-17  2:11     ` Balbir Singh
2026-06-17  3:14       ` Lance Yang
2026-06-19 10:44 ` Lorenzo Stoakes
2026-06-19 10:48   ` David Hildenbrand (Arm)
2026-06-19 11:04     ` Lorenzo Stoakes
2026-06-20  2:13     ` Wei Yang
2026-06-22 11:21       ` Lance Yang [this message]
2026-06-22 11:33       ` David Hildenbrand (Arm)
2026-06-22 12:52         ` Wei Yang
2026-06-20  2:11   ` Wei Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260622112147.66777-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=balbirs@nvidia.com \
    --cc=david@kernel.org \
    --cc=harry@kernel.org \
    --cc=jannh@google.com \
    --cc=liam@infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=richard.weiyang@gmail.com \
    --cc=riel@surriel.com \
    --cc=sj@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=vbabka@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.