Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Lance Yang <lance.yang@linux.dev>
To: dev.jain@arm.com, linmiaohe@huawei.com
Cc: lance.yang@linux.dev, muchun.song@linux.dev, osalvador@suse.de,
	akpm@linux-foundation.org, ljs@kernel.org, david@kernel.org,
	liam@infradead.org, riel@surriel.com, vbabka@kernel.org,
	harry@kernel.org, jannh@google.com, kas@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	rcampbell@nvidia.com, apopple@nvidia.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, mel@csn.ul.ie,
	nao.horiguchi@gmail.com, ak@linux.intel.com,
	j-nomura@ce.jp.nec.com, pfalcato@suse.de, dave.hansen@intel.com,
	tglx@kernel.org, jpoimboe@kernel.org, ryan.roberts@arm.com,
	anshuman.khandual@arm.com, stable@vger.kernel.org
Subject: Re: [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb
Date: Fri, 26 Jun 2026 22:10:31 +0800	[thread overview]
Message-ID: <20260626141031.14309-1-lance.yang@linux.dev> (raw)
In-Reply-To: <e6d1b813-0893-458e-9d58-8d3a9bd979c4@arm.com>


On Fri, Jun 26, 2026 at 06:53:10PM +0530, Dev Jain wrote:
>
>
>On 26/06/26 1:18 pm, Lance Yang wrote:
>> 
>> On Thu, Jun 25, 2026 at 11:29:53AM +0000, Dev Jain wrote:
>>> check_pte() is the final validation step in page_vma_mapped_walk().
>>> It reads pvmw->pte with ptep_get() to decide whether the entry maps
>>> the PFN range being walked. For hugetlb VMAs, that pointer refers
>>> to a hugetlb entry.
>>>
>>> On arches which provide their own huge_ptep_get() to dereference a huge
>>> pte pointer, accessing via ptep_get() would cause pte_pfn(),
>>> pte_present() etc to misbehave.
>>>
>>> It is not clear whether this has a trivially visible effect to userspace.
>>>
>>> Use huge_ptep_get() to dereference a huge pte pointer.
>>>
>>> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> mm/page_vma_mapped.c | 8 +++++++-
>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index 2ccbabfb2cc17..18e1d341f463c 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -107,7 +107,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>>> static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>> {
>>> 	unsigned long pfn;
>>> -	pte_t ptent = ptep_get(pvmw->pte);
>>> +	pte_t ptent;
>>> +
>>> +	if (is_vm_hugetlb_page(pvmw->vma))
>>> +		ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>> +				      pvmw->pte);
>> 
>> I think check_pte() can pass a wrong address to huge_ptep_get() ...
>
>Won't this be handled by rmap_walk_anon/rmap_walk_file - they are the ones
>performing the rmap traversal and passing address to try_to_unmap_one/folio_referenced_one
>etc ...

Right, that should cover the rmap callbacks. The bit I was worried about
is page_mapped_in_vma() though.

>> 
>> Not sure that is wrong in the first place. For memory failure,
>> page_mapped_in_vma() can be called with a poisoned tail page of a hugetlb
>> folio. In that case, pvmw->address need not be hugepage-aligned.
>> 
>> @Miaohe

For hugetlb memory failure we start with the poisoned PFN:

static int try_memory_failure_hugetlb(unsigned long pfn, int flags)
{
	...
	struct page *p = pfn_to_page(pfn);
	struct folio *folio;
	...

	folio = page_folio(p);

	...

	if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
		...
	}

	...
}

and pass the same p down:

static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
		unsigned long pfn, int flags)
{
	...

	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);

	...
}

static void collect_procs(const struct folio *folio, const struct page *page,
		struct list_head *tokill, int force_early)
{
	...

	if (unlikely(folio_test_ksm(folio)))
		...
	else if (folio_test_anon(folio))
		collect_procs_anon(folio, page, tokill, force_early);
	else
		...
}

So collect_procs_anon() still gets the poisoned page, not &folio->page:

static void collect_procs_anon(const struct folio *folio,
		const struct page *page, struct list_head *to_kill,
		int force_early)
{
	...

	pgoff = page_pgoff(folio, page);
	rcu_read_lock();
	for_each_process(tsk) {
		...
		
		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
					       pgoff, pgoff) {
			...
			addr = page_mapped_in_vma(page, vma);
			...
		}
	}
	rcu_read_unlock();
	anon_vma_unlock_read(av);
}

page_mapped_in_vma() then builds pvmw for that page:

unsigned long page_mapped_in_vma(const struct page *page,
		struct vm_area_struct *vma)
{
	const struct folio *folio = page_folio(page);
	struct page_vma_mapped_walk pvmw = {
		.pfn = page_to_pfn(page),
		.nr_pages = 1,
		.vma = vma,
		.flags = PVMW_SYNC,
	};

	pvmw.address = vma_address(vma, page_pgoff(folio, page), 1);
	...
}

and page_pgoff() includes the subpage index:

static inline pgoff_t page_pgoff(const struct folio *folio,
		const struct page *page)
{
	return folio->index + folio_page_idx(folio, page);
}

So if the poisoned PFN points to a tail page, pvmw->address can be offset
from the start of the hugetlb mapping by

folio_page_idx(folio, page) << PAGE_SHIFT

Should check_pte() pass the hugepage-aligned address to huge_ptep_get()
for that case?

Cheers, Lance

>> 
>> For arm64, CONT_PMD_SIZE is one supported HugeTLB size. With such a VMA,
>> page_vma_mapped_walk() passes that size to hugetlb_walk():
>> 
>> bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>> {
>> 	...
>> 	if (unlikely(is_vm_hugetlb_page(vma))) {
>> 		...
>> 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>> 		...
>> 	}
>> 	...
>> }
>> 
>> hugetlb_walk() then calls arm64 huge_pte_offset(mm, addr, sz). For
>> sz == CONT_PMD_SIZE, huge_pte_offset() aligns its local addr before
>> calculating pmdp:
>> 
>> pte_t *huge_pte_offset(struct mm_struct *mm,
>> 		       unsigned long addr, unsigned long sz)
>> {
>> 	...
>> 	if (sz == CONT_PMD_SIZE)
>> 		addr &= CONT_PMD_MASK;
>> 
>> 	pmdp = pmd_offset(pudp, addr);
>> 	pmd = READ_ONCE(*pmdp);
>> 	...
>> }
>> 
>> So for that case, pvmw->pte is calculated from the aligned addr, not
>> necessarily from the original pvmw->address. But check_pte() passes the
>> original address together with pvmw->pte:
>> 
>> +		ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>> +				      pvmw->pte);
>> 
>> arm64 then uses that addr again to choose ncontig:
>> 
>> pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> {
>> 	...
>> 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>> 	for (i = 0; i < ncontig; i++, ptep++) {
>> 		...
>> 	}
>> 	return orig_pte;
>> }
>> 
>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>> 			   pte_t *ptep, size_t *pgsize)
>> {
>> 	pgd_t *pgdp = pgd_offset(mm, addr);
>> 	p4d_t *p4dp;
>> 	pud_t *pudp;
>> 	pmd_t *pmdp;
>> 
>> 	*pgsize = PAGE_SIZE;
>> 	p4dp = p4d_offset(pgdp, addr);
>> 	pudp = pud_offset(p4dp, addr);
>> 	pmdp = pmd_offset(pudp, addr);
>> 	if ((pte_t *)pmdp == ptep) {
>> 		*pgsize = PMD_SIZE;
>> 		return CONT_PMDS;
>> 	}
>> 	return CONT_PTES;
>> }
>> 
>> With a tail address, pmdp may no longer point at pvmw->pte, so
>> find_num_contig() can return CONT_PTES for a CONT_PMD HugeTLB mapping.
>> 
>> On 16K arm64, that changes ncontig from 32 to 128. So huge_ptep_get()
>> can walk past the CONT_PMD entries, and possibly past the PMD table.
>> 
>> Should check_pte() pass the address matching pvmw->pte, sth like:
>> 
>> ---8<---
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index 406fd50bbd8f..58463493bd3d 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -109,11 +109,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>  	unsigned long pfn;
>>  	pte_t ptent;
>> 
>> -	if (is_vm_hugetlb_page(pvmw->vma))
>> -		ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>> -				      pvmw->pte);
>> -	else
>> +	if (is_vm_hugetlb_page(pvmw->vma)) {
>> +		struct hstate *hstate = hstate_vma(pvmw->vma);
>> +		unsigned long haddr = pvmw->address & huge_page_mask(hstate);
>> +
>> +		ptent = huge_ptep_get(pvmw->vma->vm_mm, haddr, pvmw->pte);
>> +	} else {
>>  		ptent = ptep_get(pvmw->pte);
>> +	}
>> 
>>  	if (pvmw->flags & PVMW_MIGRATION) {
>>  		const softleaf_t entry = softleaf_from_pte(ptent);
>> --
>> 
>> while leaving pvmw->address unchanged for page_mapped_in_vma()?
>> 
>> Cheers, Lance
>> 
>>> +	else
>>> +		ptent = ptep_get(pvmw->pte);
>>>
>>> 	if (pvmw->flags & PVMW_MIGRATION) {
>>> 		const softleaf_t entry = softleaf_from_pte(ptent);
>>> -- 
>>> 2.43.0
>>>
>>>
>
>


  reply	other threads:[~2026-06-26 14:10 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-25 11:29 [PATCH 0/5] Fix incorrect access of hugetlb pte entries Dev Jain
2026-06-25 11:29 ` [PATCH 1/5] mm/rmap: use huge_ptep_get() in try_to_unmap_one() Dev Jain
2026-06-26  3:17   ` Muchun Song
2026-06-26  4:03     ` Dev Jain
2026-06-26  4:16       ` Muchun Song
2026-06-25 11:29 ` [PATCH 2/5] mm/rmap: use huge_ptep_get() in try_to_migrate_one() Dev Jain
2026-06-26  3:24   ` Muchun Song
2026-06-25 11:29 ` [PATCH 3/5] mm/migrate: use huge_ptep_get() in remove_migration_pte() Dev Jain
2026-06-26  3:32   ` Muchun Song
2026-06-25 11:29 ` [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb Dev Jain
2026-06-26  2:31   ` Lance Yang
2026-06-26  4:06     ` Dev Jain
2026-06-26  7:48   ` Lance Yang
2026-06-26  9:14     ` Lance Yang
2026-06-26 13:23     ` Dev Jain
2026-06-26 14:10       ` Lance Yang [this message]
2026-06-26 15:26         ` Dev Jain
2026-06-26 16:46           ` Lance Yang
2026-06-25 11:29 ` [PATCH 5/5] mm/mprotect: " Dev Jain
2026-06-26  3:40   ` Muchun Song
2026-06-26  4:08     ` Dev Jain
2026-06-26  4:21       ` Muchun Song
2026-06-26  4:42         ` Dev Jain
2026-06-25 13:59 ` [PATCH 0/5] Fix incorrect access of hugetlb pte entries Zi Yan
2026-06-26  4:09   ` Dev Jain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260626141031.14309-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=byungchul@sk.com \
    --cc=dave.hansen@intel.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gourry@gourry.net \
    --cc=harry@kernel.org \
    --cc=j-nomura@ce.jp.nec.com \
    --cc=jannh@google.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=jpoimboe@kernel.org \
    --cc=kas@kernel.org \
    --cc=liam@infradead.org \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=matthew.brost@intel.com \
    --cc=mel@csn.ul.ie \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@suse.de \
    --cc=pfalcato@suse.de \
    --cc=rakie.kim@sk.com \
    --cc=rcampbell@nvidia.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@kernel.org \
    --cc=vbabka@kernel.org \
    --cc=ying.huang@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox