Re: [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Lance Yang <lance.yang@linux.dev>
To: Dev Jain <dev.jain@arm.com>, linmiaohe@huawei.com
Cc: muchun.song@linux.dev, osalvador@suse.de,
	akpm@linux-foundation.org, ljs@kernel.org, david@kernel.org,
	liam@infradead.org, riel@surriel.com, vbabka@kernel.org,
	harry@kernel.org, jannh@google.com, kas@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	rcampbell@nvidia.com, apopple@nvidia.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, mel@csn.ul.ie,
	nao.horiguchi@gmail.com, ak@linux.intel.com,
	j-nomura@ce.jp.nec.com, pfalcato@suse.de, dave.hansen@intel.com,
	tglx@kernel.org, jpoimboe@kernel.org, ryan.roberts@arm.com,
	anshuman.khandual@arm.com, stable@vger.kernel.org
Subject: Re: [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb
Date: Sat, 27 Jun 2026 00:46:38 +0800	[thread overview]
Message-ID: <edee8461-34c6-41e9-ae0c-076380d92ebb@linux.dev> (raw)
In-Reply-To: <61e9fcf7-02a5-4285-948b-62fba4dcd69c@arm.com>



On 2026/6/26 23:26, Dev Jain wrote:
> 
> 
> On 26/06/26 7:40 pm, Lance Yang wrote:
>>
>> On Fri, Jun 26, 2026 at 06:53:10PM +0530, Dev Jain wrote:
>>>
>>>
>>> On 26/06/26 1:18 pm, Lance Yang wrote:
>>>>
>>>> On Thu, Jun 25, 2026 at 11:29:53AM +0000, Dev Jain wrote:
>>>>> check_pte() is the final validation step in page_vma_mapped_walk().
>>>>> It reads pvmw->pte with ptep_get() to decide whether the entry maps
>>>>> the PFN range being walked. For hugetlb VMAs, that pointer refers
>>>>> to a hugetlb entry.
>>>>>
>>>>> On arches which provide their own huge_ptep_get() to dereference a huge
>>>>> pte pointer, accessing via ptep_get() would cause pte_pfn(),
>>>>> pte_present() etc to misbehave.
>>>>>
>>>>> It is not clear whether this has a trivially visible effect to userspace.
>>>>>
>>>>> Use huge_ptep_get() to dereference a huge pte pointer.
>>>>>
>>>>> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
>>>>> Cc: stable@vger.kernel.org
>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>> ---
>>>>> mm/page_vma_mapped.c | 8 +++++++-
>>>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>>> index 2ccbabfb2cc17..18e1d341f463c 100644
>>>>> --- a/mm/page_vma_mapped.c
>>>>> +++ b/mm/page_vma_mapped.c
>>>>> @@ -107,7 +107,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>>>>> static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>>> {
>>>>> 	unsigned long pfn;
>>>>> -	pte_t ptent = ptep_get(pvmw->pte);
>>>>> +	pte_t ptent;
>>>>> +
>>>>> +	if (is_vm_hugetlb_page(pvmw->vma))
>>>>> +		ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>>>> +				      pvmw->pte);
>>>>
>>>> I think check_pte() can pass a wrong address to huge_ptep_get() ...
>>>
>>> Won't this be handled by rmap_walk_anon/rmap_walk_file - they are the ones
>>> performing the rmap traversal and passing address to try_to_unmap_one/folio_referenced_one
>>> etc ...
>>
>> Right, that should cover the rmap callbacks. The bit I was worried about
>> is page_mapped_in_vma() though.
>>
>>>>
>>>> Not sure that is wrong in the first place. For memory failure,
>>>> page_mapped_in_vma() can be called with a poisoned tail page of a hugetlb
>>>> folio. In that case, pvmw->address need not be hugepage-aligned.
>>>>
>>>> @Miaohe
>>
>> For hugetlb memory failure we start with the poisoned PFN:
>>
>> static int try_memory_failure_hugetlb(unsigned long pfn, int flags)
>> {
>> 	...
>> 	struct page *p = pfn_to_page(pfn);
>> 	struct folio *folio;
>> 	...
>>
>> 	folio = page_folio(p);
>>
>> 	...
>>
>> 	if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
>> 		...
>> 	}
>>
>> 	...
>> }
>>
>> and pass the same p down:
>>
>> static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>> 		unsigned long pfn, int flags)
>> {
>> 	...
>>
>> 	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>>
>> 	...
>> }
>>
>> static void collect_procs(const struct folio *folio, const struct page *page,
>> 		struct list_head *tokill, int force_early)
>> {
>> 	...
>>
>> 	if (unlikely(folio_test_ksm(folio)))
>> 		...
>> 	else if (folio_test_anon(folio))
>> 		collect_procs_anon(folio, page, tokill, force_early);
>> 	else
>> 		...
>> }
>>
>> So collect_procs_anon() still gets the poisoned page, not &folio->page:
>>
>> static void collect_procs_anon(const struct folio *folio,
>> 		const struct page *page, struct list_head *to_kill,
>> 		int force_early)
>> {
>> 	...
>>
>> 	pgoff = page_pgoff(folio, page);
>> 	rcu_read_lock();
>> 	for_each_process(tsk) {
>> 		...
>> 		
>> 		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
>> 					       pgoff, pgoff) {
>> 			...
>> 			addr = page_mapped_in_vma(page, vma);
>> 			...
>> 		}
>> 	}
>> 	rcu_read_unlock();
>> 	anon_vma_unlock_read(av);
>> }
>>
>> page_mapped_in_vma() then builds pvmw for that page:
>>
>> unsigned long page_mapped_in_vma(const struct page *page,
>> 		struct vm_area_struct *vma)
>> {
>> 	const struct folio *folio = page_folio(page);
>> 	struct page_vma_mapped_walk pvmw = {
>> 		.pfn = page_to_pfn(page),
>> 		.nr_pages = 1,
>> 		.vma = vma,
>> 		.flags = PVMW_SYNC,
>> 	};
>>
>> 	pvmw.address = vma_address(vma, page_pgoff(folio, page), 1);
>> 	...
>> }
>>
>> and page_pgoff() includes the subpage index:
>>
>> static inline pgoff_t page_pgoff(const struct folio *folio,
>> 		const struct page *page)
>> {
>> 	return folio->index + folio_page_idx(folio, page);
>> }
>>
>> So if the poisoned PFN points to a tail page, pvmw->address can be offset
>> from the start of the hugetlb mapping by
>>
>> folio_page_idx(folio, page) << PAGE_SHIFT
>>
>> Should check_pte() pass the hugepage-aligned address to huge_ptep_get()
>> for that case?
> 
> Thanks! This looks correct.
> 
> I can indeed fix this up in check_pte. But in the memory-failure path
> it has always been confusing to me for hugetlb folios why we are bothering
> with the tail page. I am sure that area can also be simplified. But for
> now I'll just do a simple fix here itself.

Just thinking out loud: given that huge_ptep_get() already assumes that
addr matches the huge pte, at least on arm64, would it make sense to
have a small hugetlb wrapper around it that takes hstate and aligns
the address before calling the arch helper?

Might make the rule clearer, and a bit harder to get wrong again :)

Thanks, Lance

> 
>>
>> Cheers, Lance
>>
>>>>
>>>> For arm64, CONT_PMD_SIZE is one supported HugeTLB size. With such a VMA,
>>>> page_vma_mapped_walk() passes that size to hugetlb_walk():
>>>>
>>>> bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>> {
>>>> 	...
>>>> 	if (unlikely(is_vm_hugetlb_page(vma))) {
>>>> 		...
>>>> 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>>>> 		...
>>>> 	}
>>>> 	...
>>>> }
>>>>
>>>> hugetlb_walk() then calls arm64 huge_pte_offset(mm, addr, sz). For
>>>> sz == CONT_PMD_SIZE, huge_pte_offset() aligns its local addr before
>>>> calculating pmdp:
>>>>
>>>> pte_t *huge_pte_offset(struct mm_struct *mm,
>>>> 		       unsigned long addr, unsigned long sz)
>>>> {
>>>> 	...
>>>> 	if (sz == CONT_PMD_SIZE)
>>>> 		addr &= CONT_PMD_MASK;
>>>>
>>>> 	pmdp = pmd_offset(pudp, addr);
>>>> 	pmd = READ_ONCE(*pmdp);
>>>> 	...
>>>> }
>>>>
>>>> So for that case, pvmw->pte is calculated from the aligned addr, not
>>>> necessarily from the original pvmw->address. But check_pte() passes the
>>>> original address together with pvmw->pte:
>>>>
>>>> +		ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>>> +				      pvmw->pte);
>>>>
>>>> arm64 then uses that addr again to choose ncontig:
>>>>
>>>> pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>> {
>>>> 	...
>>>> 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>>> 	for (i = 0; i < ncontig; i++, ptep++) {
>>>> 		...
>>>> 	}
>>>> 	return orig_pte;
>>>> }
>>>>
>>>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>>>> 			   pte_t *ptep, size_t *pgsize)
>>>> {
>>>> 	pgd_t *pgdp = pgd_offset(mm, addr);
>>>> 	p4d_t *p4dp;
>>>> 	pud_t *pudp;
>>>> 	pmd_t *pmdp;
>>>>
>>>> 	*pgsize = PAGE_SIZE;
>>>> 	p4dp = p4d_offset(pgdp, addr);
>>>> 	pudp = pud_offset(p4dp, addr);
>>>> 	pmdp = pmd_offset(pudp, addr);
>>>> 	if ((pte_t *)pmdp == ptep) {
>>>> 		*pgsize = PMD_SIZE;
>>>> 		return CONT_PMDS;
>>>> 	}
>>>> 	return CONT_PTES;
>>>> }
>>>>
>>>> With a tail address, pmdp may no longer point at pvmw->pte, so
>>>> find_num_contig() can return CONT_PTES for a CONT_PMD HugeTLB mapping.
>>>>
>>>> On 16K arm64, that changes ncontig from 32 to 128. So huge_ptep_get()
>>>> can walk past the CONT_PMD entries, and possibly past the PMD table.
>>>>
>>>> Should check_pte() pass the address matching pvmw->pte, sth like:
>>>>
>>>> ---8<---
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index 406fd50bbd8f..58463493bd3d 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -109,11 +109,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>>   	unsigned long pfn;
>>>>   	pte_t ptent;
>>>>
>>>> -	if (is_vm_hugetlb_page(pvmw->vma))
>>>> -		ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>>> -				      pvmw->pte);
>>>> -	else
>>>> +	if (is_vm_hugetlb_page(pvmw->vma)) {
>>>> +		struct hstate *hstate = hstate_vma(pvmw->vma);
>>>> +		unsigned long haddr = pvmw->address & huge_page_mask(hstate);
>>>> +
>>>> +		ptent = huge_ptep_get(pvmw->vma->vm_mm, haddr, pvmw->pte);
>>>> +	} else {
>>>>   		ptent = ptep_get(pvmw->pte);
>>>> +	}
>>>>
>>>>   	if (pvmw->flags & PVMW_MIGRATION) {
>>>>   		const softleaf_t entry = softleaf_from_pte(ptent);
>>>> --
>>>>
>>>> while leaving pvmw->address unchanged for page_mapped_in_vma()?
>>>>
>>>> Cheers, Lance
>>>>
>>>>> +	else
>>>>> +		ptent = ptep_get(pvmw->pte);
>>>>>
>>>>> 	if (pvmw->flags & PVMW_MIGRATION) {
>>>>> 		const softleaf_t entry = softleaf_from_pte(ptent);
>>>>> -- 
>>>>> 2.43.0
>>>>>
>>>>>
>>>
>>>
>>
>

next prev parent reply	other threads:[~2026-06-26 16:46 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-25 11:29 [PATCH 0/5] Fix incorrect access of hugetlb pte entries Dev Jain
2026-06-25 11:29 ` [PATCH 1/5] mm/rmap: use huge_ptep_get() in try_to_unmap_one() Dev Jain
2026-06-26  3:17   ` Muchun Song
2026-06-26  4:03     ` Dev Jain
2026-06-26  4:16       ` Muchun Song
2026-06-25 11:29 ` [PATCH 2/5] mm/rmap: use huge_ptep_get() in try_to_migrate_one() Dev Jain
2026-06-26  3:24   ` Muchun Song
2026-06-25 11:29 ` [PATCH 3/5] mm/migrate: use huge_ptep_get() in remove_migration_pte() Dev Jain
2026-06-26  3:32   ` Muchun Song
2026-06-25 11:29 ` [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb Dev Jain
2026-06-26  2:31   ` Lance Yang
2026-06-26  4:06     ` Dev Jain
2026-06-26  7:48   ` Lance Yang
2026-06-26  9:14     ` Lance Yang
2026-06-26 13:23     ` Dev Jain
2026-06-26 14:10       ` Lance Yang
2026-06-26 15:26         ` Dev Jain
2026-06-26 16:46           ` Lance Yang [this message]
2026-06-25 11:29 ` [PATCH 5/5] mm/mprotect: " Dev Jain
2026-06-26  3:40   ` Muchun Song
2026-06-26  4:08     ` Dev Jain
2026-06-26  4:21       ` Muchun Song
2026-06-26  4:42         ` Dev Jain
2026-06-25 13:59 ` [PATCH 0/5] Fix incorrect access of hugetlb pte entries Zi Yan
2026-06-26  4:09   ` Dev Jain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=edee8461-34c6-41e9-ae0c-076380d92ebb@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=byungchul@sk.com \
    --cc=dave.hansen@intel.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gourry@gourry.net \
    --cc=harry@kernel.org \
    --cc=j-nomura@ce.jp.nec.com \
    --cc=jannh@google.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=jpoimboe@kernel.org \
    --cc=kas@kernel.org \
    --cc=liam@infradead.org \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=matthew.brost@intel.com \
    --cc=mel@csn.ul.ie \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@suse.de \
    --cc=pfalcato@suse.de \
    --cc=rakie.kim@sk.com \
    --cc=rcampbell@nvidia.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@kernel.org \
    --cc=vbabka@kernel.org \
    --cc=ying.huang@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.