Linux Documentation
 help / color / mirror / Atom feed
From: Lorenzo Stoakes <ljs@kernel.org>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Nico Pache <npache@redhat.com>,
	linux-doc@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	 aarcange@redhat.com, akpm@linux-foundation.org,
	anshuman.khandual@arm.com,  apopple@nvidia.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	 byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org,
	corbet@lwn.net,  dave.hansen@linux.intel.com, dev.jain@arm.com,
	gourry@gourry.net, hannes@cmpxchg.org,  hughd@google.com,
	jack@suse.cz, jackmanb@google.com, jannh@google.com,
	 jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org,
	lance.yang@linux.dev,  liam@infradead.org,
	mathieu.desnoyers@efficios.com, matthew.brost@intel.com,
	 mhiramat@kernel.org, mhocko@suse.com, peterx@redhat.com,
	pfalcato@suse.de,  rakie.kim@sk.com, raquini@redhat.com,
	rdunlap@infradead.org,  richard.weiyang@gmail.com,
	rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org,
	 ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com,
	surenb@google.com,  thomas.hellstrom@linux.intel.com,
	tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz,
	 vishal.moola@gmail.com, wangkefeng.wang@huawei.com,
	will@kernel.org, willy@infradead.org,
	 yang@os.amperecomputing.com, ying.huang@linux.alibaba.com,
	ziy@nvidia.com, zokeefe@google.com
Subject: Re: [PATCH mm-unstable v19 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
Date: Fri, 5 Jun 2026 19:15:20 +0100	[thread overview]
Message-ID: <aiMNT7fBUkZS1EJK@lucifer> (raw)
In-Reply-To: <95390529-3a80-473c-9433-958db7a2dc6c@kernel.org>

On Fri, Jun 05, 2026 at 07:48:17PM +0200, David Hildenbrand (Arm) wrote:
> On 6/5/26 18:14, Nico Pache wrote:
> > Pass an order to collapse_huge_page to support collapsing anon memory to
> > arbitrary orders within a PMD. order indicates what mTHP size we are
> > attempting to collapse to.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > access/changes to the page tables. This can happen if the rmap walkers hit
> > a pmd_none while the PMD entry is currently unavailable due to being
> > temporarily removed during the collapse phase.
> >
> > To properly establish the page table hierarchy without violating any
> > expectations from certain architectures (e.g. MIPS), we must make sure to
> > have the PMD reinstalled before the PTEs, and hold both PTE/PMD locks
> > before calling update_mmu_cache_range() (if they are distinct locks).
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
>
> [...]
>
> >  	 */
> >  	__folio_mark_uptodate(folio);
> > -	pgtable = pmd_pgtable(_pmd);
> > -
> >  	spin_lock(pmd_ptl);
> > -	BUG_ON(!pmd_none(*pmd));
> > -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > +	VM_WARN_ON_ONCE(!pmd_none(*pmd));
> > +	if (is_pmd_order(order)) {
> > +		pgtable = pmd_pgtable(_pmd);
> > +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > +	} else {
> > +		/*
> > +		 * Some architectures (e.g. MIPS) walk the live page table in
> > +		 * their implementation. update_mmu_cache_range() must be called
> > +		 * with a valid page table hierarchy and the PTE lock held.
> > +		 * Acquire it nested inside pmd_ptl when they are distinct locks.
> > +		 */
> > +		if (pte_ptl != pmd_ptl)
> > +			spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> > +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> > +					  /*uffd_wp=*/ false);
> > +		if (pte_ptl != pmd_ptl)
> > +			spin_unlock(pte_ptl);
> > +	}
> >  	spin_unlock(pmd_ptl);
> >
> >  	folio = NULL;
> >
> >  	result = SCAN_SUCCEED;
> >  out_up_write:
> > +	if (anon_vma_locked)
> > +		anon_vma_unlock_write(vma->anon_vma);
> > +	if (pte)
> > +		pte_unmap(pte);
>
> We re-enable some page table walkers before we unmap the PTE.
>
> We still hold the mmap lock in write mode, so nothing would currently try
> reclaiming the page table concurrently.

Reclaim uses rmap walkers though?

Oh you mean as in page table teardown, we're safe from higher level page table
teardown, but we're not safe from zap PTE page table teardown, as
CONFIG_PT_RECLAIM makes this possible on zap now.

That is RCU safe, so the unmap would keep us safe here, but now we could lose
the PTE page table.

But, only MADV_DONTNEED sets reclaim_pt = true, and that holds the VMA read lock
so we're safe.

And anyway:

MADV_DONTNEED - VMA read lock (we hold VMA write lock)
zap_vma_for_reaping() - mmap read lock
process teardown, munmap - mmap read lock
fault - vma/mmap read lock

So the vma/mmap locks save us from those.

So rmap-wise, only the i_mmap walkers remain (truncate, hole punch, et al. and
also hugetlbfs truncate/hole-punch which does its own nonsense too), but none of
those allow for reclaim_pt to happen in any case.

So yeah we're safe but we should what, reorder these 2 statements?

But yes I agree that can be a follow-up, nothing's broken AFAICT.

>
> So I guess this works right now, but we should likely rework that code later to
> either revert both statements. Or maybe we can simply unmap like we did, and
> simply remap before we call map_anon_folio_pte_nopf()? Remapping should not fail.
>
> Alternatively to an unmap+remap, I think we could also unmap earlier for PMD
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6de935e76ceb..ba2a2508dda6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1378,6 +1378,8 @@ static enum scan_result collapse_huge_page(struct
> mm_struct *mm, unsigned long s
>         if (is_pmd_order(order)) {
>                 anon_vma_unlock_write(vma->anon_vma);
>                 anon_vma_locked = false;
> +               pte_unmap(pte);
> +               pte = NULL;
>         }
>
>         result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>
> But this can also be handled later.

Yup I mean it'd be nicer to do it in one place if we can (+ impact of holding
RCU lock longer not an issue), but all this code needs rewokr anyway.

>
> We now hold an anon_vma lock a bit longer for !pmd-collapse. But there is also
> less to copy. If that bites us, we can try optimizing later.

Yeah I do worry about holding these locks longer. But we'll see.

>
>
> So after another skim, I think this patch is ready for primetime. We can address
> the things mentioned above later ... and any fallout can be fixed later, if any.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Yes, also from my side - after a git range-diff and looking into above, LGTM,
so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Now to go look at the core algo patch :)

>
>
> --
> Cheers,
>
> David

Cheers, Lorenzo

  reply	other threads:[~2026-06-05 18:15 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-05 16:14 [PATCH mm-unstable v19 00/14] khugepaged: add mTHP collapse support Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 02/14] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2026-06-05 19:03   ` Zi Yan
2026-06-05 16:14 ` [PATCH mm-unstable v19 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Nico Pache
2026-06-05 20:07   ` Zi Yan
2026-06-05 16:14 ` [PATCH mm-unstable v19 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2026-06-05 17:48   ` David Hildenbrand (Arm)
2026-06-05 18:15     ` Lorenzo Stoakes [this message]
2026-06-05 18:18   ` Lorenzo Stoakes
2026-06-05 16:14 ` [PATCH mm-unstable v19 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 09/14] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 10/14] mm/khugepaged: introduce collapse_possible_orders helper functions Nico Pache
2026-06-05 17:46   ` Lorenzo Stoakes
2026-06-05 16:14 ` [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
2026-06-05 18:03   ` David Hildenbrand (Arm)
2026-06-05 18:38   ` Lorenzo Stoakes
2026-06-05 16:14 ` [PATCH mm-unstable v19 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2026-06-05 17:49   ` David Hildenbrand (Arm)
2026-06-05 18:16     ` Lorenzo Stoakes
2026-06-05 16:14 ` [PATCH mm-unstable v19 13/14] mm/khugepaged: run khugepaged for all orders Nico Pache
2026-06-05 16:14 ` [PATCH mm-unstable v19 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2026-06-05 17:52   ` David Hildenbrand (Arm)
2026-06-05 18:20   ` Lorenzo Stoakes
2026-06-05 18:07 ` [PATCH mm-unstable v19 00/14] khugepaged: add mTHP collapse support David Hildenbrand (Arm)
2026-06-05 18:39   ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aiMNT7fBUkZS1EJK@lucifer \
    --to=ljs@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=byungchul@sk.com \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jackmanb@google.com \
    --cc=jannh@google.com \
    --cc=jglisse@google.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kas@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=liam@infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=matthew.brost@intel.com \
    --cc=mhiramat@kernel.org \
    --cc=mhocko@suse.com \
    --cc=npache@redhat.com \
    --cc=peterx@redhat.com \
    --cc=pfalcato@suse.de \
    --cc=rakie.kim@sk.com \
    --cc=raquini@redhat.com \
    --cc=rdunlap@infradead.org \
    --cc=richard.weiyang@gmail.com \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shivankg@amd.com \
    --cc=sunnanyong@huawei.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=tiwai@suse.de \
    --cc=usamaarif642@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=vishal.moola@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yang@os.amperecomputing.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox