[PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites
@ 2026-03-11 21:13 Nico Pache
  2026-03-11 21:13 ` [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers Nico Pache
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Nico Pache @ 2026-03-11 21:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

The following series contains cleanups and prerequisites for my work on
khugepaged mTHP support [1]. These have been separated out to ease review.

The first patch in the series refactors the page fault folio to pte mapping
and follows a similar convention as defined by map_anon_folio_pmd_(no)pf().
This not only cleans up the current implementation of do_anonymous_page(),
but will allow for reuse later in the khugepaged mTHP implementation.

The second patch adds a small is_pmd_order() helper to check if an order is
the PMD order. This check is open-coded in a number of places. This patch
aims to clean this up and will be used more in the khugepaged mTHP work.
The third patch also adds a small DEFINE for (HPAGE_PMD_NR - 1) which is
used often across the khugepaged code.

The fourth and fifth patch come from the khugepaged mTHP patchset [1].
These two patches include the rename of function prefixes, and the
unification of khugepaged and madvise_collapse via a new
collapse_single_pmd function.

Patch 1:     refactor do_anonymous_page into map_anon_folio_pte_(no)pf
Patch 2:     add is_pmd_order helper
Patch 3:     Add define for (HPAGE_PMD_NR - 1)
Patch 4:     Refactor/rename hpage_collapse
Patch 5:     Refactoring to combine madvise_collapse and khugepaged

Testing:
- Built for x86_64, aarch64, ppc64le, and s390x
- ran all arches on test suites provided by the kernel-tests project
- selftests mm

V3 Changes:
- Patch1: leverage folio_order rather than passing nr_pages into
  map_anon_folio_pte_pf [2]
- Patch 2: fixed conflict with is_pmd_order
- Patch 3: Change COLLAPSE_MAX_PTES_LIMIT to KHUGEPAGED_MAX_PTES_LIMIT [3]
- Patch 4: conflict resolution with other khugepaged patches
- Patch 5: Drop lock_dropped unnecessary change, Dropped Lorenzos RB,
  !triggered_wb in if statement rather than triggered_wb, remove
  unnecessary mmap_locked flipping [4]

V2 - https://lore.kernel.org/all/20260226012929.169479-1-npache@redhat.com/
V1 - https://lore.kernel.org/lkml/20260212021835.17755-1-npache@redhat.com/

A big thanks to everyone that has reviewed, tested, and participated in
the development process.

[1] - https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com/
[2] - https://lore.kernel.org/all/7334b702-f6a0-4ccf-8ac6-8426a90d1846@kernel.org/
[3] - https://lore.kernel.org/all/25723c0f-c702-44ad-93e9-1056313680cd@kernel.org/
[4] - https://lore.kernel.org/all/81ff9caa-50f2-4951-8d82-2c8dcdf3db91@kernel.org/

Nico Pache (5):
  mm: consolidate anonymous folio PTE mapping into helpers
  mm: introduce is_pmd_order helper
  mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1
  mm/khugepaged: rename hpage_collapse_* to collapse_*
  mm/khugepaged: unify khugepaged and madv_collapse with
    collapse_single_pmd()

 include/linux/huge_mm.h |   5 ++
 include/linux/mm.h      |   4 +
 mm/huge_memory.c        |   2 +-
 mm/khugepaged.c         | 187 +++++++++++++++++++++-------------------
 mm/memory.c             |  62 ++++++++-----
 mm/mempolicy.c          |   2 +-
 mm/mremap.c             |   2 +-
 mm/page_alloc.c         |   4 +-
 mm/shmem.c              |   3 +-
 9 files changed, 153 insertions(+), 118 deletions(-)

-- 
2.53.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers
  2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
@ 2026-03-11 21:13 ` Nico Pache
  2026-03-16 18:17   ` Lorenzo Stoakes (Oracle)
  2026-03-11 21:13 ` [PATCH mm-unstable v3 2/5] mm: introduce is_pmd_order helper Nico Pache
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Nico Pache @ 2026-03-11 21:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

The anonymous page fault handler in do_anonymous_page() open-codes the
sequence to map a newly allocated anonymous folio at the PTE level:
	- construct the PTE entry
	- add rmap
	- add to LRU
	- set the PTEs
	- update the MMU cache.

Introduce a two helpers to consolidate this duplicated logic, mirroring the
existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:

	map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
	references, adds anon rmap and LRU. This function also handles the
	uffd_wp that can occur in the pf variant.

	map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
	counter updates, and mTHP fault allocation statistics for the page fault
	path.

The zero-page read path in do_anonymous_page() is also untangled from the
shared setpte label, since it does not allocate a folio and should not
share the same mapping sequence as the write path. Make nr_pages = 1
rather than relying on the variable. This makes it more clear that we
are operating on the zero page only.

This refactoring will also help reduce code duplication between mm/memory.c
and mm/khugepaged.c, and provides a clean API for PTE-level anonymous folio
mapping that can be reused by future callers.

Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/mm.h |  4 ++++
 mm/memory.c        | 60 +++++++++++++++++++++++++++++++---------------
 2 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4c4fd55fc823..9fea354bd17f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4903,4 +4903,8 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
 
 void snapshot_page(struct page_snapshot *ps, const struct page *page);
 
+void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
+		struct vm_area_struct *vma, unsigned long addr,
+		bool uffd_wp);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index 6aa0ea4af1fc..5c8bf1eb55f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5197,6 +5197,37 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
 }
 
+void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
+		struct vm_area_struct *vma, unsigned long addr,
+		bool uffd_wp)
+{
+	unsigned int nr_pages = folio_nr_pages(folio);
+	pte_t entry = folio_mk_pte(folio, vma->vm_page_prot);
+
+	entry = pte_sw_mkyoung(entry);
+
+	if (vma->vm_flags & VM_WRITE)
+		entry = pte_mkwrite(pte_mkdirty(entry), vma);
+	if (uffd_wp)
+		entry = pte_mkuffd_wp(entry);
+
+	folio_ref_add(folio, nr_pages - 1);
+	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+	folio_add_lru_vma(folio, vma);
+	set_ptes(vma->vm_mm, addr, pte, entry, nr_pages);
+	update_mmu_cache_range(NULL, vma, addr, pte, nr_pages);
+}
+
+static void map_anon_folio_pte_pf(struct folio *folio, pte_t *pte,
+		struct vm_area_struct *vma, unsigned long addr, bool uffd_wp)
+{
+	unsigned int order = folio_order(folio);
+
+	map_anon_folio_pte_nopf(folio, pte, vma, addr, uffd_wp);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1 << order);
+	count_mthp_stat(order, MTHP_STAT_ANON_FAULT_ALLOC);
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -5243,7 +5274,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		if (vmf_orig_pte_uffd_wp(vmf))
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache_range(vmf, vma, addr, vmf->pte,
+				       /*nr_pages=*/ 1);
+		goto unlock;
 	}
 
 	/* Allocate our own private page. */
@@ -5267,11 +5305,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 */
 	__folio_mark_uptodate(folio);
 
-	entry = folio_mk_pte(folio, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry), vma);
-
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
@@ -5293,19 +5326,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		folio_put(folio);
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
-
-	folio_ref_add(folio, nr_pages - 1);
-	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
-	count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
-	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-setpte:
-	if (vmf_orig_pte_uffd_wp(vmf))
-		entry = pte_mkuffd_wp(entry);
-	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
-
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
+	map_anon_folio_pte_pf(folio, vmf->pte, vma, addr,
+			      vmf_orig_pte_uffd_wp(vmf));
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers
  2026-03-11 21:13 ` [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers Nico Pache
@ 2026-03-16 18:17   ` Lorenzo Stoakes (Oracle)
  2026-03-18 16:48     ` Nico Pache
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 18:17 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, Mar 11, 2026 at 03:13:11PM -0600, Nico Pache wrote:
> The anonymous page fault handler in do_anonymous_page() open-codes the
> sequence to map a newly allocated anonymous folio at the PTE level:
> 	- construct the PTE entry
> 	- add rmap
> 	- add to LRU
> 	- set the PTEs
> 	- update the MMU cache.

Yikes yeah this all needs work. Thanks for looking at this!

>
> Introduce a two helpers to consolidate this duplicated logic, mirroring the

NIT: 'Introduce a two helpers' -> "introduce two helpers'

> existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:
>
> 	map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
> 	references, adds anon rmap and LRU. This function also handles the
> 	uffd_wp that can occur in the pf variant.
>
> 	map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
> 	counter updates, and mTHP fault allocation statistics for the page fault
> 	path.

MEGA nit, not sure why you're not just putting this in a bullet list, just
weird to see not-code indented here :)

>
> The zero-page read path in do_anonymous_page() is also untangled from the
> shared setpte label, since it does not allocate a folio and should not
> share the same mapping sequence as the write path. Make nr_pages = 1
> rather than relying on the variable. This makes it more clear that we
> are operating on the zero page only.
>
> This refactoring will also help reduce code duplication between mm/memory.c
> and mm/khugepaged.c, and provides a clean API for PTE-level anonymous folio
> mapping that can be reused by future callers.

Maybe worth mentioning subseqent patches that will use what you set up
here?

Also you split things out into _nopf() and _pf() variants, it might be
worth saying exactly why you're doing that or what you are preparing to do?

>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Nico Pache <npache@redhat.com>

There's nits above and below, but overall the logic looks good, so with
nits addressed/reasonably responded to:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  include/linux/mm.h |  4 ++++
>  mm/memory.c        | 60 +++++++++++++++++++++++++++++++---------------
>  2 files changed, 45 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4c4fd55fc823..9fea354bd17f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4903,4 +4903,8 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>
>  void snapshot_page(struct page_snapshot *ps, const struct page *page);
>
> +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
> +		struct vm_area_struct *vma, unsigned long addr,
> +		bool uffd_wp);

How I hate how uffd infiltrates all our code like this.

Not your fault :)

> +
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index 6aa0ea4af1fc..5c8bf1eb55f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5197,6 +5197,37 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
>  }
>
> +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
> +		struct vm_area_struct *vma, unsigned long addr,
> +		bool uffd_wp)
> +{
> +	unsigned int nr_pages = folio_nr_pages(folio);

const would be good

> +	pte_t entry = folio_mk_pte(folio, vma->vm_page_prot);
> +
> +	entry = pte_sw_mkyoung(entry);
> +
> +	if (vma->vm_flags & VM_WRITE)
> +		entry = pte_mkwrite(pte_mkdirty(entry), vma);
> +	if (uffd_wp)
> +		entry = pte_mkuffd_wp(entry);
> +
> +	folio_ref_add(folio, nr_pages - 1);
> +	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> +	folio_add_lru_vma(folio, vma);
> +	set_ptes(vma->vm_mm, addr, pte, entry, nr_pages);
> +	update_mmu_cache_range(NULL, vma, addr, pte, nr_pages);
> +}
> +
> +static void map_anon_folio_pte_pf(struct folio *folio, pte_t *pte,
> +		struct vm_area_struct *vma, unsigned long addr, bool uffd_wp)
> +{
> +	unsigned int order = folio_order(folio);

const would be good here also!

> +
> +	map_anon_folio_pte_nopf(folio, pte, vma, addr, uffd_wp);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1 << order);

Is 1 << order strictly right here? This field is a long value, so 1L <<
order maybe? I get nervous about these shifts...

Note that folio_large_nr_pages() uses 1L << order so that does seem
preferable.

> +	count_mthp_stat(order, MTHP_STAT_ANON_FAULT_ALLOC);
> +}
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -5243,7 +5274,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>  		}
> -		goto setpte;
> +		if (vmf_orig_pte_uffd_wp(vmf))
> +			entry = pte_mkuffd_wp(entry);
> +		set_pte_at(vma->vm_mm, addr, vmf->pte, entry);

How I _despise_ how uffd is implemented in mm. Feels like open coded
nonsense spills out everywhere.

Not your fault obviously :)

> +
> +		/* No need to invalidate - it was non-present before */
> +		update_mmu_cache_range(vmf, vma, addr, vmf->pte,
> +				       /*nr_pages=*/ 1);

Is there any point in passing vmf here given you pass NULL above, and it
appears that nobody actually uses this? I guess it doesn't matter but
seeing this immediately made we question why you set it in one, and not the
other?

Maybe I'm mistaken and some arch uses it? Don't think so though.

Also can't we then just use update_mmu_cache() which is the single-page
wrapper of this AFAICT? That'd make it even simpler.

Having done this, is there any reason to keep the annoying and confusing
initial assignment of nr_pages = 1 at declaration time?

It seems that nr_pages is unconditionally assigned before it's used
anywhere now at line 5298:

	nr_pages = folio_nr_pages(folio);
	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
	...

It's kinda weird to use nr_pages again after you go out of your way to
avoid it using folio_nr_pages() in map_anon_folio_pte_nopf() and
folio_order() in map_anon_folio_pte_pf().

But yeah, ok we align the address and it's yucky maybe leave for now (but
we can definitely stop defaulting nr_pages to 1 :)

> +		goto unlock;
>  	}
>
>  	/* Allocate our own private page. */
> @@ -5267,11 +5305,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	 */
>  	__folio_mark_uptodate(folio);
>
> -	entry = folio_mk_pte(folio, vma->vm_page_prot);
> -	entry = pte_sw_mkyoung(entry);
> -	if (vma->vm_flags & VM_WRITE)
> -		entry = pte_mkwrite(pte_mkdirty(entry), vma);
> -
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (!vmf->pte)
>  		goto release;
> @@ -5293,19 +5326,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		folio_put(folio);
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
> -
> -	folio_ref_add(folio, nr_pages - 1);
> -	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> -	count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
> -	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -setpte:
> -	if (vmf_orig_pte_uffd_wp(vmf))
> -		entry = pte_mkuffd_wp(entry);
> -	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
> -
> -	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> +	map_anon_folio_pte_pf(folio, vmf->pte, vma, addr,
> +			      vmf_orig_pte_uffd_wp(vmf));

So we're going from:

entry = folio_mk_pte(...)
entry = pte_sw_mkyoung(...)
if (write)
	entry = pte_mkwrite(also dirty...)
folio_ref_add(nr_pages - 1)
add_mm_counter(... MM_ANON_PAGES, nr_pages)
count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC)
folio_add_new_anon_rmap(.., RMAP_EXCLUSIVE)
folio_add_lru_vma(folio, vma)
if (vmf_orig_pte_uffd_wp(vmf))
	entry = pte_mkuffd_wp(entry)
set_ptes(mm, addr, vmf->pte, entry, nr_pages)
update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages)

To:

<map_anon_folio_pte_nopf>
entry = folio_mk_pte(...)
entry = pte_sw_mkyoung(...)
if (write)
	entry = pte_mkwrite(also dirty...)
if (vmf_orig_pte_uffd_wp(vmf) <passed in via uffd_wp>) <-- reordered
	entry = pte_mkuffd_wp(entry)
folio_ref_add(nr_pages - 1)
folio_add_new_anon_rmap(.., RMAP_EXCLUSIVE)
folio_add_lru_vma(folio, vma)
set_ptes(mm, addr, pte, entry, nr_pages)
update_mmu_cache_range(NULL, vma, addr, pte, nr_pages)

<map_anon_folio_pte_pf>
add_mm_counter(... MM_ANON_PAGES, nr_pages) <-- reordered
count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC) <-- reodrdered

But the reorderings seem fine, and it is achieving the same thing.

All the parameters being passed seem correct too.

>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> --
> 2.53.0
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers
  2026-03-16 18:17   ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 16:48     ` Nico Pache
  0 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2026-03-18 16:48 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 3/16/26 12:17 PM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Mar 11, 2026 at 03:13:11PM -0600, Nico Pache wrote:
>> The anonymous page fault handler in do_anonymous_page() open-codes the
>> sequence to map a newly allocated anonymous folio at the PTE level:
>> 	- construct the PTE entry
>> 	- add rmap
>> 	- add to LRU
>> 	- set the PTEs
>> 	- update the MMU cache.
> 
> Yikes yeah this all needs work. Thanks for looking at this!

np! I believe it looks much cleaner now, and it has the added benefit of
cleaning up some of the mTHP patchset.

I also believe you suggested this so I'll add your SB when I add your RB tag :)

> 
>>
>> Introduce a two helpers to consolidate this duplicated logic, mirroring the
> 
> NIT: 'Introduce a two helpers' -> "introduce two helpers'

ack!

> 
>> existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:
>>
>> 	map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
>> 	references, adds anon rmap and LRU. This function also handles the
>> 	uffd_wp that can occur in the pf variant.
>>
>> 	map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
>> 	counter updates, and mTHP fault allocation statistics for the page fault
>> 	path.
> 
> MEGA nit, not sure why you're not just putting this in a bullet list, just
> weird to see not-code indented here :)

Ill drop the indentation!

> 
>>
>> The zero-page read path in do_anonymous_page() is also untangled from the
>> shared setpte label, since it does not allocate a folio and should not
>> share the same mapping sequence as the write path. Make nr_pages = 1
>> rather than relying on the variable. This makes it more clear that we
>> are operating on the zero page only.
>>
>> This refactoring will also help reduce code duplication between mm/memory.c
>> and mm/khugepaged.c, and provides a clean API for PTE-level anonymous folio
>> mapping that can be reused by future callers.
> 
> Maybe worth mentioning subseqent patches that will use what you set up
> here?

ok ill add something like "... that will be used when adding mTHP support to
khugepaged, and may find future reuse by other callers." Speaking of which I
tried to leverage this elsewhere and I believe that will take extra focus and
time, but may be doable.

> 
> Also you split things out into _nopf() and _pf() variants, it might be
> worth saying exactly why you're doing that or what you are preparing to do?

ok ill expand on this in the "bullet list" you reference above.

> 
>>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> There's nits above and below, but overall the logic looks good, so with
> nits addressed/reasonably responded to:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thanks :) Ill take care of those

> 
>> ---
>>  include/linux/mm.h |  4 ++++
>>  mm/memory.c        | 60 +++++++++++++++++++++++++++++++---------------
>>  2 files changed, 45 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 4c4fd55fc823..9fea354bd17f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4903,4 +4903,8 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>>
>>  void snapshot_page(struct page_snapshot *ps, const struct page *page);
>>
>> +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
>> +		struct vm_area_struct *vma, unsigned long addr,
>> +		bool uffd_wp);
> 
> How I hate how uffd infiltrates all our code like this.
> 
> Not your fault :)
> 
>> +
>>  #endif /* _LINUX_MM_H */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 6aa0ea4af1fc..5c8bf1eb55f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5197,6 +5197,37 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>  	return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
>>  }
>>
>> +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
>> +		struct vm_area_struct *vma, unsigned long addr,
>> +		bool uffd_wp)
>> +{
>> +	unsigned int nr_pages = folio_nr_pages(folio);
> 
> const would be good
> 
>> +	pte_t entry = folio_mk_pte(folio, vma->vm_page_prot);
>> +
>> +	entry = pte_sw_mkyoung(entry);
>> +
>> +	if (vma->vm_flags & VM_WRITE)
>> +		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>> +	if (uffd_wp)
>> +		entry = pte_mkuffd_wp(entry);
>> +
>> +	folio_ref_add(folio, nr_pages - 1);
>> +	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
>> +	folio_add_lru_vma(folio, vma);
>> +	set_ptes(vma->vm_mm, addr, pte, entry, nr_pages);
>> +	update_mmu_cache_range(NULL, vma, addr, pte, nr_pages);
>> +}
>> +
>> +static void map_anon_folio_pte_pf(struct folio *folio, pte_t *pte,
>> +		struct vm_area_struct *vma, unsigned long addr, bool uffd_wp)
>> +{
>> +	unsigned int order = folio_order(folio);
> 
> const would be good here also!

ack on the const's

> 
>> +
>> +	map_anon_folio_pte_nopf(folio, pte, vma, addr, uffd_wp);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1 << order);
> 
> Is 1 << order strictly right here? This field is a long value, so 1L <<
> order maybe? I get nervous about these shifts...
> 
> Note that folio_large_nr_pages() uses 1L << order so that does seem
> preferable.

Ok sounds good thanks!

> 
>> +	count_mthp_stat(order, MTHP_STAT_ANON_FAULT_ALLOC);
>> +}
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -5243,7 +5274,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>  		}
>> -		goto setpte;
>> +		if (vmf_orig_pte_uffd_wp(vmf))
>> +			entry = pte_mkuffd_wp(entry);
>> +		set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> 
> How I _despise_ how uffd is implemented in mm. Feels like open coded
> nonsense spills out everywhere.
> 
> Not your fault obviously :)
> 
>> +
>> +		/* No need to invalidate - it was non-present before */
>> +		update_mmu_cache_range(vmf, vma, addr, vmf->pte,
>> +				       /*nr_pages=*/ 1);
> 
> Is there any point in passing vmf here given you pass NULL above, and it
> appears that nobody actually uses this? I guess it doesn't matter but
> seeing this immediately made we question why you set it in one, and not the
> other?
> 
> Maybe I'm mistaken and some arch uses it? Don't think so though.
> 
> Also can't we then just use update_mmu_cache() which is the single-page
> wrapper of this AFAICT? That'd make it even simpler.

> 
> Having done this, is there any reason to keep the annoying and confusing
> initial assignment of nr_pages = 1 at declaration time?
> 
> It seems that nr_pages is unconditionally assigned before it's used
> anywhere now at line 5298:
> 
> 	nr_pages = folio_nr_pages(folio);
> 	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> 	...
> 
> It's kinda weird to use nr_pages again after you go out of your way to
> avoid it using folio_nr_pages() in map_anon_folio_pte_nopf() and
> folio_order() in map_anon_folio_pte_pf().

Yes I believe so, thank you that is cleaner!  In the past I had nr_pages as a
variable but David suggested being explicit with the "1" so it would be obvious
its a single page... update_mmu_cache() should indicate the same thing.

> 
> But yeah, ok we align the address and it's yucky maybe leave for now (but
> we can definitely stop defaulting nr_pages to 1 :)
> 
>> +		goto unlock;
>>  	}
>>
>>  	/* Allocate our own private page. */
>> @@ -5267,11 +5305,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	 */
>>  	__folio_mark_uptodate(folio);
>>
>> -	entry = folio_mk_pte(folio, vma->vm_page_prot);
>> -	entry = pte_sw_mkyoung(entry);
>> -	if (vma->vm_flags & VM_WRITE)
>> -		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>> -
>>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (!vmf->pte)
>>  		goto release;
>> @@ -5293,19 +5326,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		folio_put(folio);
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>> -
>> -	folio_ref_add(folio, nr_pages - 1);
>> -	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> -	count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
>> -	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
>> -	folio_add_lru_vma(folio, vma);
>> -setpte:
>> -	if (vmf_orig_pte_uffd_wp(vmf))
>> -		entry = pte_mkuffd_wp(entry);
>> -	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>> -
>> -	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>> +	map_anon_folio_pte_pf(folio, vmf->pte, vma, addr,
>> +			      vmf_orig_pte_uffd_wp(vmf));
> 
> So we're going from:
> 
> entry = folio_mk_pte(...)
> entry = pte_sw_mkyoung(...)
> if (write)
> 	entry = pte_mkwrite(also dirty...)
> folio_ref_add(nr_pages - 1)
> add_mm_counter(... MM_ANON_PAGES, nr_pages)
> count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC)
> folio_add_new_anon_rmap(.., RMAP_EXCLUSIVE)
> folio_add_lru_vma(folio, vma)
> if (vmf_orig_pte_uffd_wp(vmf))
> 	entry = pte_mkuffd_wp(entry)
> set_ptes(mm, addr, vmf->pte, entry, nr_pages)
> update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages)
> 
> To:
> 
> <map_anon_folio_pte_nopf>
> entry = folio_mk_pte(...)
> entry = pte_sw_mkyoung(...)
> if (write)
> 	entry = pte_mkwrite(also dirty...)
> if (vmf_orig_pte_uffd_wp(vmf) <passed in via uffd_wp>) <-- reordered
> 	entry = pte_mkuffd_wp(entry)
> folio_ref_add(nr_pages - 1)
> folio_add_new_anon_rmap(.., RMAP_EXCLUSIVE)
> folio_add_lru_vma(folio, vma)
> set_ptes(mm, addr, pte, entry, nr_pages)
> update_mmu_cache_range(NULL, vma, addr, pte, nr_pages)
> 
> <map_anon_folio_pte_pf>
> add_mm_counter(... MM_ANON_PAGES, nr_pages) <-- reordered
> count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC) <-- reodrdered
> 
> But the reorderings seem fine, and it is achieving the same thing.
> 
> All the parameters being passed seem correct too.

Thanks for the review and verifying the logic :) I was particularly scared of
changing the page fault handler, so im glad multiple people have confirmed this
seems fine.

Cheers,
-- Nico

> 
>>  unlock:
>>  	if (vmf->pte)
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> --
>> 2.53.0
>>
> 
> Cheers, Lorenzo
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH mm-unstable v3 2/5] mm: introduce is_pmd_order helper
  2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
  2026-03-11 21:13 ` [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers Nico Pache
@ 2026-03-11 21:13 ` Nico Pache
  2026-03-11 21:13 ` [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1 Nico Pache
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2026-03-11 21:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

In order to add mTHP support to khugepaged, we will often be checking if a
given order is (or is not) a PMD order. Some places in the kernel already
use this check, so lets create a simple helper function to keep the code
clean and readable.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/huge_mm.h | 5 +++++
 mm/huge_memory.c        | 2 +-
 mm/khugepaged.c         | 6 +++---
 mm/memory.c             | 2 +-
 mm/mempolicy.c          | 2 +-
 mm/page_alloc.c         | 4 ++--
 mm/shmem.c              | 3 +--
 7 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..bd7f0e1d8094 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -771,6 +771,11 @@ static inline bool pmd_is_huge(pmd_t pmd)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool is_pmd_order(unsigned int order)
+{
+	return order == HPAGE_PMD_ORDER;
+}
+
 static inline int split_folio_to_list_to_order(struct folio *folio,
 		struct list_head *list, int new_order)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7d0a64033b18..9968cde6f820 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4122,7 +4122,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		i_mmap_unlock_read(mapping);
 out:
 	xas_destroy(&xas);
-	if (old_order == HPAGE_PMD_ORDER)
+	if (is_pmd_order(old_order))
 		count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	count_mthp_stat(old_order, !ret ? MTHP_STAT_SPLIT : MTHP_STAT_SPLIT_FAILED);
 	return ret;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7b4680d27ab..d3bdec4ec61b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1532,7 +1532,7 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
 	if (IS_ERR(folio))
 		return SCAN_PAGE_NULL;
 
-	if (folio_order(folio) != HPAGE_PMD_ORDER) {
+	if (!is_pmd_order(folio_order(folio))) {
 		result = SCAN_PAGE_COMPOUND;
 		goto drop_folio;
 	}
@@ -2015,7 +2015,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		 * we locked the first folio, then a THP might be there already.
 		 * This will be discovered on the first iteration.
 		 */
-		if (folio_order(folio) == HPAGE_PMD_ORDER) {
+		if (is_pmd_order(folio_order(folio))) {
 			result = SCAN_PTE_MAPPED_HUGEPAGE;
 			goto out_unlock;
 		}
@@ -2343,7 +2343,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
 			continue;
 		}
 
-		if (folio_order(folio) == HPAGE_PMD_ORDER) {
+		if (is_pmd_order(folio_order(folio))) {
 			result = SCAN_PTE_MAPPED_HUGEPAGE;
 			/*
 			 * PMD-sized THP implies that we can only try
diff --git a/mm/memory.c b/mm/memory.c
index 5c8bf1eb55f5..9ede3a31aa4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5479,7 +5479,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;
 
-	if (folio_order(folio) != HPAGE_PMD_ORDER)
+	if (!is_pmd_order(folio_order(folio)))
 		return ret;
 	page = &folio->page;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e5175f1c767..e5528c35bbb8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2449,7 +2449,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
 	    /* filter "hugepage" allocation, unless from alloc_pages() */
-	    order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) {
+	    is_pmd_order(order) && ilx != NO_INTERLEAVE_INDEX) {
 		/*
 		 * For hugepage allocation and non-interleave policy which
 		 * allows the current node (or other explicitly preferred
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75ee81445640..c500def16cc7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -652,7 +652,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	bool movable;
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
-		VM_BUG_ON(order != HPAGE_PMD_ORDER);
+		VM_BUG_ON(!is_pmd_order(order));
 
 		movable = migratetype == MIGRATE_MOVABLE;
 
@@ -684,7 +684,7 @@ static inline bool pcp_allowed_order(unsigned int order)
 	if (order <= PAGE_ALLOC_COSTLY_ORDER)
 		return true;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (order == HPAGE_PMD_ORDER)
+	if (is_pmd_order(order))
 		return true;
 #endif
 	return false;
diff --git a/mm/shmem.c b/mm/shmem.c
index d00044257401..4ecefe02881d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5532,8 +5532,7 @@ static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
 		spin_unlock(&huge_shmem_orders_lock);
 	} else if (sysfs_streq(buf, "inherit")) {
 		/* Do not override huge allocation policy with non-PMD sized mTHP */
-		if (shmem_huge == SHMEM_HUGE_FORCE &&
-		    order != HPAGE_PMD_ORDER)
+		if (shmem_huge == SHMEM_HUGE_FORCE && !is_pmd_order(order))
 			return -EINVAL;
 
 		spin_lock(&huge_shmem_orders_lock);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1
  2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
  2026-03-11 21:13 ` [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers Nico Pache
  2026-03-11 21:13 ` [PATCH mm-unstable v3 2/5] mm: introduce is_pmd_order helper Nico Pache
@ 2026-03-11 21:13 ` Nico Pache
  2026-03-16 18:18   ` Lorenzo Stoakes (Oracle)
  2026-03-11 21:13 ` [PATCH mm-unstable v3 4/5] mm/khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Nico Pache @ 2026-03-11 21:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

The value (HPAGE_PMD_NR - 1) is used often in the khugepaged code to
signify the limit of the max_ptes_* values. Add a define for this to
increase code readability and reuse.

Acked-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3bdec4ec61b..db77ab5b315e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -89,6 +89,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  *
  * Note that these are only respected if collapse was initiated by khugepaged.
  */
+#define KHUGEPAGED_MAX_PTES_LIMIT (HPAGE_PMD_NR - 1)
 unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly;
 static unsigned int khugepaged_max_ptes_shared __read_mostly;
@@ -259,7 +260,7 @@ static ssize_t max_ptes_none_store(struct kobject *kobj,
 	unsigned long max_ptes_none;
 
 	err = kstrtoul(buf, 10, &max_ptes_none);
-	if (err || max_ptes_none > HPAGE_PMD_NR - 1)
+	if (err || max_ptes_none > KHUGEPAGED_MAX_PTES_LIMIT)
 		return -EINVAL;
 
 	khugepaged_max_ptes_none = max_ptes_none;
@@ -284,7 +285,7 @@ static ssize_t max_ptes_swap_store(struct kobject *kobj,
 	unsigned long max_ptes_swap;
 
 	err  = kstrtoul(buf, 10, &max_ptes_swap);
-	if (err || max_ptes_swap > HPAGE_PMD_NR - 1)
+	if (err || max_ptes_swap > KHUGEPAGED_MAX_PTES_LIMIT)
 		return -EINVAL;
 
 	khugepaged_max_ptes_swap = max_ptes_swap;
@@ -310,7 +311,7 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj,
 	unsigned long max_ptes_shared;
 
 	err  = kstrtoul(buf, 10, &max_ptes_shared);
-	if (err || max_ptes_shared > HPAGE_PMD_NR - 1)
+	if (err || max_ptes_shared > KHUGEPAGED_MAX_PTES_LIMIT)
 		return -EINVAL;
 
 	khugepaged_max_ptes_shared = max_ptes_shared;
@@ -382,7 +383,7 @@ int __init khugepaged_init(void)
 		return -ENOMEM;
 
 	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
-	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
+	khugepaged_max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
 	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
 	khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;
 
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1
  2026-03-11 21:13 ` [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1 Nico Pache
@ 2026-03-16 18:18   ` Lorenzo Stoakes (Oracle)
  2026-03-18 16:50     ` Nico Pache
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 18:18 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, Mar 11, 2026 at 03:13:13PM -0600, Nico Pache wrote:
> The value (HPAGE_PMD_NR - 1) is used often in the khugepaged code to
> signify the limit of the max_ptes_* values. Add a define for this to
> increase code readability and reuse.
>
> Acked-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

Hm didn't I suggest this? Or actually I can't remember :P

Anyway LGTM, so:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d3bdec4ec61b..db77ab5b315e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -89,6 +89,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
>   *
>   * Note that these are only respected if collapse was initiated by khugepaged.
>   */
> +#define KHUGEPAGED_MAX_PTES_LIMIT (HPAGE_PMD_NR - 1)
>  unsigned int khugepaged_max_ptes_none __read_mostly;
>  static unsigned int khugepaged_max_ptes_swap __read_mostly;
>  static unsigned int khugepaged_max_ptes_shared __read_mostly;
> @@ -259,7 +260,7 @@ static ssize_t max_ptes_none_store(struct kobject *kobj,
>  	unsigned long max_ptes_none;
>
>  	err = kstrtoul(buf, 10, &max_ptes_none);
> -	if (err || max_ptes_none > HPAGE_PMD_NR - 1)
> +	if (err || max_ptes_none > KHUGEPAGED_MAX_PTES_LIMIT)
>  		return -EINVAL;
>
>  	khugepaged_max_ptes_none = max_ptes_none;
> @@ -284,7 +285,7 @@ static ssize_t max_ptes_swap_store(struct kobject *kobj,
>  	unsigned long max_ptes_swap;
>
>  	err  = kstrtoul(buf, 10, &max_ptes_swap);
> -	if (err || max_ptes_swap > HPAGE_PMD_NR - 1)
> +	if (err || max_ptes_swap > KHUGEPAGED_MAX_PTES_LIMIT)
>  		return -EINVAL;
>
>  	khugepaged_max_ptes_swap = max_ptes_swap;
> @@ -310,7 +311,7 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj,
>  	unsigned long max_ptes_shared;
>
>  	err  = kstrtoul(buf, 10, &max_ptes_shared);
> -	if (err || max_ptes_shared > HPAGE_PMD_NR - 1)
> +	if (err || max_ptes_shared > KHUGEPAGED_MAX_PTES_LIMIT)
>  		return -EINVAL;
>
>  	khugepaged_max_ptes_shared = max_ptes_shared;
> @@ -382,7 +383,7 @@ int __init khugepaged_init(void)
>  		return -ENOMEM;
>
>  	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
> -	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
> +	khugepaged_max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>  	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
>  	khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;
>
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1
  2026-03-16 18:18   ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 16:50     ` Nico Pache
  0 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2026-03-18 16:50 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 3/16/26 12:18 PM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Mar 11, 2026 at 03:13:13PM -0600, Nico Pache wrote:
>> The value (HPAGE_PMD_NR - 1) is used often in the khugepaged code to
>> signify the limit of the max_ptes_* values. Add a define for this to
>> increase code readability and reuse.
>>
>> Acked-by: Pedro Falcato <pfalcato@suse.de>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> Hm didn't I suggest this? Or actually I can't remember :P

I think you suggested most of these patches ;P Ill make sure to add your SB tag
on the followups!

> 
> Anyway LGTM, so:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thank you!

> 
>> ---
>>  mm/khugepaged.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d3bdec4ec61b..db77ab5b315e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -89,6 +89,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
>>   *
>>   * Note that these are only respected if collapse was initiated by khugepaged.
>>   */
>> +#define KHUGEPAGED_MAX_PTES_LIMIT (HPAGE_PMD_NR - 1)
>>  unsigned int khugepaged_max_ptes_none __read_mostly;
>>  static unsigned int khugepaged_max_ptes_swap __read_mostly;
>>  static unsigned int khugepaged_max_ptes_shared __read_mostly;
>> @@ -259,7 +260,7 @@ static ssize_t max_ptes_none_store(struct kobject *kobj,
>>  	unsigned long max_ptes_none;
>>
>>  	err = kstrtoul(buf, 10, &max_ptes_none);
>> -	if (err || max_ptes_none > HPAGE_PMD_NR - 1)
>> +	if (err || max_ptes_none > KHUGEPAGED_MAX_PTES_LIMIT)
>>  		return -EINVAL;
>>
>>  	khugepaged_max_ptes_none = max_ptes_none;
>> @@ -284,7 +285,7 @@ static ssize_t max_ptes_swap_store(struct kobject *kobj,
>>  	unsigned long max_ptes_swap;
>>
>>  	err  = kstrtoul(buf, 10, &max_ptes_swap);
>> -	if (err || max_ptes_swap > HPAGE_PMD_NR - 1)
>> +	if (err || max_ptes_swap > KHUGEPAGED_MAX_PTES_LIMIT)
>>  		return -EINVAL;
>>
>>  	khugepaged_max_ptes_swap = max_ptes_swap;
>> @@ -310,7 +311,7 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj,
>>  	unsigned long max_ptes_shared;
>>
>>  	err  = kstrtoul(buf, 10, &max_ptes_shared);
>> -	if (err || max_ptes_shared > HPAGE_PMD_NR - 1)
>> +	if (err || max_ptes_shared > KHUGEPAGED_MAX_PTES_LIMIT)
>>  		return -EINVAL;
>>
>>  	khugepaged_max_ptes_shared = max_ptes_shared;
>> @@ -382,7 +383,7 @@ int __init khugepaged_init(void)
>>  		return -ENOMEM;
>>
>>  	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
>> -	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
>> +	khugepaged_max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>>  	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
>>  	khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;
>>
>> --
>> 2.53.0
>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH mm-unstable v3 4/5] mm/khugepaged: rename hpage_collapse_* to collapse_*
  2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
                   ` (2 preceding siblings ...)
  2026-03-11 21:13 ` [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1 Nico Pache
@ 2026-03-11 21:13 ` Nico Pache
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
  2026-03-11 21:34 ` [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Andrew Morton
  5 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2026-03-11 21:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

The hpage_collapse functions describe functions used by madvise_collapse
and khugepaged. remove the unnecessary hpage prefix to shorten the
function name.

Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 60 ++++++++++++++++++++++++-------------------------
 mm/mremap.c     |  2 +-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index db77ab5b315e..33ae56e313ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -395,14 +395,14 @@ void __init khugepaged_destroy(void)
 	kmem_cache_destroy(mm_slot_cache);
 }
 
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int collapse_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 {
-	return hpage_collapse_test_exit(mm) ||
+	return collapse_test_exit(mm) ||
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
@@ -436,7 +436,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 	int wakeup;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	VM_BUG_ON_MM(collapse_test_exit(mm), mm);
 	if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
 		return;
 
@@ -490,7 +490,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (slot) {
 		/*
 		 * This is required to serialize against
-		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * collapse_test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we return all
 		 * pagetables will be destroyed) until khugepaged has finished
 		 * working on the pagetables under the mmap_lock.
@@ -585,7 +585,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 
-		/* See hpage_collapse_scan_pmd(). */
+		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
 			if (cc->is_khugepaged &&
@@ -836,7 +836,7 @@ static struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 };
 
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -871,7 +871,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -890,7 +890,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -909,7 +909,7 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l
 	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
 				 TVA_FORCED_COLLAPSE;
 
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -980,7 +980,7 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1066,7 +1066,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
-	int node = hpage_collapse_find_target_node(cc);
+	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
 	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1244,7 +1244,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	return result;
 }
 
-static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
+static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		bool *mmap_locked, struct collapse_control *cc)
 {
@@ -1365,7 +1365,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1431,7 +1431,7 @@ static void collect_mm_slot(struct mm_slot *slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (collapse_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
@@ -1786,7 +1786,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 
-		if (hpage_collapse_test_exit(mm))
+		if (collapse_test_exit(mm))
 			continue;
 
 		if (!file_backed_vma_is_retractable(vma))
@@ -2302,7 +2302,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
+static enum scan_result collapse_scan_file(struct mm_struct *mm,
 		unsigned long addr, struct file *file, pgoff_t start,
 		struct collapse_control *cc)
 {
@@ -2355,7 +2355,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
 		}
 
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			folio_put(folio);
 			break;
@@ -2409,7 +2409,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
 	return result;
 }
 
-static void khugepaged_scan_mm_slot(unsigned int progress_max,
+static void collapse_scan_mm_slot(unsigned int progress_max,
 		enum scan_result *result, struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2443,7 +2443,7 @@ static void khugepaged_scan_mm_slot(unsigned int progress_max,
 		goto breakouterloop_mmap_lock;
 
 	cc->progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2451,7 +2451,7 @@ static void khugepaged_scan_mm_slot(unsigned int progress_max,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+		if (unlikely(collapse_test_exit_or_disable(mm))) {
 			cc->progress++;
 			break;
 		}
@@ -2473,7 +2473,7 @@ static void khugepaged_scan_mm_slot(unsigned int progress_max,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2486,12 +2486,12 @@ static void khugepaged_scan_mm_slot(unsigned int progress_max,
 
 				mmap_read_unlock(mm);
 				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
+				*result = collapse_scan_file(mm,
 					khugepaged_scan.address, file, pgoff, cc);
 				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
+					if (collapse_test_exit_or_disable(mm))
 						goto breakouterloop;
 					*result = try_collapse_pte_mapped_thp(mm,
 						khugepaged_scan.address, false);
@@ -2500,7 +2500,7 @@ static void khugepaged_scan_mm_slot(unsigned int progress_max,
 					mmap_read_unlock(mm);
 				}
 			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
+				*result = collapse_scan_pmd(mm, vma,
 					khugepaged_scan.address, &mmap_locked, cc);
 			}
 
@@ -2532,7 +2532,7 @@ static void khugepaged_scan_mm_slot(unsigned int progress_max,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm, or THP got disabled.
 	 */
-	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
+	if (collapse_test_exit_or_disable(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2585,7 +2585,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 			pass_through_head++;
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
-			khugepaged_scan_mm_slot(progress_max, &result, cc);
+			collapse_scan_mm_slot(progress_max, &result, cc);
 		else
 			cc->progress = progress_max;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2830,8 +2830,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_read_unlock(mm);
 			mmap_locked = false;
 			*lock_dropped = true;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
+			result = collapse_scan_file(mm, addr, file, pgoff, cc);
 
 			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
 			    mapping_can_writeback(file->f_mapping)) {
@@ -2845,8 +2844,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			}
 			fput(file);
 		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
+			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
 		}
 		if (!mmap_locked)
 			*lock_dropped = true;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2be876a70cc0..eb222af91c2d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -244,7 +244,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		goto out;
 	}
 	/*
-	 * Now new_pte is none, so hpage_collapse_scan_file() path can not find
+	 * Now new_pte is none, so collapse_scan_file() path can not find
 	 * this by traversing file->f_mapping, so there is no concurrency with
 	 * retract_page_tables(). In addition, we already hold the exclusive
 	 * mmap_lock, so this new_pte page is stable, so there is no need to get
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
                   ` (3 preceding siblings ...)
  2026-03-11 21:13 ` [PATCH mm-unstable v3 4/5] mm/khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2026-03-11 21:13 ` Nico Pache
  2026-03-12  2:04   ` Wei Yang
                     ` (4 more replies)
  2026-03-11 21:34 ` [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Andrew Morton
  5 siblings, 5 replies; 20+ messages in thread
From: Nico Pache @ 2026-03-11 21:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing. Create collapse_single_pmd
to increase code reuse and create an entry point to these two users.

Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. This introduces a minor behavioral change
that is most likely an undiscovered bug. The current implementation of
khugepaged tests collapse_test_exit_or_disable before calling
collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
case. By unifying these two callers madvise_collapse now also performs
this check. We also modify the return value to be SCAN_ANY_PROCESS which
properly indicates that this process is no longer valid to operate on.

By moving the madvise_collapse writeback-retry logic into the helper
function we can also avoid having to revalidate the VMA.

We also guard the khugepaged_pages_collapsed variable to ensure its only
incremented for khugepaged.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 120 +++++++++++++++++++++++++-----------------------
 1 file changed, 63 insertions(+), 57 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 33ae56e313ed..733c4a42c2ce 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2409,6 +2409,65 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 	return result;
 }
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static enum scan_result collapse_single_pmd(unsigned long addr,
+		struct vm_area_struct *vma, bool *mmap_locked,
+		struct collapse_control *cc)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool triggered_wb = false;
+	enum scan_result result;
+	struct file *file;
+	pgoff_t pgoff;
+
+	if (vma_is_anonymous(vma)) {
+		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
+		goto end;
+	}
+
+	file = get_file(vma->vm_file);
+	pgoff = linear_page_index(vma, addr);
+
+	mmap_read_unlock(mm);
+	*mmap_locked = false;
+retry:
+	result = collapse_scan_file(mm, addr, file, pgoff, cc);
+
+	/*
+	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
+	 * then retry the collapse one time.
+	 */
+	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
+	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
+		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+		triggered_wb = true;
+		goto retry;
+	}
+	fput(file);
+
+	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+		mmap_read_lock(mm);
+		if (collapse_test_exit_or_disable(mm))
+			result = SCAN_ANY_PROCESS;
+		else
+			result = try_collapse_pte_mapped_thp(mm, addr,
+							     !cc->is_khugepaged);
+		if (result == SCAN_PMD_MAPPED)
+			result = SCAN_SUCCEED;
+		mmap_read_unlock(mm);
+	}
+end:
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+	return result;
+}
+
 static void collapse_scan_mm_slot(unsigned int progress_max,
 		enum scan_result *result, struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2479,34 +2538,9 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = try_collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						      vma, &mmap_locked, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			if (!mmap_locked)
@@ -2806,9 +2840,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;
-		bool triggered_wb = false;
 
-retry:
 		if (!mmap_locked) {
 			cond_resched();
 			mmap_read_lock(mm);
@@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}
 		mmap_assert_locked(mm);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			*lock_dropped = true;
-			result = collapse_scan_file(mm, addr, file, pgoff, cc);
-
-			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
-			    mapping_can_writeback(file->f_mapping)) {
-				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
-				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
 
-				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
-				triggered_wb = true;
-				fput(file);
-				goto retry;
-			}
-			fput(file);
-		} else {
-			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
-		}
 		if (!mmap_locked)
 			*lock_dropped = true;
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = try_collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_NO_PTE_TABLE:
+		case SCAN_PTE_MAPPED_HUGEPAGE:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
 		case SCAN_LACK_REFERENCED_PAGE:
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
@ 2026-03-12  2:04   ` Wei Yang
  2026-03-18 16:54     ` Nico Pache
  2026-03-12  9:27   ` David Hildenbrand (Arm)
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Wei Yang @ 2026-03-12  2:04 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
[..]
>@@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> 		}
> 		mmap_assert_locked(mm);
>-		if (!vma_is_anonymous(vma)) {
>-			struct file *file = get_file(vma->vm_file);
>-			pgoff_t pgoff = linear_page_index(vma, addr);
> 
>-			mmap_read_unlock(mm);
>-			mmap_locked = false;
>-			*lock_dropped = true;
>-			result = collapse_scan_file(mm, addr, file, pgoff, cc);
>-
>-			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>-			    mapping_can_writeback(file->f_mapping)) {
>-				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>-				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>+		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> 
>-				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
>-				triggered_wb = true;
>-				fput(file);
>-				goto retry;
>-			}
>-			fput(file);
>-		} else {
>-			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
>-		}
> 		if (!mmap_locked)
> 			*lock_dropped = true;
> 
>-handle_result:
> 		switch (result) {
> 		case SCAN_SUCCEED:
> 		case SCAN_PMD_MAPPED:
> 			++thps;
> 			break;
>-		case SCAN_PTE_MAPPED_HUGEPAGE:
>-			BUG_ON(mmap_locked);
>-			mmap_read_lock(mm);
>-			result = try_collapse_pte_mapped_thp(mm, addr, true);
>-			mmap_read_unlock(mm);
>-			goto handle_result;
> 		/* Whitelisted set of results where continuing OK */
> 		case SCAN_NO_PTE_TABLE:
>+		case SCAN_PTE_MAPPED_HUGEPAGE:

It looks we won't have this case after refactor?

Current code flow is like this:

  result = collapse_single_pmd()
      result = collapse_scan_file()
          result = collapse_file()

      if (result == SCAN_PTE_MAPPED_HUGEPAGE) {            --- (1)
          result = SCAN_ANY_PROCESS;
	  or 
	  result = try_collapse_pte_mapped_thp();
      }

Only collapse_scan_file() and collapse_file() may return
SCAN_PTE_MAPPED_HUGEPAGE, and then handled in (1). After this, result is set
to another value to indicate whether we collapse it or not.

So I am afraid we don't expect to see SCAN_PTE_MAPPED_HUGEPAGE here. Do I miss
something?

> 		case SCAN_PTE_NON_PRESENT:
> 		case SCAN_PTE_UFFD_WP:
> 		case SCAN_LACK_REFERENCED_PAGE:
>-- 
>2.53.0
>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-12  2:04   ` Wei Yang
@ 2026-03-18 16:54     ` Nico Pache
  2026-03-20  7:53       ` Wei Yang
  0 siblings, 1 reply; 20+ messages in thread
From: Nico Pache @ 2026-03-18 16:54 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 3/11/26 8:04 PM, Wei Yang wrote:
> On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
> [..]
>> @@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>> 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>> 		}
>> 		mmap_assert_locked(mm);
>> -		if (!vma_is_anonymous(vma)) {
>> -			struct file *file = get_file(vma->vm_file);
>> -			pgoff_t pgoff = linear_page_index(vma, addr);
>>
>> -			mmap_read_unlock(mm);
>> -			mmap_locked = false;
>> -			*lock_dropped = true;
>> -			result = collapse_scan_file(mm, addr, file, pgoff, cc);
>> -
>> -			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>> -			    mapping_can_writeback(file->f_mapping)) {
>> -				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>> -				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>> +		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
>>
>> -				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
>> -				triggered_wb = true;
>> -				fput(file);
>> -				goto retry;
>> -			}
>> -			fput(file);
>> -		} else {
>> -			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
>> -		}
>> 		if (!mmap_locked)
>> 			*lock_dropped = true;
>>
>> -handle_result:
>> 		switch (result) {
>> 		case SCAN_SUCCEED:
>> 		case SCAN_PMD_MAPPED:
>> 			++thps;
>> 			break;
>> -		case SCAN_PTE_MAPPED_HUGEPAGE:
>> -			BUG_ON(mmap_locked);
>> -			mmap_read_lock(mm);
>> -			result = try_collapse_pte_mapped_thp(mm, addr, true);
>> -			mmap_read_unlock(mm);
>> -			goto handle_result;
>> 		/* Whitelisted set of results where continuing OK */
>> 		case SCAN_NO_PTE_TABLE:
>> +		case SCAN_PTE_MAPPED_HUGEPAGE:
> 
> It looks we won't have this case after refactor?
> 
> Current code flow is like this:
> 
>   result = collapse_single_pmd()
>       result = collapse_scan_file()
>           result = collapse_file()
> 
>       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {            --- (1)
>           result = SCAN_ANY_PROCESS;
> 	  or 
> 	  result = try_collapse_pte_mapped_thp();
>       }
> 
> Only collapse_scan_file() and collapse_file() may return
> SCAN_PTE_MAPPED_HUGEPAGE, and then handled in (1). After this, result is set
> to another value to indicate whether we collapse it or not.
> 
> So I am afraid we don't expect to see SCAN_PTE_MAPPED_HUGEPAGE here. Do I miss
> something?

No your assessment is correct, should I remove it from the list? I've been quite
confused about requests to list all the available ENUMs, does that mean we want
all the enums that are reachable or all the enums that are available as a
result? Im guessing the former based on your comment.

Cheers,
-- Nico

> 
>> 		case SCAN_PTE_NON_PRESENT:
>> 		case SCAN_PTE_UFFD_WP:
>> 		case SCAN_LACK_REFERENCED_PAGE:
>> -- 
>> 2.53.0
>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-18 16:54     ` Nico Pache
@ 2026-03-20  7:53       ` Wei Yang
  0 siblings, 0 replies; 20+ messages in thread
From: Wei Yang @ 2026-03-20  7:53 UTC (permalink / raw)
  To: Nico Pache
  Cc: Wei Yang, linux-kernel, linux-mm, aarcange, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, Mar 18, 2026 at 10:54:17AM -0600, Nico Pache wrote:
>
>
>On 3/11/26 8:04 PM, Wei Yang wrote:
>> On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
>> [..]
>>> @@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>> 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>>> 		}
>>> 		mmap_assert_locked(mm);
>>> -		if (!vma_is_anonymous(vma)) {
>>> -			struct file *file = get_file(vma->vm_file);
>>> -			pgoff_t pgoff = linear_page_index(vma, addr);
>>>
>>> -			mmap_read_unlock(mm);
>>> -			mmap_locked = false;
>>> -			*lock_dropped = true;
>>> -			result = collapse_scan_file(mm, addr, file, pgoff, cc);
>>> -
>>> -			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>>> -			    mapping_can_writeback(file->f_mapping)) {
>>> -				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>>> -				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>>> +		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
>>>
>>> -				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
>>> -				triggered_wb = true;
>>> -				fput(file);
>>> -				goto retry;
>>> -			}
>>> -			fput(file);
>>> -		} else {
>>> -			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
>>> -		}
>>> 		if (!mmap_locked)
>>> 			*lock_dropped = true;
>>>
>>> -handle_result:
>>> 		switch (result) {
>>> 		case SCAN_SUCCEED:
>>> 		case SCAN_PMD_MAPPED:
>>> 			++thps;
>>> 			break;
>>> -		case SCAN_PTE_MAPPED_HUGEPAGE:
>>> -			BUG_ON(mmap_locked);
>>> -			mmap_read_lock(mm);
>>> -			result = try_collapse_pte_mapped_thp(mm, addr, true);
>>> -			mmap_read_unlock(mm);
>>> -			goto handle_result;
>>> 		/* Whitelisted set of results where continuing OK */
>>> 		case SCAN_NO_PTE_TABLE:
>>> +		case SCAN_PTE_MAPPED_HUGEPAGE:
>> 
>> It looks we won't have this case after refactor?
>> 
>> Current code flow is like this:
>> 
>>   result = collapse_single_pmd()
>>       result = collapse_scan_file()
>>           result = collapse_file()
>> 
>>       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {            --- (1)
>>           result = SCAN_ANY_PROCESS;
>> 	  or 
>> 	  result = try_collapse_pte_mapped_thp();
>>       }
>> 
>> Only collapse_scan_file() and collapse_file() may return
>> SCAN_PTE_MAPPED_HUGEPAGE, and then handled in (1). After this, result is set
>> to another value to indicate whether we collapse it or not.
>> 
>> So I am afraid we don't expect to see SCAN_PTE_MAPPED_HUGEPAGE here. Do I miss
>> something?
>
>No your assessment is correct, should I remove it from the list? I've been quite
>confused about requests to list all the available ENUMs, does that mean we want
>all the enums that are reachable or all the enums that are available as a
>result? Im guessing the former based on your comment.
>

If it is me, I would remove it :-) Otherwise, I will think
collapse_single_pmd() may return SCAN_PTE_MAPPED_HUGEPAGE.

But, yeah, this is really trivial.

>Cheers,
>-- Nico
>
>> 
>>> 		case SCAN_PTE_NON_PRESENT:
>>> 		case SCAN_PTE_UFFD_WP:
>>> 		case SCAN_LACK_REFERENCED_PAGE:
>>> -- 
>>> 2.53.0
>>>
>> 

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
  2026-03-12  2:04   ` Wei Yang
@ 2026-03-12  9:27   ` David Hildenbrand (Arm)
  2026-03-13  8:33   ` Baolin Wang
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12  9:27 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jackmanb, jack, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 3/11/26 22:13, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing. Create collapse_single_pmd
> to increase code reuse and create an entry point to these two users.
> 
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that is most likely an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check. We also modify the return value to be SCAN_ANY_PROCESS which
> properly indicates that this process is no longer valid to operate on.
> 
> By moving the madvise_collapse writeback-retry logic into the helper
> function we can also avoid having to revalidate the VMA.
> 
> We also guard the khugepaged_pages_collapsed variable to ensure its only
> incremented for khugepaged.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 120 +++++++++++++++++++++++++-----------------------
>  1 file changed, 63 insertions(+), 57 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 33ae56e313ed..733c4a42c2ce 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2409,6 +2409,65 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  	return result;
>  }
>  
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static enum scan_result collapse_single_pmd(unsigned long addr,
> +		struct vm_area_struct *vma, bool *mmap_locked,
> +		struct collapse_control *cc)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	bool triggered_wb = false;
> +	enum scan_result result;
> +	struct file *file;
> +	pgoff_t pgoff;
> +
> +	if (vma_is_anonymous(vma)) {
> +		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> +		goto end;
> +	}
> +
> +	file = get_file(vma->vm_file);
> +	pgoff = linear_page_index(vma, addr);
> +
> +	mmap_read_unlock(mm);
> +	*mmap_locked = false;
> +retry:
> +	result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +
> +	/*
> +	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
> +	 * then retry the collapse one time.
> +	 */
> +	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> +	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
> +		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> +		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> +		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> +		triggered_wb = true;
> +		goto retry;
> +	}
> +	fput(file);
> +
> +	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +		mmap_read_lock(mm);
> +		if (collapse_test_exit_or_disable(mm))
> +			result = SCAN_ANY_PROCESS;
> +		else
> +			result = try_collapse_pte_mapped_thp(mm, addr,
> +							     !cc->is_khugepaged);
> +		if (result == SCAN_PMD_MAPPED)
> +			result = SCAN_SUCCEED;
> +		mmap_read_unlock(mm);
> +	}
> +end:
> +	if (cc->is_khugepaged && result == SCAN_SUCCEED)
> +		++khugepaged_pages_collapsed;
> +	return result;
> +}
> +
>  static void collapse_scan_mm_slot(unsigned int progress_max,
>  		enum scan_result *result, struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
> @@ -2479,34 +2538,9 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
>  				  hend);
> -			if (!vma_is_anonymous(vma)) {
> -				struct file *file = get_file(vma->vm_file);
> -				pgoff_t pgoff = linear_page_index(vma,
> -						khugepaged_scan.address);
> -
> -				mmap_read_unlock(mm);
> -				mmap_locked = false;
> -				*result = collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> -				fput(file);
> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> -					mmap_read_lock(mm);
> -					if (collapse_test_exit_or_disable(mm))
> -						goto breakouterloop;
> -					*result = try_collapse_pte_mapped_thp(mm,
> -						khugepaged_scan.address, false);
> -					if (*result == SCAN_PMD_MAPPED)
> -						*result = SCAN_SUCCEED;
> -					mmap_read_unlock(mm);
> -				}
> -			} else {
> -				*result = collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> -			}
> -
> -			if (*result == SCAN_SUCCEED)
> -				++khugepaged_pages_collapsed;
>  
> +			*result = collapse_single_pmd(khugepaged_scan.address,
> +						      vma, &mmap_locked, cc);
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
>  			if (!mmap_locked)
> @@ -2806,9 +2840,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  
>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>  		enum scan_result result = SCAN_FAIL;
> -		bool triggered_wb = false;
>  
> -retry:
>  		if (!mmap_locked) {
>  			cond_resched();
>  			mmap_read_lock(mm);
> @@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>  		}
>  		mmap_assert_locked(mm);
> -		if (!vma_is_anonymous(vma)) {
> -			struct file *file = get_file(vma->vm_file);
> -			pgoff_t pgoff = linear_page_index(vma, addr);
>  
> -			mmap_read_unlock(mm);
> -			mmap_locked = false;
> -			*lock_dropped = true;

Okay, we can get rid of this because collapse_single_pmd() will never
drop the lock, to retake it, returning with "mmap_locked == true".

If it dropped the lock, even if it relocked, it will always return with
the lock dropped and "mmap_locked == false".

The "if (!mmap_locked)" will properly set "*lock_dropped = true;".

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
  2026-03-12  2:04   ` Wei Yang
  2026-03-12  9:27   ` David Hildenbrand (Arm)
@ 2026-03-13  8:33   ` Baolin Wang
  2026-03-15 15:16   ` Lance Yang
  2026-03-16 18:54   ` Lorenzo Stoakes (Oracle)
  4 siblings, 0 replies; 20+ messages in thread
From: Baolin Wang @ 2026-03-13  8:33 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe



On 3/12/26 5:13 AM, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing. Create collapse_single_pmd
> to increase code reuse and create an entry point to these two users.
> 
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that is most likely an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check. We also modify the return value to be SCAN_ANY_PROCESS which
> properly indicates that this process is no longer valid to operate on.
> 
> By moving the madvise_collapse writeback-retry logic into the helper
> function we can also avoid having to revalidate the VMA.
> 
> We also guard the khugepaged_pages_collapsed variable to ensure its only
> incremented for khugepaged.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
                     ` (2 preceding siblings ...)
  2026-03-13  8:33   ` Baolin Wang
@ 2026-03-15 15:16   ` Lance Yang
  2026-03-16 18:54   ` Lorenzo Stoakes (Oracle)
  4 siblings, 0 replies; 20+ messages in thread
From: Lance Yang @ 2026-03-15 15:16 UTC (permalink / raw)
  To: npache
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe


On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
>The khugepaged daemon and madvise_collapse have two different
>implementations that do almost the same thing. Create collapse_single_pmd
>to increase code reuse and create an entry point to these two users.
>
>Refactor madvise_collapse and collapse_scan_mm_slot to use the new
>collapse_single_pmd function. This introduces a minor behavioral change
>that is most likely an undiscovered bug. The current implementation of
>khugepaged tests collapse_test_exit_or_disable before calling
>collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
>case. By unifying these two callers madvise_collapse now also performs
>this check. We also modify the return value to be SCAN_ANY_PROCESS which
>properly indicates that this process is no longer valid to operate on.
>
>By moving the madvise_collapse writeback-retry logic into the helper
>function we can also avoid having to revalidate the VMA.
>
>We also guard the khugepaged_pages_collapsed variable to ensure its only
>incremented for khugepaged.
>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 120 +++++++++++++++++++++++++-----------------------
> 1 file changed, 63 insertions(+), 57 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 33ae56e313ed..733c4a42c2ce 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -2409,6 +2409,65 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> 	return result;
> }
> 
>+/*
>+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
>+ * the results.
>+ */
>+static enum scan_result collapse_single_pmd(unsigned long addr,
>+		struct vm_area_struct *vma, bool *mmap_locked,
>+		struct collapse_control *cc)
>+{
>+	struct mm_struct *mm = vma->vm_mm;
>+	bool triggered_wb = false;
>+	enum scan_result result;
>+	struct file *file;
>+	pgoff_t pgoff;
>+
>+	if (vma_is_anonymous(vma)) {
>+		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
>+		goto end;
>+	}
>+
>+	file = get_file(vma->vm_file);
>+	pgoff = linear_page_index(vma, addr);
>+
>+	mmap_read_unlock(mm);
>+	*mmap_locked = false;
>+retry:
>+	result = collapse_scan_file(mm, addr, file, pgoff, cc);
>+
>+	/*
>+	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
>+	 * then retry the collapse one time.
>+	 */
>+	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
>+	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
>+		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>+		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>+
>+		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
>+		triggered_wb = true;
>+		goto retry;

While the old retry path did go back through hugepage_vma_revalidate(),
the retry itself is not relying on the original VMA remaining unchanged
IIUC.

After dropping mmap_lock, the code still holds a reference to the file,
so no lifetime issue should arise here :)

So, LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>

Cheers,
Lance


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
                     ` (3 preceding siblings ...)
  2026-03-15 15:16   ` Lance Yang
@ 2026-03-16 18:54   ` Lorenzo Stoakes (Oracle)
  2026-03-18 17:22     ` Nico Pache
  4 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 18:54 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing. Create collapse_single_pmd
> to increase code reuse and create an entry point to these two users.

Ah this is nice :) Thanks!

>
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that is most likely an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check. We also modify the return value to be SCAN_ANY_PROCESS which
> properly indicates that this process is no longer valid to operate on.
>
> By moving the madvise_collapse writeback-retry logic into the helper
> function we can also avoid having to revalidate the VMA.
>
> We also guard the khugepaged_pages_collapsed variable to ensure its only
> incremented for khugepaged.
>
> Signed-off-by: Nico Pache <npache@redhat.com>

The logic all seems correct to me, just a bunch of nits below really. This is
a really nice refactoring! :)

With them addressed:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Cheers, Lorenzo

> ---
>  mm/khugepaged.c | 120 +++++++++++++++++++++++++-----------------------
>  1 file changed, 63 insertions(+), 57 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 33ae56e313ed..733c4a42c2ce 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2409,6 +2409,65 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  	return result;
>  }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static enum scan_result collapse_single_pmd(unsigned long addr,
> +		struct vm_area_struct *vma, bool *mmap_locked,

mmap_locked seems mildly pointless here, and it's a semi-code smell to pass 'is
locked' flags I think.

You never read this, but the parameter implies somebody might pass in mmaplocked
== false, but you know it's always true here.

Anyway I think it makes more sense to pass in lock_dropped and get rid of
mmap_locked in madvise_collapse() and just pass in lock_dropped directly
(setting it false if anon).

Also obviously update collapse_scan_mm_slot() to use lock_dropped instead just
inverted.

That's clearer I think since it makes it a verb rather than a noun and the
function is dictating whether or not the lock is dropped, it also implies the
lock is held on entry.

> +		struct collapse_control *cc)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	bool triggered_wb = false;
> +	enum scan_result result;
> +	struct file *file;
> +	pgoff_t pgoff;
> +

Maybe move the mmap_assert_locked() from madvise_collapse() to here? Then we
assert it in both cases.

> +	if (vma_is_anonymous(vma)) {
> +		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> +		goto end;
> +	}
> +
> +	file = get_file(vma->vm_file);
> +	pgoff = linear_page_index(vma, addr);
> +
> +	mmap_read_unlock(mm);
> +	*mmap_locked = false;
> +retry:
> +	result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +
> +	/*
> +	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
> +	 * then retry the collapse one time.
> +	 */
> +	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> +	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
> +		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> +		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> +		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> +		triggered_wb = true;
> +		goto retry;

Thinking through this logic I do agree that we don't need to revalidate here,
which should be quite a nice win, I just don't know why we previously assumed
we'd have to... or maybe it was just because it became too spaghetti to goto
around it somehow??

> +	}
> +	fput(file);
> +
> +	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +		mmap_read_lock(mm);
> +		if (collapse_test_exit_or_disable(mm))
> +			result = SCAN_ANY_PROCESS;
> +		else
> +			result = try_collapse_pte_mapped_thp(mm, addr,
> +							     !cc->is_khugepaged);
> +		if (result == SCAN_PMD_MAPPED)
> +			result = SCAN_SUCCEED;
> +		mmap_read_unlock(mm);
> +	}
> +end:
> +	if (cc->is_khugepaged && result == SCAN_SUCCEED)
> +		++khugepaged_pages_collapsed;
> +	return result;
> +}
> +
>  static void collapse_scan_mm_slot(unsigned int progress_max,
>  		enum scan_result *result, struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
> @@ -2479,34 +2538,9 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
>  				  hend);

Nice-to-have, but could we convert these VM_BUG_ON()'s to VM_WARN_ON_ONCE()'s
while we're passing?

> -			if (!vma_is_anonymous(vma)) {
> -				struct file *file = get_file(vma->vm_file);
> -				pgoff_t pgoff = linear_page_index(vma,
> -						khugepaged_scan.address);
> -
> -				mmap_read_unlock(mm);
> -				mmap_locked = false;
> -				*result = collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> -				fput(file);
> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> -					mmap_read_lock(mm);
> -					if (collapse_test_exit_or_disable(mm))
> -						goto breakouterloop;
> -					*result = try_collapse_pte_mapped_thp(mm,
> -						khugepaged_scan.address, false);
> -					if (*result == SCAN_PMD_MAPPED)
> -						*result = SCAN_SUCCEED;
> -					mmap_read_unlock(mm);
> -				}
> -			} else {
> -				*result = collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> -			}
> -
> -			if (*result == SCAN_SUCCEED)
> -				++khugepaged_pages_collapsed;
>
> +			*result = collapse_single_pmd(khugepaged_scan.address,
> +						      vma, &mmap_locked, cc);
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
>  			if (!mmap_locked)
> @@ -2806,9 +2840,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>  		enum scan_result result = SCAN_FAIL;
> -		bool triggered_wb = false;
>
> -retry:
>  		if (!mmap_locked) {
>  			cond_resched();
>  			mmap_read_lock(mm);
> @@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>  		}
>  		mmap_assert_locked(mm);
> -		if (!vma_is_anonymous(vma)) {
> -			struct file *file = get_file(vma->vm_file);
> -			pgoff_t pgoff = linear_page_index(vma, addr);
>
> -			mmap_read_unlock(mm);
> -			mmap_locked = false;
> -			*lock_dropped = true;
> -			result = collapse_scan_file(mm, addr, file, pgoff, cc);
> -
> -			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> -			    mapping_can_writeback(file->f_mapping)) {
> -				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> -				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
>
> -				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> -				triggered_wb = true;
> -				fput(file);
> -				goto retry;
> -			}
> -			fput(file);
> -		} else {
> -			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> -		}
>  		if (!mmap_locked)
>  			*lock_dropped = true;
>
> -handle_result:
>  		switch (result) {
>  		case SCAN_SUCCEED:
>  		case SCAN_PMD_MAPPED:
>  			++thps;
>  			break;
> -		case SCAN_PTE_MAPPED_HUGEPAGE:
> -			BUG_ON(mmap_locked);
> -			mmap_read_lock(mm);
> -			result = try_collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_read_unlock(mm);
> -			goto handle_result;
>  		/* Whitelisted set of results where continuing OK */
>  		case SCAN_NO_PTE_TABLE:
> +		case SCAN_PTE_MAPPED_HUGEPAGE:
>  		case SCAN_PTE_NON_PRESENT:
>  		case SCAN_PTE_UFFD_WP:
>  		case SCAN_LACK_REFERENCED_PAGE:
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-16 18:54   ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 17:22     ` Nico Pache
  2026-03-19 16:01       ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 20+ messages in thread
From: Nico Pache @ 2026-03-18 17:22 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 3/16/26 12:54 PM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
>> The khugepaged daemon and madvise_collapse have two different
>> implementations that do almost the same thing. Create collapse_single_pmd
>> to increase code reuse and create an entry point to these two users.
> 
> Ah this is nice :) Thanks!

Thanks :) hopefully more khugepaged cleanups to come after these series' land.

> 
>>
>> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
>> collapse_single_pmd function. This introduces a minor behavioral change
>> that is most likely an undiscovered bug. The current implementation of
>> khugepaged tests collapse_test_exit_or_disable before calling
>> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
>> case. By unifying these two callers madvise_collapse now also performs
>> this check. We also modify the return value to be SCAN_ANY_PROCESS which
>> properly indicates that this process is no longer valid to operate on.
>>
>> By moving the madvise_collapse writeback-retry logic into the helper
>> function we can also avoid having to revalidate the VMA.
>>
>> We also guard the khugepaged_pages_collapsed variable to ensure its only
>> incremented for khugepaged.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> The logic all seems correct to me, just a bunch of nits below really. This is
> a really nice refactoring! :)
> 
> With them addressed:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thanks I will address those!

> 
> Cheers, Lorenzo
> 
>> ---
>>  mm/khugepaged.c | 120 +++++++++++++++++++++++++-----------------------
>>  1 file changed, 63 insertions(+), 57 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 33ae56e313ed..733c4a42c2ce 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2409,6 +2409,65 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>>  	return result;
>>  }
>>
>> +/*
>> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
>> + * the results.
>> + */
>> +static enum scan_result collapse_single_pmd(unsigned long addr,
>> +		struct vm_area_struct *vma, bool *mmap_locked,
> 
> mmap_locked seems mildly pointless here, and it's a semi-code smell to pass 'is
> locked' flags I think.
> 
> You never read this, but the parameter implies somebody might pass in mmaplocked
> == false, but you know it's always true here.
> 
> Anyway I think it makes more sense to pass in lock_dropped and get rid of
> mmap_locked in madvise_collapse() and just pass in lock_dropped directly
> (setting it false if anon).
> 
> Also obviously update collapse_scan_mm_slot() to use lock_dropped instead just
> inverted.
> 
> That's clearer I think since it makes it a verb rather than a noun and the
> function is dictating whether or not the lock is dropped, it also implies the
> lock is held on entry.

Ok I will give this a shot!

> 
>> +		struct collapse_control *cc)
>> +{
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	bool triggered_wb = false;
>> +	enum scan_result result;
>> +	struct file *file;
>> +	pgoff_t pgoff;
>> +
> 
> Maybe move the mmap_assert_locked() from madvise_collapse() to here? Then we
> assert it in both cases.

ack, Sounds like a good idea!

> 
>> +	if (vma_is_anonymous(vma)) {
>> +		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
>> +		goto end;
>> +	}
>> +
>> +	file = get_file(vma->vm_file);
>> +	pgoff = linear_page_index(vma, addr);
>> +
>> +	mmap_read_unlock(mm);
>> +	*mmap_locked = false;
>> +retry:
>> +	result = collapse_scan_file(mm, addr, file, pgoff, cc);
>> +
>> +	/*
>> +	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
>> +	 * then retry the collapse one time.
>> +	 */
>> +	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
>> +	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
>> +		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>> +		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>> +
>> +		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
>> +		triggered_wb = true;
>> +		goto retry;
> 
> Thinking through this logic I do agree that we don't need to revalidate here,
> which should be quite a nice win, I just don't know why we previously assumed
> we'd have to... or maybe it was just because it became too spaghetti to goto
> around it somehow??

I believe the latter, the retry went at the top of the loop, and the
revalidation was already being done.

> 
>> +	}
>> +	fput(file);
>> +
>> +	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>> +		mmap_read_lock(mm);
>> +		if (collapse_test_exit_or_disable(mm))
>> +			result = SCAN_ANY_PROCESS;
>> +		else
>> +			result = try_collapse_pte_mapped_thp(mm, addr,
>> +							     !cc->is_khugepaged);
>> +		if (result == SCAN_PMD_MAPPED)
>> +			result = SCAN_SUCCEED;
>> +		mmap_read_unlock(mm);
>> +	}
>> +end:
>> +	if (cc->is_khugepaged && result == SCAN_SUCCEED)
>> +		++khugepaged_pages_collapsed;
>> +	return result;
>> +}
>> +
>>  static void collapse_scan_mm_slot(unsigned int progress_max,
>>  		enum scan_result *result, struct collapse_control *cc)
>>  	__releases(&khugepaged_mm_lock)
>> @@ -2479,34 +2538,9 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
>>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
>>  				  hend);
> 
> Nice-to-have, but could we convert these VM_BUG_ON()'s to VM_WARN_ON_ONCE()'s
> while we're passing?

Yeah sure, I have a question about these, because they do concern me (perhaps
out of ignorance). does a WARN_ON_ONCE stop the daemon? I would be concerned
about a rogue khugepaged instance going through and messing with page tables
when it fails some assertion. Could this not lead to serious memory/file
corruptions?

Thanks for the reviews!

Cheers,
-- Nico

> 
>> -			if (!vma_is_anonymous(vma)) {
>> -				struct file *file = get_file(vma->vm_file);
>> -				pgoff_t pgoff = linear_page_index(vma,
>> -						khugepaged_scan.address);
>> -
>> -				mmap_read_unlock(mm);
>> -				mmap_locked = false;
>> -				*result = collapse_scan_file(mm,
>> -					khugepaged_scan.address, file, pgoff, cc);
>> -				fput(file);
>> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>> -					mmap_read_lock(mm);
>> -					if (collapse_test_exit_or_disable(mm))
>> -						goto breakouterloop;
>> -					*result = try_collapse_pte_mapped_thp(mm,
>> -						khugepaged_scan.address, false);
>> -					if (*result == SCAN_PMD_MAPPED)
>> -						*result = SCAN_SUCCEED;
>> -					mmap_read_unlock(mm);
>> -				}
>> -			} else {
>> -				*result = collapse_scan_pmd(mm, vma,
>> -					khugepaged_scan.address, &mmap_locked, cc);
>> -			}
>> -
>> -			if (*result == SCAN_SUCCEED)
>> -				++khugepaged_pages_collapsed;
>>
>> +			*result = collapse_single_pmd(khugepaged_scan.address,
>> +						      vma, &mmap_locked, cc);
>>  			/* move to next address */
>>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
>>  			if (!mmap_locked)
>> @@ -2806,9 +2840,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>
>>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>>  		enum scan_result result = SCAN_FAIL;
>> -		bool triggered_wb = false;
>>
>> -retry:
>>  		if (!mmap_locked) {
>>  			cond_resched();
>>  			mmap_read_lock(mm);
>> @@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>  			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>>  		}
>>  		mmap_assert_locked(mm);
>> -		if (!vma_is_anonymous(vma)) {
>> -			struct file *file = get_file(vma->vm_file);
>> -			pgoff_t pgoff = linear_page_index(vma, addr);
>>
>> -			mmap_read_unlock(mm);
>> -			mmap_locked = false;
>> -			*lock_dropped = true;
>> -			result = collapse_scan_file(mm, addr, file, pgoff, cc);
>> -
>> -			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>> -			    mapping_can_writeback(file->f_mapping)) {
>> -				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>> -				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>> +		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
>>
>> -				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
>> -				triggered_wb = true;
>> -				fput(file);
>> -				goto retry;
>> -			}
>> -			fput(file);
>> -		} else {
>> -			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
>> -		}
>>  		if (!mmap_locked)
>>  			*lock_dropped = true;
>>
>> -handle_result:
>>  		switch (result) {
>>  		case SCAN_SUCCEED:
>>  		case SCAN_PMD_MAPPED:
>>  			++thps;
>>  			break;
>> -		case SCAN_PTE_MAPPED_HUGEPAGE:
>> -			BUG_ON(mmap_locked);
>> -			mmap_read_lock(mm);
>> -			result = try_collapse_pte_mapped_thp(mm, addr, true);
>> -			mmap_read_unlock(mm);
>> -			goto handle_result;
>>  		/* Whitelisted set of results where continuing OK */
>>  		case SCAN_NO_PTE_TABLE:
>> +		case SCAN_PTE_MAPPED_HUGEPAGE:
>>  		case SCAN_PTE_NON_PRESENT:
>>  		case SCAN_PTE_UFFD_WP:
>>  		case SCAN_LACK_REFERENCED_PAGE:
>> --
>> 2.53.0
>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
  2026-03-18 17:22     ` Nico Pache
@ 2026-03-19 16:01       ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 16:01 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, david, dev.jain, gourry, hannes, hughd,
	jackmanb, jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, Mar 18, 2026 at 11:22:07AM -0600, Nico Pache wrote:
>
>
> On 3/16/26 12:54 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Wed, Mar 11, 2026 at 03:13:15PM -0600, Nico Pache wrote:
> >> The khugepaged daemon and madvise_collapse have two different
> >> implementations that do almost the same thing. Create collapse_single_pmd
> >> to increase code reuse and create an entry point to these two users.
> >
> > Ah this is nice :) Thanks!
>
> Thanks :) hopefully more khugepaged cleanups to come after these series' land.
>
> >
> >>
> >> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> >> collapse_single_pmd function. This introduces a minor behavioral change
> >> that is most likely an undiscovered bug. The current implementation of
> >> khugepaged tests collapse_test_exit_or_disable before calling
> >> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> >> case. By unifying these two callers madvise_collapse now also performs
> >> this check. We also modify the return value to be SCAN_ANY_PROCESS which
> >> properly indicates that this process is no longer valid to operate on.
> >>
> >> By moving the madvise_collapse writeback-retry logic into the helper
> >> function we can also avoid having to revalidate the VMA.
> >>
> >> We also guard the khugepaged_pages_collapsed variable to ensure its only
> >> incremented for khugepaged.
> >>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >
> > The logic all seems correct to me, just a bunch of nits below really. This is
> > a really nice refactoring! :)
> >
> > With them addressed:
> >
> > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>
> Thanks I will address those!
>
> >
> > Cheers, Lorenzo
> >
> >> ---
> >>  mm/khugepaged.c | 120 +++++++++++++++++++++++++-----------------------
> >>  1 file changed, 63 insertions(+), 57 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 33ae56e313ed..733c4a42c2ce 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -2409,6 +2409,65 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> >>  	return result;
> >>  }
> >>
> >> +/*
> >> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> >> + * the results.
> >> + */
> >> +static enum scan_result collapse_single_pmd(unsigned long addr,
> >> +		struct vm_area_struct *vma, bool *mmap_locked,
> >
> > mmap_locked seems mildly pointless here, and it's a semi-code smell to pass 'is
> > locked' flags I think.
> >
> > You never read this, but the parameter implies somebody might pass in mmaplocked
> > == false, but you know it's always true here.
> >
> > Anyway I think it makes more sense to pass in lock_dropped and get rid of
> > mmap_locked in madvise_collapse() and just pass in lock_dropped directly
> > (setting it false if anon).
> >
> > Also obviously update collapse_scan_mm_slot() to use lock_dropped instead just
> > inverted.
> >
> > That's clearer I think since it makes it a verb rather than a noun and the
> > function is dictating whether or not the lock is dropped, it also implies the
> > lock is held on entry.
>
> Ok I will give this a shot!
>
> >
> >> +		struct collapse_control *cc)
> >> +{
> >> +	struct mm_struct *mm = vma->vm_mm;
> >> +	bool triggered_wb = false;
> >> +	enum scan_result result;
> >> +	struct file *file;
> >> +	pgoff_t pgoff;
> >> +
> >
> > Maybe move the mmap_assert_locked() from madvise_collapse() to here? Then we
> > assert it in both cases.
>
> ack, Sounds like a good idea!

Thanks + for above! :)

>
> >
> >> +	if (vma_is_anonymous(vma)) {
> >> +		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> >> +		goto end;
> >> +	}
> >> +
> >> +	file = get_file(vma->vm_file);
> >> +	pgoff = linear_page_index(vma, addr);
> >> +
> >> +	mmap_read_unlock(mm);
> >> +	*mmap_locked = false;
> >> +retry:
> >> +	result = collapse_scan_file(mm, addr, file, pgoff, cc);
> >> +
> >> +	/*
> >> +	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
> >> +	 * then retry the collapse one time.
> >> +	 */
> >> +	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> >> +	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
> >> +		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> >> +		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> >> +
> >> +		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> >> +		triggered_wb = true;
> >> +		goto retry;
> >
> > Thinking through this logic I do agree that we don't need to revalidate here,
> > which should be quite a nice win, I just don't know why we previously assumed
> > we'd have to... or maybe it was just because it became too spaghetti to goto
> > around it somehow??
>
> I believe the latter, the retry went at the top of the loop, and the
> revalidation was already being done.

Yeah makes sense!

>
> >
> >> +	}
> >> +	fput(file);
> >> +
> >> +	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >> +		mmap_read_lock(mm);
> >> +		if (collapse_test_exit_or_disable(mm))
> >> +			result = SCAN_ANY_PROCESS;
> >> +		else
> >> +			result = try_collapse_pte_mapped_thp(mm, addr,
> >> +							     !cc->is_khugepaged);
> >> +		if (result == SCAN_PMD_MAPPED)
> >> +			result = SCAN_SUCCEED;
> >> +		mmap_read_unlock(mm);
> >> +	}
> >> +end:
> >> +	if (cc->is_khugepaged && result == SCAN_SUCCEED)
> >> +		++khugepaged_pages_collapsed;
> >> +	return result;
> >> +}
> >> +
> >>  static void collapse_scan_mm_slot(unsigned int progress_max,
> >>  		enum scan_result *result, struct collapse_control *cc)
> >>  	__releases(&khugepaged_mm_lock)
> >> @@ -2479,34 +2538,9 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
> >>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
> >>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
> >>  				  hend);
> >
> > Nice-to-have, but could we convert these VM_BUG_ON()'s to VM_WARN_ON_ONCE()'s
> > while we're passing?
>
> Yeah sure, I have a question about these, because they do concern me (perhaps
> out of ignorance). does a WARN_ON_ONCE stop the daemon? I would be concerned
> about a rogue khugepaged instance going through and messing with page tables
> when it fails some assertion. Could this not lead to serious memory/file
> corruptions?

It won't, but all of this is in CONFIG_DEBUG_VM anyway, so it's a kernel bug if
it happens in a release kernel, this is just for visibility when debugging and
those systems should be e.g. VMS etc.

The fact this is VM_xxx and hasn't apparently fired before suggests we're good.

>
> Thanks for the reviews!
>
> Cheers,
> -- Nico
>
> >
> >> -			if (!vma_is_anonymous(vma)) {
> >> -				struct file *file = get_file(vma->vm_file);
> >> -				pgoff_t pgoff = linear_page_index(vma,
> >> -						khugepaged_scan.address);
> >> -
> >> -				mmap_read_unlock(mm);
> >> -				mmap_locked = false;
> >> -				*result = collapse_scan_file(mm,
> >> -					khugepaged_scan.address, file, pgoff, cc);
> >> -				fput(file);
> >> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >> -					mmap_read_lock(mm);
> >> -					if (collapse_test_exit_or_disable(mm))
> >> -						goto breakouterloop;
> >> -					*result = try_collapse_pte_mapped_thp(mm,
> >> -						khugepaged_scan.address, false);
> >> -					if (*result == SCAN_PMD_MAPPED)
> >> -						*result = SCAN_SUCCEED;
> >> -					mmap_read_unlock(mm);
> >> -				}
> >> -			} else {
> >> -				*result = collapse_scan_pmd(mm, vma,
> >> -					khugepaged_scan.address, &mmap_locked, cc);
> >> -			}
> >> -
> >> -			if (*result == SCAN_SUCCEED)
> >> -				++khugepaged_pages_collapsed;
> >>
> >> +			*result = collapse_single_pmd(khugepaged_scan.address,
> >> +						      vma, &mmap_locked, cc);
> >>  			/* move to next address */
> >>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> >>  			if (!mmap_locked)
> >> @@ -2806,9 +2840,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>
> >>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> >>  		enum scan_result result = SCAN_FAIL;
> >> -		bool triggered_wb = false;
> >>
> >> -retry:
> >>  		if (!mmap_locked) {
> >>  			cond_resched();
> >>  			mmap_read_lock(mm);
> >> @@ -2823,46 +2855,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>  			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> >>  		}
> >>  		mmap_assert_locked(mm);
> >> -		if (!vma_is_anonymous(vma)) {
> >> -			struct file *file = get_file(vma->vm_file);
> >> -			pgoff_t pgoff = linear_page_index(vma, addr);
> >>
> >> -			mmap_read_unlock(mm);
> >> -			mmap_locked = false;
> >> -			*lock_dropped = true;
> >> -			result = collapse_scan_file(mm, addr, file, pgoff, cc);
> >> -
> >> -			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> >> -			    mapping_can_writeback(file->f_mapping)) {
> >> -				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> >> -				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> >> +		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> >>
> >> -				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> >> -				triggered_wb = true;
> >> -				fput(file);
> >> -				goto retry;
> >> -			}
> >> -			fput(file);
> >> -		} else {
> >> -			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> >> -		}
> >>  		if (!mmap_locked)
> >>  			*lock_dropped = true;
> >>
> >> -handle_result:
> >>  		switch (result) {
> >>  		case SCAN_SUCCEED:
> >>  		case SCAN_PMD_MAPPED:
> >>  			++thps;
> >>  			break;
> >> -		case SCAN_PTE_MAPPED_HUGEPAGE:
> >> -			BUG_ON(mmap_locked);
> >> -			mmap_read_lock(mm);
> >> -			result = try_collapse_pte_mapped_thp(mm, addr, true);
> >> -			mmap_read_unlock(mm);
> >> -			goto handle_result;
> >>  		/* Whitelisted set of results where continuing OK */
> >>  		case SCAN_NO_PTE_TABLE:
> >> +		case SCAN_PTE_MAPPED_HUGEPAGE:
> >>  		case SCAN_PTE_NON_PRESENT:
> >>  		case SCAN_PTE_UFFD_WP:
> >>  		case SCAN_LACK_REFERENCED_PAGE:
> >> --
> >> 2.53.0
> >>
> >
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites
  2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
                   ` (4 preceding siblings ...)
  2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
@ 2026-03-11 21:34 ` Andrew Morton
  5 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2026-03-11 21:34 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, aarcange, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, david, dev.jain, gourry, hannes, hughd, jackmanb,
	jack, jannh, jglisse, joshua.hahnjy, kas, lance.yang,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Wed, 11 Mar 2026 15:13:10 -0600 Nico Pache <npache@redhat.com> wrote:

> The following series contains cleanups and prerequisites for my work on
> khugepaged mTHP support [1]. These have been separated out to ease review.

Thanks, I'll add[1] these to mm.git's mm-new branch for testing.  Later (a
few days) I'll hopefully move them into the mm-unstable branch, where
they will get linux-next exposure.

[1]: I'll suppress the usual added-to-mm emails.  What a lot of cc's.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2026-03-20  7:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-11 21:13 [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Nico Pache
2026-03-11 21:13 ` [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers Nico Pache
2026-03-16 18:17   ` Lorenzo Stoakes (Oracle)
2026-03-18 16:48     ` Nico Pache
2026-03-11 21:13 ` [PATCH mm-unstable v3 2/5] mm: introduce is_pmd_order helper Nico Pache
2026-03-11 21:13 ` [PATCH mm-unstable v3 3/5] mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1 Nico Pache
2026-03-16 18:18   ` Lorenzo Stoakes (Oracle)
2026-03-18 16:50     ` Nico Pache
2026-03-11 21:13 ` [PATCH mm-unstable v3 4/5] mm/khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2026-03-11 21:13 ` [PATCH mm-unstable v3 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd() Nico Pache
2026-03-12  2:04   ` Wei Yang
2026-03-18 16:54     ` Nico Pache
2026-03-20  7:53       ` Wei Yang
2026-03-12  9:27   ` David Hildenbrand (Arm)
2026-03-13  8:33   ` Baolin Wang
2026-03-15 15:16   ` Lance Yang
2026-03-16 18:54   ` Lorenzo Stoakes (Oracle)
2026-03-18 17:22     ` Nico Pache
2026-03-19 16:01       ` Lorenzo Stoakes (Oracle)
2026-03-11 21:34 ` [PATCH mm-unstable v3 0/5] mm: khugepaged cleanups and mTHP prerequisites Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox