public inbox for linux-mm@kvack.org
* [PATCH mm-unstable v15 00/13] khugepaged: mTHP support
@ 2026-02-26  3:17 Nico Pache
  2026-02-26  3:22 ` [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for " Nico Pache
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:17 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

This series depends on [1], which provides some cleanups and prerequisites.
Some of those patches previously belonged to v14 of this series.

The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
pages that are occupied (!none/zero). After the PMD scan is done, we use
the bitmap to find the optimal mTHP sizes for the PMD range. The
restriction on max_ptes_none is removed during the scan, to make sure we
account for the whole PMD range in the bitmap. When no mTHP size is
enabled, the legacy behavior of khugepaged is maintained.
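
As a rough userspace sketch of the bitmap idea (names and structure are
illustrative only, not the actual mm/khugepaged.c code): after the PMD scan
fills an occupancy bitmap, the largest naturally aligned window that is
fully occupied selects the candidate mTHP order.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_NR 512	/* PTEs per PMD with 4K pages (HPAGE_PMD_NR) */

static int popcount_range(const bool *occupied, int start, int len)
{
	int n = 0;

	for (int i = 0; i < len; i++)
		n += occupied[start + i];
	return n;
}

/*
 * Return the highest order <= max_order whose naturally aligned
 * (1 << order)-slot window containing @index is fully occupied, or -1.
 */
static int best_collapse_order(const bool *occupied, int index, int max_order)
{
	for (int order = max_order; order >= 0; order--) {
		int nr = 1 << order;
		int start = index & ~(nr - 1);	/* align down */

		if (popcount_range(occupied, start, nr) == nr)
			return order;
	}
	return -1;
}
```

For example, with only the first 16 of 512 slots occupied, index 3 resolves
to order 4 (a 64K mTHP with 4K pages), and an index in an empty slot yields
-1. The real scan additionally applies the max_ptes_none policy rather than
requiring full occupancy.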

We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
(i.e. 511). If any other value is specified, the kernel will emit a
warning and no mTHP collapse will be attempted. If an mTHP collapse is
attempted but the range contains swapped-out or shared pages, we don't
perform the collapse. It is now also possible to collapse to mTHPs
without requiring the PMD THP size to be enabled. These limitations are
in place to prevent collapse "creep" behavior: constantly promoting
mTHPs to the next available size, which would occur because a collapse
introduces more non-zero pages that would satisfy the promotion
condition on subsequent scans.
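
To see why an unrestricted max_ptes_none causes creep, consider the naive
per-order scaling of the threshold (a hypothetical userspace model, with
HPAGE_PMD_ORDER == 9 for 4K pages):

```c
#include <assert.h>

#define HPAGE_PMD_ORDER 9

/* Naive scaling of max_ptes_none down to a smaller collapse order. */
static int scaled_max_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/*
 * After a successful order-N collapse the whole (1 << N)-page window is
 * present.  "Creep" occurs when that alone already satisfies the
 * order-(N + 1) collapse condition on the next scan.
 */
static int creeps(int max_ptes_none, int order)
{
	int present_after = 1 << order;
	int need_present = (2 << order) -
			   scaled_max_none(max_ptes_none, order + 1);

	return present_after >= need_present;
}
```

In this model, max_ptes_none = 288 (above HPAGE_PMD_NR / 2) lets an order-4
collapse immediately qualify for order 5, then order 6, and so on, while 255
does not. max_ptes_none = 511 "creeps" too, but there that is the intended
behavior: always collapse to the largest enabled size.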

Patch 1-5:   Generalize khugepaged functions for arbitrary orders and
             introduce some helper functions
Patch 6:     Skip collapsing mTHP to smaller orders
Patch 7-8:   Add per-order mTHP statistics and tracepoints
Patch 9:     Introduce collapse_allowable_orders
Patch 10-12: Introduce bitmap and mTHP collapse support, fully enabled
Patch 13:    Documentation

Testing:
- Built for x86_64, aarch64, ppc64le, and s390x
- ran all arches on test suites provided by the kernel-tests project
- internal testing suites: functional testing and performance testing
- selftests mm
- I created a test script that I used to push khugepaged to its limits
   while monitoring a number of stats and tracepoints. The code is
   available here[2] (Run in legacy mode for these changes and set mthp
   sizes to inherit)
   The summary of my testing was that no significant regression was
   noticed through this test. In some cases my changes had better
   collapse latencies and were able to scan more pages in the same
   amount of time/work, but for the most part the results were
   consistent.
- Redis testing. I did some testing with these changes along with my
  defer changes (see the follow-up post [4] for more details). We've
  decided to get the mTHP changes merged first before attempting the
  defer series.
- some basic testing on 64k page size.
- lots of general use.

V15 changes:
- Split the series into two [1] to ease review, and keep this series
  fully khugepaged related (David, Lorenzo)
- Refactored collapse_max_ptes_none to remove the full_scan boolean arg,
  moving the logic to the caller (Lorenzo)
- added /*bool=*/ comments to ambiguous function arguments (Lorenzo)
- A few changes that were requested in v14 were done in [1], such as
  introducing map_anon_folio_pte_(no)pf, defining the
  COLLAPSE_MAX_PTES_LIMIT macro, and the fixup of the writeback retry
  logic. These changes were noted in the v1 of the cleanup series [1].
  Some of these requested changes are leveraged in this series
  (is_pmd_order, DEFINE usage, and map_anon_folio_pte_(no)pf).

V14: https://lore.kernel.org/lkml/20260122192841.128719-1-npache@redhat.com/
V13: https://lore.kernel.org/lkml/20251201174627.23295-1-npache@redhat.com/
V12: https://lore.kernel.org/lkml/20251022183717.70829-1-npache@redhat.com/
V11: https://lore.kernel.org/lkml/20250912032810.197475-1-npache@redhat.com/
V10: https://lore.kernel.org/lkml/20250819134205.622806-1-npache@redhat.com/
V9 : https://lore.kernel.org/lkml/20250714003207.113275-1-npache@redhat.com/
V8 : https://lore.kernel.org/lkml/20250702055742.102808-1-npache@redhat.com/
V7 : https://lore.kernel.org/lkml/20250515032226.128900-1-npache@redhat.com/
V6 : https://lore.kernel.org/lkml/20250515030312.125567-1-npache@redhat.com/
V5 : https://lore.kernel.org/lkml/20250428181218.85925-1-npache@redhat.com/
V4 : https://lore.kernel.org/lkml/20250417000238.74567-1-npache@redhat.com/
V3 : https://lore.kernel.org/lkml/20250414220557.35388-1-npache@redhat.com/
V2 : https://lore.kernel.org/lkml/20250211003028.213461-1-npache@redhat.com/
V1 : https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/

A big thanks to everyone who has reviewed, tested, and participated in
the development process. It's been a great experience working with all
of you on this endeavour.

[1] - https://lore.kernel.org/lkml/20260226012929.169479-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://lore.kernel.org/lkml/20260212021835.17755-1-npache@redhat.com/
[4] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Baolin Wang (1):
  mm/khugepaged: run khugepaged for all orders

Dev Jain (1):
  mm/khugepaged: generalize alloc_charge_folio()

Nico Pache (11):
  mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
  mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  mm/khugepaged: introduce collapse_max_ptes_none helper function
  mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  mm/khugepaged: skip collapsing mTHP to smaller orders
  mm/khugepaged: add per-order mTHP collapse failure statistics
  mm/khugepaged: improve tracepoints for mTHP orders
  mm/khugepaged: introduce collapse_allowable_orders helper function
  mm/khugepaged: Introduce mTHP collapse support
  mm/khugepaged: avoid unnecessary mTHP collapse attempts
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  80 +++-
 include/linux/huge_mm.h                    |   5 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 519 +++++++++++++++++----
 5 files changed, 522 insertions(+), 127 deletions(-)

-- 
2.53.0




* [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
@ 2026-02-26  3:22 ` Nico Pache
  2026-03-12 20:00   ` David Hildenbrand (Arm)
  2026-02-26  3:23 ` [PATCH mm-unstable v15 02/13] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:22 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

For khugepaged to support different mTHP orders, generalize
hugepage_vma_revalidate() to check that the PMD range is not shared
with another VMA and that the requested collapse order is enabled.

No functional change in this patch. Also correct a comment about the
functionality of the revalidation.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0058970d4579..c7f2c4a90910 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -895,12 +895,13 @@ static int collapse_find_target_node(struct collapse_control *cc)
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
- * before taking mmap_lock.
+ * after taking the mmap_lock again.
  * Returns enum scan_result value.
  */
 
 static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-		bool expect_anon, struct vm_area_struct **vmap, struct collapse_control *cc)
+		bool expect_anon, struct vm_area_struct **vmap,
+		struct collapse_control *cc, unsigned int order)
 {
 	struct vm_area_struct *vma;
 	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
@@ -913,15 +914,16 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l
 	if (!vma)
 		return SCAN_VMA_NULL;
 
+	/* Always check the PMD order to ensure it's not shared by another VMA */
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+	if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order)))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
 	 *
-	 * thp_vma_allowable_order may return true for qualified file
+	 * thp_vma_allowable_orders may return true for qualified file
 	 * vmas.
 	 */
 	if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
@@ -1114,7 +1116,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1148,7 +1151,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2863,7 +2867,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_locked = true;
 			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
+							 cc, HPAGE_PMD_ORDER);
 			if (result  != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
-- 
2.53.0




* [PATCH mm-unstable v15 02/13] mm/khugepaged: generalize alloc_charge_folio()
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
  2026-02-26  3:22 ` [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for " Nico Pache
@ 2026-02-26  3:23 ` Nico Pache
  2026-03-12 20:05   ` David Hildenbrand (Arm)
  2026-02-26  3:23 ` [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:23 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

From: Dev Jain <dev.jain@arm.com>

Pass order to alloc_charge_folio() and update mTHP statistics.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
 include/linux/huge_mm.h                    |  2 ++
 mm/huge_memory.c                           |  4 ++++
 mm/khugepaged.c                            | 17 +++++++++++------
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5fbc3d89bb07..c51932e6275d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -639,6 +639,14 @@ anon_fault_fallback_charge
 	instead falls back to using huge pages with lower orders or
 	small pages even though the allocation was successful.
 
+collapse_alloc
+	is incremented every time a huge page is successfully allocated for a
+	khugepaged collapse.
+
+collapse_alloc_failed
+	is incremented every time a huge page allocation fails during a
+	khugepaged collapse.
+
 zswpout
 	is incremented every time a huge page is swapped out to zswap in one
 	piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd7f0e1d8094..9941fc6d7bd8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,6 +128,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a688d5ff806e..228f35e962b9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -624,6 +624,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -689,6 +691,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c7f2c4a90910..a9b645402b7f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1061,21 +1061,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-		struct collapse_control *cc)
+		struct collapse_control *cc, unsigned int order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (is_pmd_order(order))
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}
 
-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (is_pmd_order(order))
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1111,7 +1116,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1891,7 +1896,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.53.0




* [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
  2026-02-26  3:22 ` [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for " Nico Pache
  2026-02-26  3:23 ` [PATCH mm-unstable v15 02/13] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2026-02-26  3:23 ` Nico Pache
  2026-03-12 20:32   ` David Hildenbrand (Arm)
  2026-02-26  3:24 ` [PATCH mm-unstable v15 04/13] mm/khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:23 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Generalize the __collapse_huge_page_* functions to take an arbitrary
order in support of future mTHP collapse.

mTHP collapse will not honor the khugepaged_max_ptes_shared or
khugepaged_max_ptes_swap parameters, and will fail if it encounters a
shared or swapped entry.

No functional changes in this patch.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a9b645402b7f..ecdbbf6a01a6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -535,7 +535,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
 
 static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
-		struct list_head *compound_pagelist)
+		unsigned int order, struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
 	struct folio *folio = NULL;
@@ -543,15 +543,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
+	const unsigned long nr_pages = 1UL << order;
+	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -585,8 +587,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			/*
+			 * TODO: Support shared pages without leading to further
+			 * mTHP collapses. Currently bringing in new pages via
+			 * shared may cause a future higher order collapse on a
+			 * rescan of the same range.
+			 */
+			if (!is_pmd_order(order) || (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared)) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -679,18 +687,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 }
 
 static void __collapse_huge_page_copy_succeeded(pte_t *pte,
-						struct vm_area_struct *vma,
-						unsigned long address,
-						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+		struct vm_area_struct *vma, unsigned long address,
+		spinlock_t *ptl, unsigned int order,
+		struct list_head *compound_pagelist)
 {
-	unsigned long end = address + HPAGE_PMD_SIZE;
+	unsigned long end = address + (PAGE_SIZE << order);
 	struct folio *src, *tmp;
 	pte_t pteval;
 	pte_t *_pte;
 	unsigned int nr_ptes;
+	const unsigned long nr_pages = 1UL << order;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
 	     address += nr_ptes * PAGE_SIZE) {
 		nr_ptes = 1;
 		pteval = ptep_get(_pte);
@@ -743,13 +751,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 }
 
 static void __collapse_huge_page_copy_failed(pte_t *pte,
-					     pmd_t *pmd,
-					     pmd_t orig_pmd,
-					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
+		unsigned int order, struct list_head *compound_pagelist)
 {
 	spinlock_t *pmd_ptl;
-
+	const unsigned long nr_pages = 1UL << order;
 	/*
 	 * Re-establish the PMD to point to the original page table
 	 * entry. Restoring PMD needs to be done prior to releasing
@@ -763,7 +769,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
 }
 
 /*
@@ -783,16 +789,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
  */
 static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
-		unsigned long address, spinlock_t *ptl,
+		unsigned long address, spinlock_t *ptl, unsigned int order,
 		struct list_head *compound_pagelist)
 {
 	unsigned int i;
 	enum scan_result result = SCAN_SUCCEED;
-
+	const unsigned long nr_pages = 1UL << order;
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -811,10 +817,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    order, compound_pagelist);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 order, compound_pagelist);
 
 	return result;
 }
@@ -985,12 +991,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
 static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
-		int referenced)
+		struct vm_area_struct *vma, unsigned long start_addr,
+		pmd_t *pmd, int referenced, unsigned int order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
 	enum scan_result result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1022,6 +1028,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 		    pte_present(vmf.orig_pte))
 			continue;
 
+		/*
+		 * TODO: Support swapin without leading to further mTHP
+		 * collapses. Currently bringing in new pages via swapin may
+		 * cause a future higher order collapse on a rescan of the same
+		 * range.
+		 */
+		if (!is_pmd_order(order)) {
+			pte_unmap(pte);
+			mmap_read_unlock(mm);
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1141,7 +1160,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+						     referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1189,6 +1208,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      HPAGE_PMD_ORDER,
 						      &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
@@ -1219,6 +1239,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
+					   HPAGE_PMD_ORDER,
 					   &compound_pagelist);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
-- 
2.53.0




* [PATCH mm-unstable v15 04/13] mm/khugepaged: introduce collapse_max_ptes_none helper function
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (2 preceding siblings ...)
  2026-02-26  3:23 ` [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2026-02-26  3:24 ` Nico Pache
  2026-02-26  3:24 ` [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:24 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

The current mechanism for determining mTHP collapse scales the
khugepaged_max_ptes_none value based on the target order. This
introduces an undesirable feedback loop, or "creep", when max_ptes_none
is set to a value greater than HPAGE_PMD_NR / 2.

With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.

To fix this issue, introduce a helper function that limits mTHP
collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
This effectively supports two modes:

- max_ptes_none=0: never introduce new none-pages for mTHP collapse.
- max_ptes_none=511 (with a 4K page size): always collapse to the
  highest available mTHP order.

This removes the possibility of "creep" while not modifying any uAPI
expectations. A warning will be emitted if any unsupported
max_ptes_none value is configured while mTHP collapse is enabled.
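
A userspace mirror of the policy just described (illustrative only; the
kernel helper lives in mm/khugepaged.c and this sketch is not its exact
code):

```c
#include <assert.h>
#include <errno.h>

#define HPAGE_PMD_ORDER		9
#define HPAGE_PMD_NR		(1 << HPAGE_PMD_ORDER)
#define COLLAPSE_MAX_PTES_LIMIT	(HPAGE_PMD_NR - 1)

static int khugepaged_max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;

/* Userspace mirror of the max_ptes_none policy described above. */
static int max_ptes_none_for(unsigned int order)
{
	if (order == HPAGE_PMD_ORDER)		/* PMD collapse: unchanged */
		return khugepaged_max_ptes_none;

	if (!khugepaged_max_ptes_none)		/* never add none-pages */
		return 0;

	if (khugepaged_max_ptes_none == COLLAPSE_MAX_PTES_LIMIT)
		return (1 << order) - 1;	/* always collapse */

	return -EINVAL;				/* unsupported for mTHP */
}
```

With the default 511, an order-4 mTHP collapse may tolerate up to 15 empty
PTEs; an intermediate value such as 256 is rejected for mTHP orders while
still being honored for the PMD order.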

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ecdbbf6a01a6..99f78f0e44c6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -456,6 +456,36 @@ void __khugepaged_enter(struct mm_struct *mm)
 		wake_up_interruptible(&khugepaged_wait);
 }
 
+/**
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
+ * @order: The folio order being collapsed to
+ *
+ * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
+ * khugepaged_max_ptes_none value.
+ *
+ * For mTHP collapses, we currently only support khugepaged_max_ptes_none
+ * values of 0 or COLLAPSE_MAX_PTES_LIMIT. Any other value will emit a warning
+ * and no mTHP collapse will be attempted.
+ *
+ * Return: Maximum number of empty PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_none(unsigned int order)
+{
+	if (is_pmd_order(order))
+		return khugepaged_max_ptes_none;
+
+	/* Zero/non-present collapse disabled. */
+	if (!khugepaged_max_ptes_none)
+		return 0;
+
+	if (khugepaged_max_ptes_none == COLLAPSE_MAX_PTES_LIMIT)
+		return (1 << order) - 1;
+
+	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
+		      COLLAPSE_MAX_PTES_LIMIT);
+	return -EINVAL;
+}
+
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
@@ -541,10 +571,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	struct folio *folio = NULL;
 	unsigned long addr = start_addr;
 	pte_t *_pte;
+	int max_ptes_none;
 	int none_or_zero = 0, shared = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
 	const unsigned long nr_pages = 1UL << order;
-	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+
+	if (cc->is_khugepaged)
+		max_ptes_none = collapse_max_ptes_none(order);
+	else
+		max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
+
+	if (max_ptes_none == -EINVAL)
+		return result;
 
 	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, addr += PAGE_SIZE) {
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (3 preceding siblings ...)
  2026-02-26  3:24 ` [PATCH mm-unstable v15 04/13] mm/khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
@ 2026-02-26  3:24 ` Nico Pache
  2026-03-17 16:51   ` Lorenzo Stoakes (Oracle)
  2026-02-26  3:24 ` [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:24 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Pass an order and offset to collapse_huge_page to support collapsing anon
memory to arbitrary orders within a PMD. order indicates what mTHP size we
are attempting to collapse to, and offset indicates where in the PMD to
start the collapse attempt.

For non-PMD collapse we must leave the anon VMA write locked until after
we collapse the mTHP: in the PMD case all the pages are isolated, but in
the mTHP case this is not true, so we must keep the lock to prevent
changes to the VMA from occurring.

Also convert these BUG_ON()s to WARN_ON_ONCE()s, as these conditions,
while unexpected, should not bring down the system.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 102 +++++++++++++++++++++++++++++-------------------
 1 file changed, 62 insertions(+), 40 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99f78f0e44c6..fb3ba8fe5a6c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1150,44 +1150,53 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
 	return SCAN_SUCCEED;
 }
 
-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
-		int referenced, int unmapped, struct collapse_control *cc)
+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
+		int referenced, int unmapped, struct collapse_control *cc,
+		bool *mmap_locked, unsigned int order)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte = NULL;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	enum scan_result result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	bool anon_vma_locked = false;
+	const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
 
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
 
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
 	 * sync compaction, and we do not need to hold the mmap_lock during
 	 * that. We will recheck the vma after taking it again in write mode.
+	 * If collapsing mTHPs we may have already released the read_lock.
 	 */
-	mmap_read_unlock(mm);
+	if (*mmap_locked) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+	}
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
+		*mmap_locked = false;
 		goto out_nolock;
 	}
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
+		*mmap_locked = false;
 		goto out_nolock;
 	}
 
@@ -1197,13 +1206,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced, HPAGE_PMD_ORDER);
-		if (result != SCAN_SUCCEED)
+		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
+						     referenced, order);
+		if (result != SCAN_SUCCEED) {
+			*mmap_locked = false;
 			goto out_nolock;
+		}
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1213,20 +1225,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	vma_start_write(vma);
-	result = check_pmd_still_valid(mm, address, pmd);
+	result = check_pmd_still_valid(mm, pmd_address, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
 	anon_vma_lock_write(vma->anon_vma);
+	anon_vma_locked = true;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
+				start_addr + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
@@ -1238,24 +1250,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * Parallel GUP-fast is fine since GUP-fast will back off when
 	 * it detects PMD is changed.
 	 */
-	_pmd = pmdp_collapse_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      HPAGE_PMD_ORDER,
-						      &compound_pagelist);
+		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
+						      order, &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_NO_PTE_TABLE;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1265,21 +1274,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 */
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
-		anon_vma_unlock_write(vma->anon_vma);
 		goto out_up_write;
 	}
 
 	/*
-	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
+	 * For PMD collapse all pages are isolated and locked so anon_vma
+	 * rmap can't run anymore. For mTHP collapse we must hold the lock.
 	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	if (is_pmd_order(order)) {
+		anon_vma_unlock_write(vma->anon_vma);
+		anon_vma_locked = false;
+	}
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   HPAGE_PMD_ORDER,
-					   &compound_pagelist);
-	pte_unmap(pte);
+					   vma, start_addr, pte_ptl,
+					   order, &compound_pagelist);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
 
@@ -1289,20 +1298,34 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
+	if (is_pmd_order(order)) { /* PMD collapse */
+		pgtable = pmd_pgtable(_pmd);
 
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
+		spin_lock(pmd_ptl);
+		WARN_ON_ONCE(!pmd_none(*pmd));
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
+	} else { /* mTHP collapse */
+		spin_lock(pmd_ptl);
+		WARN_ON_ONCE(!pmd_none(*pmd));
+		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+	}
 	spin_unlock(pmd_ptl);
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
 out_up_write:
+	if (anon_vma_locked)
+		anon_vma_unlock_write(vma->anon_vma);
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
+	*mmap_locked = false;
 out_nolock:
+	WARN_ON_ONCE(*mmap_locked);
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
@@ -1483,9 +1506,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+					    unmapped, cc, mmap_locked,
+					    HPAGE_PMD_ORDER);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.53.0




* [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (4 preceding siblings ...)
  2026-02-26  3:24 ` [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2026-02-26  3:24 ` Nico Pache
  2026-03-12 21:00   ` David Hildenbrand (Arm)
  2026-02-26  3:25 ` [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:24 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
if it's OK to collapse to a smaller mTHP size (as in the case of a
partially mapped folio).

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fb3ba8fe5a6c..c739f26dd61e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -638,6 +638,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				goto out;
 			}
 		}
+		/*
+		 * TODO: In some cases of partially-mapped folios, we'd actually
+		 * want to collapse.
+		 */
+		if (!is_pmd_order(order) && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
 
 		if (folio_test_large(folio)) {
 			struct folio *f;
-- 
2.53.0




* [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (5 preceding siblings ...)
  2026-02-26  3:24 ` [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2026-02-26  3:25 ` Nico Pache
  2026-03-12 21:03   ` David Hildenbrand (Arm)
  2026-03-17 17:05   ` Lorenzo Stoakes (Oracle)
  2026-02-26  3:25 ` [PATCH mm-unstable v15 08/13] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:25 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Add three new mTHP statistics to track collapse failures for different
orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:

- collapse_exceed_swap_pte: Counts when mTHP collapse fails due to swap
	PTEs

- collapse_exceed_none_pte: Counts when mTHP collapse fails due to
  	exceeding the none PTE threshold for the given order

- collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
  	PTEs

These statistics complement the existing THP_SCAN_EXCEED_* events by
providing per-order granularity for mTHP collapse attempts. The stats are
exposed via sysfs under
`/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
supported hugepage size.

As we currently don't support collapsing mTHPs that contain a swap or
shared entry, these statistics track how often we encounter failed mTHP
collapses due to these restrictions.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
 include/linux/huge_mm.h                    |  3 +++
 mm/huge_memory.c                           |  7 +++++++
 mm/khugepaged.c                            | 16 ++++++++++++---
 4 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c51932e6275d..eebb1f6bbc6c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -714,6 +714,30 @@ nr_anon_partially_mapped
        an anonymous THP as "partially mapped" and count it here, even though it
        is not actually partially mapped anymore.
 
+collapse_exceed_none_pte
+       The number of collapse attempts that failed due to exceeding the
+       max_ptes_none threshold. For mTHP collapse, currently only max_ptes_none
+       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
+       emit a warning and no mTHP collapse will be attempted. khugepaged will
+       try to collapse to the largest enabled (m)THP size; if it fails, it will
+       try the next lower enabled mTHP size. This counter records the number of
+       times a collapse attempt was skipped for exceeding the max_ptes_none
+       threshold, and khugepaged will move on to the next available mTHP size.
+
+collapse_exceed_swap_pte
+       The number of anonymous mTHP PTE ranges which were unable to collapse due
+       to containing at least one swap PTE. Currently khugepaged does not
+       support collapsing mTHP regions that contain a swap PTE. This counter can
+       be used to monitor the number of khugepaged mTHP collapses that failed
+       due to the presence of a swap PTE.
+
+collapse_exceed_shared_pte
+       The number of anonymous mTHP PTE ranges which were unable to collapse due
+       to containing at least one shared PTE. Currently khugepaged does not
+       support collapsing mTHP PTE ranges that contain a shared PTE. This
+       counter can be used to monitor the number of khugepaged mTHP collapses
+       that failed due to the presence of a shared PTE.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9941fc6d7bd8..e8777bb2347d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
 	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+	MTHP_STAT_COLLAPSE_EXCEED_NONE,
+	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 228f35e962b9..1049a207a257 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
 	&nr_anon_partially_mapped_attr.attr,
+	&collapse_exceed_swap_pte_attr.attr,
+	&collapse_exceed_none_pte_attr.attr,
+	&collapse_exceed_shared_pte_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c739f26dd61e..a6cf90e09e4a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				if (is_pmd_order(order))
+					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out;
 			}
 		}
@@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			 * shared may cause a future higher order collapse on a
 			 * rescan of the same range.
 			 */
-			if (!is_pmd_order(order) || (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared)) {
+			if (!is_pmd_order(order)) {
+				result = SCAN_EXCEED_SHARED_PTE;
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+				goto out;
+			}
+
+			if (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
 				goto out;
 			}
 		}
@@ -1081,6 +1090,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 		 * range.
 		 */
 		if (!is_pmd_order(order)) {
+			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 			pte_unmap(pte);
 			mmap_read_unlock(mm);
 			result = SCAN_EXCEED_SWAP_PTE;
-- 
2.53.0




* [PATCH mm-unstable v15 08/13] mm/khugepaged: improve tracepoints for mTHP orders
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (6 preceding siblings ...)
  2026-02-26  3:25 ` [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2026-02-26  3:25 ` Nico Pache
  2026-03-12 21:05   ` David Hildenbrand (Arm)
  2026-02-26  3:25 ` [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:25 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
give better insight into what order is being operated on.

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    |  9 ++++----
 2 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index bcdc57eea270..c79dbcd60bdf 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -89,40 +89,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s order=%u",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct folio *folio, int none_or_zero,
-		 int referenced, int status),
+		 int referenced, int status, unsigned int order),
 
-	TP_ARGS(folio, none_or_zero, referenced, status),
+	TP_ARGS(folio, none_or_zero, referenced, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
 		__field(int, none_or_zero)
 		__field(int, referenced)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -130,26 +134,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->none_or_zero = none_or_zero;
 		__entry->referenced = referenced;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s order=%u",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+		 unsigned int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -157,13 +165,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a6cf90e09e4a..2e66d660ee8e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -731,13 +731,13 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-						    referenced, result);
+						    referenced, result, order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-					    referenced, result);
+					    referenced, result, order);
 	return result;
 }
 
@@ -1131,7 +1131,8 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+					   order);
 	return result;
 }
 
@@ -1346,7 +1347,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	WARN_ON_ONCE(*mmap_locked);
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.53.0




* [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (7 preceding siblings ...)
  2026-02-26  3:25 ` [PATCH mm-unstable v15 08/13] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2026-02-26  3:25 ` Nico Pache
  2026-03-12 21:09   ` David Hildenbrand (Arm)
  2026-03-17 17:08   ` Lorenzo Stoakes (Oracle)
  2026-02-26  3:26 ` [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support Nico Pache
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:25 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Add collapse_allowable_orders() to generalize THP order eligibility. The
function determines which THP orders are permitted based on collapse
context (khugepaged vs madv_collapse).

This consolidates collapse configuration logic and provides a clean
interface for future mTHP collapse support where the orders may be
different.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2e66d660ee8e..2fdfb6d42cf9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -486,12 +486,22 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
 	return -EINVAL;
 }
 
+/* Check what orders are allowed based on the vma and collapse type */
+static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
+			vm_flags_t vm_flags, bool is_khugepaged)
+{
+	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
+	unsigned long orders = BIT(HPAGE_PMD_ORDER);
+
+	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+}
+
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -2637,7 +2647,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/true)) {
 			progress++;
 			continue;
 		}
@@ -2949,7 +2959,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/false))
 		return -EINVAL;
 
 	cc = kmalloc_obj(*cc);
-- 
2.53.0




* [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (8 preceding siblings ...)
  2026-02-26  3:25 ` [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2026-02-26  3:26 ` Nico Pache
  2026-03-12 21:16   ` David Hildenbrand (Arm)
  2026-03-17 21:36   ` Lorenzo Stoakes (Oracle)
  2026-02-26  3:26 ` [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:26 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Enable khugepaged to collapse to mTHP orders. This patch implements the
main scanning logic using a bitmap to track occupied pages and a stack
structure that allows us to find optimal collapse sizes.

Prior to this patch, PMD collapse had three main phases: a lightweight
scanning phase (mmap_read_lock) that determines a potential PMD
collapse, an allocation phase (mmap unlocked), and finally a heavier
collapse phase (mmap_write_lock).

To enable mTHP collapse we make the following changes:

During the PMD scan phase, track occupied pages in a bitmap. When mTHP
orders are enabled, we remove the max_ptes_none restriction during the
scan phase to avoid missing potential mTHP collapse candidates. Once we
have scanned the full PMD range and updated the bitmap to track occupied
pages, we use the bitmap to find the optimal mTHP size.

Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
and determine the best eligible order for the collapse. A stack structure
is used instead of traditional recursion to manage the search. The
algorithm recursively splits the bitmap into smaller chunks to find the
highest order mTHPs that satisfy the collapse criteria. We start by
attempting the PMD order, then move on to consecutively lower orders
(mTHP collapse). The stack maintains a pair of variables (offset, order),
indicating the number of PTEs from the start of the PMD, and the order of
the potential collapse candidate.

The algorithm for consuming the bitmap works as such:
    1) push (0, HPAGE_PMD_ORDER) onto the stack
    2) pop the stack
    3) check if the number of set bits in that (offset, order) pair
       satisfies the max_ptes_none threshold for that order
    4) if yes, attempt collapse
    5) if no (or collapse fails), push two new stack items representing
       the left and right halves of the current bitmap range, at the
       next lower order
    6) repeat from step (2) until the stack is empty.

Below is a diagram representing the algorithm and stack items:

                       offset   mid_offset
                            |        |
                            |        |
                            v        v
          ____________________________________
         |           PTE Page Table           |
          ------------------------------------
                            <-------><------->
                             order-1  order-1

We currently only support mTHP collapse for max_ptes_none values of 0
and HPAGE_PMD_NR - 1, resulting in the following behavior:

    - max_ptes_none=0: Never introduce new empty pages during collapse
    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
      available mTHP order

Any other max_ptes_none value will emit a warning and skip mTHP collapse
attempts. There should be no behavior change for PMD collapse.

Once we determine which mTHP sizes fit best in the PMD range, a collapse
is attempted. A minimum collapse order of 2 is used, as this is the
lowest order supported by anon memory, as defined by THP_ORDERS_ALL_ANON.

mTHP collapses reject regions containing swapped out or shared pages.
This is because adding new entries can lead to new none pages, and these
may lead to constant promotion into a higher order (m)THP. A similar
issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
introducing at least 2x the number of pages, and on a future scan will
satisfy the promotion condition once again. This issue is prevented via
the collapse_max_ptes_none() function which imposes the max_ptes_none
restrictions above.

Currently madvise_collapse does not support mTHP collapse and will only
attempt PMD collapse.

We can also remove the check for is_khugepaged inside the PMD scan as
the collapse_max_ptes_none() function handles this logic now.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 189 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 180 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2fdfb6d42cf9..1c3711ed4513 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -99,6 +99,32 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+#define KHUGEPAGED_MIN_MTHP_ORDER	2
+/*
+ * The maximum number of mTHP ranges that can be stored on the stack.
+ * This is calculated based on the number of PTE entries in a PTE page table
+ * and the minimum mTHP order.
+ *
+ * ilog2(MAX_PTRS_PER_PTE) is log2 of the maximum number of PTE entries.
+ * This gives you the PMD_ORDER, and is needed in place of HPAGE_PMD_ORDER due
+ * to restrictions of some architectures (ie ppc64le).
+ *
+ * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges
+ */
+#define MTHP_STACK_SIZE	(1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+
+/*
+ * Defines a range of PTE entries in a PTE page table which are being
+ * considered for (m)THP collapse.
+ *
+ * @offset: the offset of the first PTE entry in a PMD range.
+ * @order: the order of the PTE entries being considered for collapse.
+ */
+struct mthp_range {
+	u16 offset;
+	u8 order;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -107,6 +133,11 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/* bitmap used for mTHP collapse */
+	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
+	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
 };
 
 /**
@@ -1361,17 +1392,138 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	return result;
 }
 
+static void mthp_stack_push(struct collapse_control *cc, int *stack_size,
+				   u16 offset, u8 order)
+{
+	const int size = *stack_size;
+	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
+
+	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
+	stack->order = order;
+	stack->offset = offset;
+	(*stack_size)++;
+}
+
+static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size)
+{
+	const int size = *stack_size;
+
+	VM_WARN_ON_ONCE(size <= 0);
+	(*stack_size)--;
+	return cc->mthp_bitmap_stack[size - 1];
+}
+
+static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc,
+						 u16 offset, unsigned long nr_pte_entries)
+{
+	bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+	bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries);
+	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+}
+
+/*
+ * mthp_collapse() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. The stack maintains a pair of
+ * variables (offset, order), indicating the number of PTEs from the start of
+ * the PMD, and the order of the potential collapse candidate respectively. We
+ * start at the PMD order and check if it is eligible for collapse; if not, we
+ * add two entries to the stack at a lower order to represent the left and right
+ * halves of the PTE page table we are examining.
+ *
+ *                  offset   mid_offset
+ *                       |        |
+ *                       |        |
+ *                       v        v
+ *      ---------------------------------------
+ *      |           cc->mthp_bitmap           |
+ *      ---------------------------------------
+ *                       <-------><------->
+ *                        order-1  order-1
+ *
+ * For each of these, we determine how many PTE entries are occupied in the
+ * range of PTE entries we propose to collapse, then we compare this to a
+ * threshold number of PTE entries which would need to be occupied for a
+ * collapse to be permitted at that order (accounting for max_ptes_none).
+ *
+ * If a collapse is permitted, we attempt to collapse the PTE range into a
+ * mTHP.
+ */
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		bool *mmap_locked, unsigned long enabled_orders)
+{
+	unsigned int max_ptes_none, nr_occupied_ptes;
+	struct mthp_range range;
+	unsigned long collapse_address;
+	int collapsed = 0, stack_size = 0;
+	unsigned long nr_pte_entries;
+	u16 offset;
+	u8 order;
+
+	mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
+
+	while (stack_size > 0) {
+		range = mthp_stack_pop(cc, &stack_size);
+		order = range.order;
+		offset = range.offset;
+		nr_pte_entries = 1UL << order;
+
+		if (!test_bit(order, &enabled_orders))
+			goto next_order;
+
+		if (cc->is_khugepaged)
+			max_ptes_none = collapse_max_ptes_none(order);
+		else
+			max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
+
+		if (max_ptes_none == -EINVAL)
+			return collapsed;
+
+		nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries);
+
+		if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) {
+			int ret;
+
+			collapse_address = address + offset * PAGE_SIZE;
+			ret = collapse_huge_page(mm, collapse_address, referenced,
+						 unmapped, cc, mmap_locked,
+						 order);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += nr_pte_entries;
+				continue;
+			}
+		}
+
+next_order:
+		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+			const u8 next_order = order - 1;
+			const u16 mid_offset = offset + (nr_pte_entries / 2);
+
+			mthp_stack_push(cc, &stack_size, mid_offset, next_order);
+			mthp_stack_push(cc, &stack_size, offset, next_order);
+		}
+	}
+	return collapsed;
+}
+
 static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
 		unsigned int *cur_progress, struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int none_or_zero = 0, shared = 0, referenced = 0;
+	int i;
+	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
 	struct page *page = NULL;
+	unsigned int max_ptes_none;
 	struct folio *folio = NULL;
 	unsigned long addr;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 
@@ -1384,8 +1536,21 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
+	bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
+
+	/*
+	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
+	 * scan all pages to populate the bitmap for mTHP collapse.
+	 */
+	if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))
+		max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER);
+	else
+		max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
+
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
 		if (cur_progress)
@@ -1394,17 +1559,18 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, addr += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		_pte = pte + i;
+		addr = start_addr + i * PAGE_SIZE;
+		pte_t pteval = ptep_get(_pte);
+
 		if (cur_progress)
 			*cur_progress += 1;
 
-		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			    none_or_zero <= max_ptes_none) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1478,6 +1644,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 			}
 		}
 
+		/* Set bit for occupied pages */
+		bitmap_set(cc->mthp_bitmap, i, 1);
 		/*
 		 * Record which node the original page is from and save this
 		 * information to cc->node_load[].
@@ -1534,9 +1702,12 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc, mmap_locked,
-					    HPAGE_PMD_ORDER);
+		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
+					      cc, mmap_locked, enabled_orders);
+		if (nr_collapsed > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (9 preceding siblings ...)
  2026-02-26  3:26 ` [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support Nico Pache
@ 2026-02-26  3:26 ` Nico Pache
  2026-02-26 16:26   ` Usama Arif
                     ` (2 more replies)
  2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
  2026-02-26  3:27 ` [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  12 siblings, 3 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:26 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1c3711ed4513..388d3f2537e2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
 			ret = collapse_huge_page(mm, collapse_address, referenced,
 						 unmapped, cc, mmap_locked,
 						 order);
-			if (ret == SCAN_SUCCEED) {
+
+			switch (ret) {
+			/* Cases where we continue to the next collapse candidate */
+			case SCAN_SUCCEED:
 				collapsed += nr_pte_entries;
+				fallthrough;
+			case SCAN_PTE_MAPPED_HUGEPAGE:
 				continue;
+			/* Cases where lower orders might still succeed */
+			case SCAN_LACK_REFERENCED_PAGE:
+			case SCAN_EXCEED_NONE_PTE:
+			case SCAN_EXCEED_SWAP_PTE:
+			case SCAN_EXCEED_SHARED_PTE:
+			case SCAN_PAGE_LOCK:
+			case SCAN_PAGE_COUNT:
+			case SCAN_PAGE_LRU:
+			case SCAN_PAGE_NULL:
+			case SCAN_DEL_PAGE_LRU:
+			case SCAN_PTE_NON_PRESENT:
+			case SCAN_PTE_UFFD_WP:
+			case SCAN_ALLOC_HUGE_PAGE_FAIL:
+				goto next_order;
+			/* Cases where no further collapse is possible */
+			case SCAN_CGROUP_CHARGE_FAIL:
+			case SCAN_COPY_MC:
+			case SCAN_ADDRESS_RANGE:
+			case SCAN_NO_PTE_TABLE:
+			case SCAN_ANY_PROCESS:
+			case SCAN_VMA_NULL:
+			case SCAN_VMA_CHECK:
+			case SCAN_SCAN_ABORT:
+			case SCAN_PAGE_ANON:
+			case SCAN_PMD_MAPPED:
+			case SCAN_FAIL:
+			default:
+				return collapsed;
 			}
 		}
 
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (10 preceding siblings ...)
  2026-02-26  3:26 ` [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2026-02-26  3:26 ` Nico Pache
  2026-02-26 15:53   ` Usama Arif
                     ` (3 more replies)
  2026-02-26  3:27 ` [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  12 siblings, 4 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:26 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

From: Baolin Wang <baolin.wang@linux.alibaba.com>

If any (m)THP order is enabled, we should allow khugepaged to run and
attempt to scan and collapse mTHPs. For khugepaged to operate when only
mTHP sizes are specified in sysfs, we must modify the predicate function
that determines whether it ought to run.

This function is currently called hugepage_pmd_enabled(); this patch
renames it to hugepage_enabled() and updates the logic to determine
whether any valid orders exist that would justify khugepaged running.

We must also update collapse_allowable_orders() to check all orders if
the vma is anonymous and the collapse is khugepaged.

After this patch khugepaged mTHP collapse is fully enabled.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 388d3f2537e2..e8bfcc1d0c9a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
 	 * hugepages, when configured in, are determined by the global control.
-	 * Anon pmd-sized hugepages are determined by the pmd-size control.
+	 * Anon hugepages are determined by their per-size mTHP controls.
 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
 	 */
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 	    hugepage_global_enabled())
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+	if (READ_ONCE(huge_anon_orders_always))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+	if (READ_ONCE(huge_anon_orders_madvise))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+	if (READ_ONCE(huge_anon_orders_inherit) &&
 	    hugepage_global_enabled())
 		return true;
 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
 static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
 			vm_flags_t vm_flags, bool is_khugepaged)
 {
+	unsigned long orders;
 	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
-	unsigned long orders = BIT(HPAGE_PMD_ORDER);
+
+	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
+	if (is_khugepaged && vma_is_anonymous(vma))
+		orders = THP_ORDERS_ALL_ANON;
+	else
+		orders = BIT(HPAGE_PMD_ORDER);
 
 	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
@@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    hugepage_pmd_enabled()) {
+	    hugepage_enabled()) {
 		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
 			__khugepaged_enter(vma->vm_mm);
 	}
@@ -2929,7 +2935,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
 
 static int khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
 }
 
 static int khugepaged_wait_event(void)
@@ -3002,7 +3008,7 @@ static void khugepaged_wait_work(void)
 		return;
 	}
 
-	if (hugepage_pmd_enabled())
+	if (hugepage_enabled())
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
 }
 
@@ -3033,7 +3039,7 @@ static void set_recommended_min_free_kbytes(void)
 	int nr_zones = 0;
 	unsigned long recommended_min;
 
-	if (!hugepage_pmd_enabled()) {
+	if (!hugepage_enabled()) {
 		calculate_min_free_kbytes();
 		goto update_wmarks;
 	}
@@ -3083,7 +3089,7 @@ int start_stop_khugepaged(void)
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled()) {
+	if (hugepage_enabled()) {
 		if (!khugepaged_thread)
 			khugepaged_thread = kthread_run(khugepaged, NULL,
 							"khugepaged");
@@ -3109,7 +3115,7 @@ int start_stop_khugepaged(void)
 void khugepaged_min_free_kbytes_update(void)
 {
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled() && khugepaged_thread)
+	if (hugepage_enabled() && khugepaged_thread)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse
  2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
                   ` (11 preceding siblings ...)
  2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
@ 2026-02-26  3:27 ` Nico Pache
  2026-03-17 11:02   ` Lorenzo Stoakes (Oracle)
  12 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-02-26  3:27 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, npache,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Bagas Sanjaya

Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to utilize this
feature.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 48 +++++++++++++---------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index eebb1f6bbc6c..67836c683e8d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,8 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages of either PMD size
+or mTHP sizes, if the system is configured to do so.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
 	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when any THP size is enabled
 (either of the per-size anon control or the top-level control are set
 to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
+all THP sizes are disabled (when both the per-size anon control and the
 top-level control are "never")
 
 process THP controls
@@ -264,11 +265,6 @@ support the following arguments::
 Khugepaged controls
 -------------------
 
-.. note::
-   khugepaged currently only searches for opportunities to collapse to
-   PMD-sized THP and no attempt is made to collapse to other THP
-   sizes.
-
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
 The khugepaged progress can be seen in the number of pages collapsed (note
 that this counter may not be an exact count of the number of pages
 collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
-being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
-be interpreted roughly as a sign of progress, and counters in /proc/vmstat
-consulted for more accurate accounting)::
+being replaced by a PMD mapping, or (2) physical pages replaced by one
+hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
+or together, depending on the type of memory and the failures that occur.
+As such, this value should be interpreted roughly as a sign of progress,
+and counters in /proc/vmstat consulted for more accurate accounting)::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
 
@@ -308,16 +304,19 @@ for each pass::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
 
-``max_ptes_none`` specifies how many extra small pages (that are
-not already mapped) can be allocated when collapsing a group
-of small pages into one large page::
+``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
+when collapsing a group of small pages into one large page::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
 
-A higher value leads to use additional memory for programs.
-A lower value leads to gain less thp performance. Value of
-max_ptes_none can waste cpu time very little, you can
-ignore it.
+For PMD-sized THP collapse, this directly limits the number of empty pages
+allowed in the 2MB region. For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1)
+are supported. Any other value will emit a warning and no mTHP collapse
+will be attempted.
+
+A higher value allows more empty pages, potentially leading to more memory
+usage but better THP performance. A lower value is more conservative and
+may result in fewer THP collapses.
 
 ``max_ptes_swap`` specifies how many pages can be brought in from
 swap when collapsing a group of pages into a transparent huge page::
@@ -337,6 +336,15 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+.. note::
+   For mTHP collapse, khugepaged does not support collapsing regions that
+   contain shared or swapped out pages, as this could lead to continuous
+   promotion to higher orders. The collapse will fail if any shared or
+   swapped PTEs are encountered during the scan.
+
+   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
+   and does not attempt mTHP collapses.
+
 Boot parameters
 ===============
 
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
@ 2026-02-26 15:53   ` Usama Arif
  2026-03-12 21:22   ` David Hildenbrand (Arm)
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Usama Arif @ 2026-02-26 15:53 UTC (permalink / raw)
  To: Nico Pache
  Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, 25 Feb 2026 20:26:50 -0700 Nico Pache <npache@redhat.com> wrote:

> From: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> If any order (m)THP is enabled we should allow running khugepaged to
> attempt scanning and collapsing mTHPs. In order for khugepaged to operate
> when only mTHP sizes are specified in sysfs, we must modify the predicate
> function that determines whether it ought to run to do so.
> 
> This function is currently called hugepage_pmd_enabled(), this patch
> renames it to hugepage_enabled() and updates the logic to check to
> determine whether any valid orders may exist which would justify
> khugepaged running.
> 
> We must also update collapse_allowable_orders() to check all orders if
> the vma is anonymous and the collapse is khugepaged.
> 
> After this patch khugepaged mTHP collapse is fully enabled.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 

Acked-by: Usama Arif <usama.arif@linux.dev>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-02-26  3:26 ` [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2026-02-26 16:26   ` Usama Arif
  2026-02-26 20:47     ` Nico Pache
  2026-03-12 21:19   ` David Hildenbrand (Arm)
  2026-03-17 10:35   ` Lorenzo Stoakes (Oracle)
  2 siblings, 1 reply; 45+ messages in thread
From: Usama Arif @ 2026-02-26 16:26 UTC (permalink / raw)
  To: Nico Pache
  Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, 25 Feb 2026 20:26:31 -0700 Nico Pache <npache@redhat.com> wrote:

> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1c3711ed4513..388d3f2537e2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>  			ret = collapse_huge_page(mm, collapse_address, referenced,
>  						 unmapped, cc, mmap_locked,
>  						 order);
> -			if (ret == SCAN_SUCCEED) {
> +
> +			switch (ret) {
> +			/* Cases where we continue to next collapse candidate */
> +			case SCAN_SUCCEED:
>  				collapsed += nr_pte_entries;
> +				fallthrough;
> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>  				continue;
> +			/* Cases where lower orders might still succeed */
> +			case SCAN_LACK_REFERENCED_PAGE:
> +			case SCAN_EXCEED_NONE_PTE:
> +			case SCAN_EXCEED_SWAP_PTE:
> +			case SCAN_EXCEED_SHARED_PTE:
> +			case SCAN_PAGE_LOCK:
> +			case SCAN_PAGE_COUNT:
> +			case SCAN_PAGE_LRU:
> +			case SCAN_PAGE_NULL:
> +			case SCAN_DEL_PAGE_LRU:
> +			case SCAN_PTE_NON_PRESENT:
> +			case SCAN_PTE_UFFD_WP:
> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +				goto next_order;
> +			/* Cases where no further collapse is possible */
> +			case SCAN_CGROUP_CHARGE_FAIL:

The only one that stands out to me is SCAN_CGROUP_CHARGE_FAIL. memcg charging
of a higher-order folio might fail while a lower-order folio might still pass?
That said, if the cgroup is that tight, continuing the collapse work may not
be productive.

Acked-by: Usama Arif <usama.arif@linux.dev>

> +			case SCAN_COPY_MC:
> +			case SCAN_ADDRESS_RANGE:
> +			case SCAN_NO_PTE_TABLE:
> +			case SCAN_ANY_PROCESS:
> +			case SCAN_VMA_NULL:
> +			case SCAN_VMA_CHECK:
> +			case SCAN_SCAN_ABORT:
> +			case SCAN_PAGE_ANON:
> +			case SCAN_PMD_MAPPED:
> +			case SCAN_FAIL:
> +			default:
> +				return collapsed;
>  			}
>  		}
>  
> -- 
> 2.53.0
> 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-02-26 16:26   ` Usama Arif
@ 2026-02-26 20:47     ` Nico Pache
  0 siblings, 0 replies; 45+ messages in thread
From: Nico Pache @ 2026-02-26 20:47 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Thu, Feb 26, 2026 at 9:27 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> On Wed, 25 Feb 2026 20:26:31 -0700 Nico Pache <npache@redhat.com> wrote:
>
> > There are cases where, if an attempted collapse fails, all subsequent
> > orders are guaranteed to also fail. Avoid these collapse attempts by
> > bailing out early.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 1c3711ed4513..388d3f2537e2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >                       ret = collapse_huge_page(mm, collapse_address, referenced,
> >                                                unmapped, cc, mmap_locked,
> >                                                order);
> > -                     if (ret == SCAN_SUCCEED) {
> > +
> > +                     switch (ret) {
> > +                     /* Cases where we continue to next collapse candidate */
> > +                     case SCAN_SUCCEED:
> >                               collapsed += nr_pte_entries;
> > +                             fallthrough;
> > +                     case SCAN_PTE_MAPPED_HUGEPAGE:
> >                               continue;
> > +                     /* Cases where lower orders might still succeed */
> > +                     case SCAN_LACK_REFERENCED_PAGE:
> > +                     case SCAN_EXCEED_NONE_PTE:
> > +                     case SCAN_EXCEED_SWAP_PTE:
> > +                     case SCAN_EXCEED_SHARED_PTE:
> > +                     case SCAN_PAGE_LOCK:
> > +                     case SCAN_PAGE_COUNT:
> > +                     case SCAN_PAGE_LRU:
> > +                     case SCAN_PAGE_NULL:
> > +                     case SCAN_DEL_PAGE_LRU:
> > +                     case SCAN_PTE_NON_PRESENT:
> > +                     case SCAN_PTE_UFFD_WP:
> > +                     case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > +                             goto next_order;
> > +                     /* Cases where no further collapse is possible */
> > +                     case SCAN_CGROUP_CHARGE_FAIL:
>
> The only one that stands out to me is SCAN_CGROUP_CHARGE_FAIL. memcg charging
> of a higher-order folio might fail while a lower-order folio might still pass?
> That said, if the cgroup is that tight, continuing the collapse work may not
> be productive.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>

Thanks! IIRC, David and I discussed all of these off-list to confirm
their placement. I had this in the 'next_order' case at some point, and
David recommended moving it to the "fail" cases for the same reason you
state here: collapsing or charging large-order pages in such a tight
cgroup is likely unproductive and not worth the effort.

In contrast, SCAN_ALLOC_HUGE_PAGE_FAIL does not necessarily indicate a
resource constraint, though it can. We might fail to allocate an order-N
folio due to fragmentation, yet easily find an order-(N-1) one. We could
also hit a scenario where a genuine lack of memory causes the failure
and we iterate all the way down, which would be unproductive. However,
at that point the OOM reaper should be active and the system will
already be cornered in multiple ways, so it should be ok.

Hopefully that gives some insight into the decisions made here :)

Cheers,
-- Nico

>
> > +                     case SCAN_COPY_MC:
> > +                     case SCAN_ADDRESS_RANGE:
> > +                     case SCAN_NO_PTE_TABLE:
> > +                     case SCAN_ANY_PROCESS:
> > +                     case SCAN_VMA_NULL:
> > +                     case SCAN_VMA_CHECK:
> > +                     case SCAN_SCAN_ABORT:
> > +                     case SCAN_PAGE_ANON:
> > +                     case SCAN_PMD_MAPPED:
> > +                     case SCAN_FAIL:
> > +                     default:
> > +                             return collapsed;
> >                       }
> >               }
> >
> > --
> > 2.53.0
> >
> >
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2026-02-26  3:22 ` [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for " Nico Pache
@ 2026-03-12 20:00   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 20:00 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:22, Nico Pache wrote:
> For khugepaged to support different mTHP orders, we must generalize this
> to check if the PMD is not shared by another VMA and that the order is
> enabled.
> 
> No functional change in this patch. Also correct a comment about the
> functionality of the revalidation.
> 
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 02/13] mm/khugepaged: generalize alloc_charge_folio()
  2026-02-26  3:23 ` [PATCH mm-unstable v15 02/13] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2026-03-12 20:05   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 20:05 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:23, Nico Pache wrote:
> From: Dev Jain <dev.jain@arm.com>
> 
> Pass order to alloc_charge_folio() and update mTHP statistics.
> 
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-02-26  3:23 ` [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2026-03-12 20:32   ` David Hildenbrand (Arm)
  2026-03-12 20:36     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 20:32 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:23, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
> 
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
> 
> No functional changes in this patch.
> 
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 47 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a9b645402b7f..ecdbbf6a01a6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -535,7 +535,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  
>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> -		struct list_head *compound_pagelist)
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
> @@ -543,15 +543,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
> +	const unsigned long nr_pages = 1UL << order;
> +	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

It might be a bit more readable to move "const unsigned long
nr_pages = 1UL << order;" all the way to the top.

Then, have here

	int max_ptes_none = 0;

and do at the beginning of the function:

	/* For MADV_COLLAPSE, we always collapse ... */
	if (!cc->is_khugepaged)
		max_ptes_none = HPAGE_PMD_NR;
	/* ... except if userfaultfd relies on MISSING faults. */
	if (!userfaultfd_armed(vma))
		max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

(but see below regarding helper function)

then the code below becomes ...

>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + nr_pages;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			++none_or_zero;
>  			if (!userfaultfd_armed(vma) &&
>  			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			     none_or_zero <= max_ptes_none)) {

...

	if (none_or_zero <= max_ptes_none) {


I see that you do something like that (but slightly different) in the next
patch. You could easily extend the above by it.

Or go one step further and move all of that conditional logic into
collapse_max_ptes_none(), whereby you simply also pass the cc and the vma.

Then this all gets cleaned up and you'd end up above with

max_ptes_none = collapse_max_ptes_none(cc, vma, order);
if (max_ptes_none < 0)
	return result;

I'd do all that in this patch here, getting rid of #4.


>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -585,8 +587,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
>  			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */
> +			if (!is_pmd_order(order) || (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared)) {

That's not how we indent within a nested ().

To make this easier to read, what about similarly having at the beginning
of the function:

int max_ptes_shared = 0;

/* For MADV_COLLAPSE, we always collapse. */
if (!cc->is_khugepaged)
	max_ptes_shared = HPAGE_PMD_NR;
/* TODO ... */
if (is_pmd_order(order))
	max_ptes_shared = khugepaged_max_ptes_shared;

to turn this code into a

	if (shared > max_ptes_shared)

Also, here, it might make sense to have a collapse_max_ptes_shared(cc, order)
to do that and clean it up.


>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>  				goto out;
> @@ -679,18 +687,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  }
>  
>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> -						struct vm_area_struct *vma,
> -						unsigned long address,
> -						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +		struct vm_area_struct *vma, unsigned long address,
> +		spinlock_t *ptl, unsigned int order,
> +		struct list_head *compound_pagelist)
>  {
> -	unsigned long end = address + HPAGE_PMD_SIZE;
> +	unsigned long end = address + (PAGE_SIZE << order);
>  	struct folio *src, *tmp;
>  	pte_t pteval;
>  	pte_t *_pte;
>  	unsigned int nr_ptes;
> +	const unsigned long nr_pages = 1UL << order;

Move it further to the top.

>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>  	     address += nr_ptes * PAGE_SIZE) {
>  		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
> @@ -743,13 +751,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  }
>  
>  static void __collapse_huge_page_copy_failed(pte_t *pte,
> -					     pmd_t *pmd,
> -					     pmd_t orig_pmd,
> -					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
>  	spinlock_t *pmd_ptl;
> -
> +	const unsigned long nr_pages = 1UL << order;
>  	/*
>  	 * Re-establish the PMD to point to the original page table
>  	 * entry. Restoring PMD needs to be done prior to releasing
> @@ -763,7 +769,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>  }
>  
>  /*
> @@ -783,16 +789,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   */
>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> -		unsigned long address, spinlock_t *ptl,
> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>  		struct list_head *compound_pagelist)
>  {
>  	unsigned int i;
>  	enum scan_result result = SCAN_SUCCEED;
> -
> +	const unsigned long nr_pages = 1UL << order;

Same here, all the way to the top.

>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -811,10 +817,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>  
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    order, compound_pagelist);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 order, compound_pagelist);
>  
>  	return result;
>  }
> @@ -985,12 +991,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>   */
>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> -		int referenced)
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		pmd_t *pmd, int referenced, unsigned int order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>  	enum scan_result result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1022,6 +1028,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		    pte_present(vmf.orig_pte))
>  			continue;
>  
> +		/*
> +		 * TODO: Support swapin without leading to further mTHP
> +		 * collapses. Currently bringing in new pages via swapin may
> +		 * cause a future higher order collapse on a rescan of the same
> +		 * range.
> +		 */
> +		if (!is_pmd_order(order)) {
> +			pte_unmap(pte);
> +			mmap_read_unlock(mm);
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}
> +

Interesting, we just swap in everything we find :)

But do we really need this check here? I mean, we just found it to be present.

In the rare event that there was a race, do we really care? It was just
present, now it's swapped. Bad luck. Just swap it in.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-03-12 20:32   ` David Hildenbrand (Arm)
@ 2026-03-12 20:36     ` David Hildenbrand (Arm)
  2026-03-12 20:56       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 20:36 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 3/12/26 21:32, David Hildenbrand (Arm) wrote:
> On 2/26/26 04:23, Nico Pache wrote:
>> generalize the order of the __collapse_huge_page_* functions
>> to support future mTHP collapse.
>>
>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>> shared or swapped entry.
>>
>> No functional changes in this patch.
>>
>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>  mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
>>  1 file changed, 47 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index a9b645402b7f..ecdbbf6a01a6 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -535,7 +535,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>>  
>>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>> -		struct list_head *compound_pagelist)
>> +		unsigned int order, struct list_head *compound_pagelist)
>>  {
>>  	struct page *page = NULL;
>>  	struct folio *folio = NULL;
>> @@ -543,15 +543,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  	pte_t *_pte;
>>  	int none_or_zero = 0, shared = 0, referenced = 0;
>>  	enum scan_result result = SCAN_FAIL;
>> +	const unsigned long nr_pages = 1UL << order;
>> +	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> 
> It might be a bit more readable to move "const unsigned long
> nr_pages = 1UL << order;" all the way to the top.
> 
> Then, have here
> 
> 	int max_ptes_none = 0;
> 
> and do at the beginning of the function:
> 
> 	/* For MADV_COLLAPSE, we always collapse ... */
> 	if (!cc->is_khugepaged)
> 		max_ptes_none = HPAGE_PMD_NR;
> 	/* ... except if userfaultfd relies on MISSING faults. */
> 	if (!userfaultfd_armed(vma))
> 		max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> 
> (but see below regarding helper function)
> 
> then the code below becomes ...
> 
>>  
>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +	for (_pte = pte; _pte < pte + nr_pages;
>>  	     _pte++, addr += PAGE_SIZE) {
>>  		pte_t pteval = ptep_get(_pte);
>>  		if (pte_none_or_zero(pteval)) {
>>  			++none_or_zero;
>>  			if (!userfaultfd_armed(vma) &&
>>  			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> +			     none_or_zero <= max_ptes_none)) {
> 
> ...
> 
> 	if (none_or_zero <= max_ptes_none) {
> 
> 
> I see that you do something like that (but slightly different) in the next
> patch. You could easily extend the above by it.
> 
> Or go one step further and move all of that conditional into collapse_max_ptes_none(), whereby
> you simply also pass the cc and the vma.
> 
> Then this all gets cleaned up and you'd end up above with
> 
> max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> if (max_ptes_none < 0)
> 	return result;
> 
> I'd do all that in this patch here, getting rid of #4.
> 
> 
>>  				continue;
>>  			} else {
>>  				result = SCAN_EXCEED_NONE_PTE;
>> @@ -585,8 +587,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  		/* See collapse_scan_pmd(). */
>>  		if (folio_maybe_mapped_shared(folio)) {
>>  			++shared;
>> -			if (cc->is_khugepaged &&
>> -			    shared > khugepaged_max_ptes_shared) {
>> +			/*
>> +			 * TODO: Support shared pages without leading to further
>> +			 * mTHP collapses. Currently bringing in new pages via
>> +			 * shared may cause a future higher order collapse on a
>> +			 * rescan of the same range.
>> +			 */
>> +			if (!is_pmd_order(order) || (cc->is_khugepaged &&
>> +			    shared > khugepaged_max_ptes_shared)) {
> 
> That's not how we indent within a nested ().
> 
> To make this easier to read, what about similarly having at the beginning
> of the function:
> 
> int max_ptes_shared = 0;
> 
> /* For MADV_COLLAPSE, we always collapse. */
> if (!cc->is_khugepaged)
> 	max_ptes_shared = HPAGE_PMD_NR;
> /* TODO ... */
> if (is_pmd_order(order))
> 	max_ptes_shared = khugepaged_max_ptes_shared;
> 
> to turn this code into a
> 
> 	if (shared > max_ptes_shared)
> 
> Also, here, it might make sense to have a collapse_max_ptes_shared(cc, order)
> to do that and clean it up.
> 
> 
>>  				result = SCAN_EXCEED_SHARED_PTE;
>>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>  				goto out;
>> @@ -679,18 +687,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  }
>>  
>>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>> -						struct vm_area_struct *vma,
>> -						unsigned long address,
>> -						spinlock_t *ptl,
>> -						struct list_head *compound_pagelist)
>> +		struct vm_area_struct *vma, unsigned long address,
>> +		spinlock_t *ptl, unsigned int order,
>> +		struct list_head *compound_pagelist)
>>  {
>> -	unsigned long end = address + HPAGE_PMD_SIZE;
>> +	unsigned long end = address + (PAGE_SIZE << order);
>>  	struct folio *src, *tmp;
>>  	pte_t pteval;
>>  	pte_t *_pte;
>>  	unsigned int nr_ptes;
>> +	const unsigned long nr_pages = 1UL << order;
> 
> Move it further to the top.
> 
>>  
>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>>  	     address += nr_ptes * PAGE_SIZE) {
>>  		nr_ptes = 1;
>>  		pteval = ptep_get(_pte);
>> @@ -743,13 +751,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>  }
>>  
>>  static void __collapse_huge_page_copy_failed(pte_t *pte,
>> -					     pmd_t *pmd,
>> -					     pmd_t orig_pmd,
>> -					     struct vm_area_struct *vma,
>> -					     struct list_head *compound_pagelist)
>> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> +		unsigned int order, struct list_head *compound_pagelist)
>>  {
>>  	spinlock_t *pmd_ptl;
>> -
>> +	const unsigned long nr_pages = 1UL << order;
>>  	/*
>>  	 * Re-establish the PMD to point to the original page table
>>  	 * entry. Restoring PMD needs to be done prior to releasing
>> @@ -763,7 +769,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>  	 * Release both raw and compound pages isolated
>>  	 * in __collapse_huge_page_isolate.
>>  	 */
>> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>>  }
>>  
>>  /*
>> @@ -783,16 +789,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>   */
>>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> -		unsigned long address, spinlock_t *ptl,
>> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>>  		struct list_head *compound_pagelist)
>>  {
>>  	unsigned int i;
>>  	enum scan_result result = SCAN_SUCCEED;
>> -
>> +	const unsigned long nr_pages = 1UL << order;
> 
> Same here, all the way to the top.
> 
>>  	/*
>>  	 * Copying pages' contents is subject to memory poison at any iteration.
>>  	 */
>> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
>> +	for (i = 0; i < nr_pages; i++) {
>>  		pte_t pteval = ptep_get(pte + i);
>>  		struct page *page = folio_page(folio, i);
>>  		unsigned long src_addr = address + i * PAGE_SIZE;
>> @@ -811,10 +817,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>>  
>>  	if (likely(result == SCAN_SUCCEED))
>>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>> -						    compound_pagelist);
>> +						    order, compound_pagelist);
>>  	else
>>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>> -						 compound_pagelist);
>> +						 order, compound_pagelist);
>>  
>>  	return result;
>>  }
>> @@ -985,12 +991,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>>   */
>>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
>> -		int referenced)
>> +		struct vm_area_struct *vma, unsigned long start_addr,
>> +		pmd_t *pmd, int referenced, unsigned int order)
>>  {
>>  	int swapped_in = 0;
>>  	vm_fault_t ret = 0;
>> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
>> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>>  	enum scan_result result;
>>  	pte_t *pte = NULL;
>>  	spinlock_t *ptl;
>> @@ -1022,6 +1028,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>>  		    pte_present(vmf.orig_pte))
>>  			continue;
>>  
>> +		/*
>> +		 * TODO: Support swapin without leading to further mTHP
>> +		 * collapses. Currently bringing in new pages via swapin may
>> +		 * cause a future higher order collapse on a rescan of the same
>> +		 * range.
>> +		 */
>> +		if (!is_pmd_order(order)) {
>> +			pte_unmap(pte);
>> +			mmap_read_unlock(mm);
>> +			result = SCAN_EXCEED_SWAP_PTE;
>> +			goto out;
>> +		}
>> +
> 
> Interesting, we just swap in everything we find :)
> 
> But do we really need this check here? I mean, we just found it to be present.
> 
> In the rare event that there was a race, do we really care? It was just
> present, now it's swapped. Bad luck. Just swap it in.
> 

Okay, now I am confused. Why are you not taking care of
collapse_scan_pmd() in the same context?

Because if you make sure that we properly check against a max_ptes_swap
in a style similar to the above, we'd rule out the swapin right from the
start?

Also, I would expect that all other parameters in there are similarly
handled?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-03-12 20:36     ` David Hildenbrand (Arm)
@ 2026-03-12 20:56       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 20:56 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 3/12/26 21:36, David Hildenbrand (Arm) wrote:
> On 3/12/26 21:32, David Hildenbrand (Arm) wrote:
>> On 2/26/26 04:23, Nico Pache wrote:
>>> generalize the order of the __collapse_huge_page_* functions
>>> to support future mTHP collapse.
>>>
>>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>>> shared or swapped entry.
>>>
>>> No functional changes in this patch.
>>>
>>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>  mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
>>>  1 file changed, 47 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index a9b645402b7f..ecdbbf6a01a6 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -535,7 +535,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>>>  
>>>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>>> -		struct list_head *compound_pagelist)
>>> +		unsigned int order, struct list_head *compound_pagelist)
>>>  {
>>>  	struct page *page = NULL;
>>>  	struct folio *folio = NULL;
>>> @@ -543,15 +543,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  	pte_t *_pte;
>>>  	int none_or_zero = 0, shared = 0, referenced = 0;
>>>  	enum scan_result result = SCAN_FAIL;
>>> +	const unsigned long nr_pages = 1UL << order;
>>> +	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>>
>> It might be a bit more readable to move "const unsigned long
>> nr_pages = 1UL << order;" all the way to the top.
>>
>> Then, have here
>>
>> 	int max_ptes_none = 0;
>>
>> and do at the beginning of the function:
>>
>> 	/* For MADV_COLLAPSE, we always collapse ... */
>> 	if (!cc->is_khugepaged)
>> 		max_ptes_none = HPAGE_PMD_NR;
>> 	/*  ... except if userfaultfd relies on MISSING faults. */
>> 	if (!userfaultfd_armed(vma))
>> 		max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>>
>> (but see below regarding helper function)
>>
>> then the code below becomes ...
>>
>>>  
>>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> +	for (_pte = pte; _pte < pte + nr_pages;
>>>  	     _pte++, addr += PAGE_SIZE) {
>>>  		pte_t pteval = ptep_get(_pte);
>>>  		if (pte_none_or_zero(pteval)) {
>>>  			++none_or_zero;
>>>  			if (!userfaultfd_armed(vma) &&
>>>  			    (!cc->is_khugepaged ||
>>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>>> +			     none_or_zero <= max_ptes_none)) {
>>
>> ...
>>
>> 	if (none_or_zero <= max_ptes_none) {
>>
>>
>> I see that you do something like that (but slightly different) in the next
>> patch. You could easily extend the above by it.
>>
>> Or go one step further and move all of that conditional into collapse_max_ptes_none(), whereby
>> you simply also pass the cc and the vma.
>>
>> Then this all gets cleaned up and you'd end up above with
>>
>> max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>> if (max_ptes_none < 0)
>> 	return result;
>>
>> I'd do all that in this patch here, getting rid of #4.
>>
>>
>>>  				continue;
>>>  			} else {
>>>  				result = SCAN_EXCEED_NONE_PTE;
>>> @@ -585,8 +587,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  		/* See collapse_scan_pmd(). */
>>>  		if (folio_maybe_mapped_shared(folio)) {
>>>  			++shared;
>>> -			if (cc->is_khugepaged &&
>>> -			    shared > khugepaged_max_ptes_shared) {
>>> +			/*
>>> +			 * TODO: Support shared pages without leading to further
>>> +			 * mTHP collapses. Currently bringing in new pages via
>>> +			 * shared may cause a future higher order collapse on a
>>> +			 * rescan of the same range.
>>> +			 */
>>> +			if (!is_pmd_order(order) || (cc->is_khugepaged &&
>>> +			    shared > khugepaged_max_ptes_shared)) {
>>
>> That's not how we indent within a nested ().
>>
>> To make this easier to read, what about similarly having at the beginning
>> of the function:
>>
>> int max_ptes_shared = 0;
>>
>> /* For MADV_COLLAPSE, we always collapse. */
>> if (!cc->is_khugepaged)
>> 	max_ptes_shared = HPAGE_PMD_NR;
>> /* TODO ... */
>> if (is_pmd_order(order))
>> 	max_ptes_shared = khugepaged_max_ptes_shared;
>>
>> to turn this code into a
>>
>> 	if (shared > max_ptes_shared)
>>
>> Also, here, might make sense to have a collapse_max_ptes_shared(cc, order)
>> to do that and clean it up.
>>
>>
>>>  				result = SCAN_EXCEED_SHARED_PTE;
>>>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>>  				goto out;
>>> @@ -679,18 +687,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  }
>>>  
>>>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>> -						struct vm_area_struct *vma,
>>> -						unsigned long address,
>>> -						spinlock_t *ptl,
>>> -						struct list_head *compound_pagelist)
>>> +		struct vm_area_struct *vma, unsigned long address,
>>> +		spinlock_t *ptl, unsigned int order,
>>> +		struct list_head *compound_pagelist)
>>>  {
>>> -	unsigned long end = address + HPAGE_PMD_SIZE;
>>> +	unsigned long end = address + (PAGE_SIZE << order);
>>>  	struct folio *src, *tmp;
>>>  	pte_t pteval;
>>>  	pte_t *_pte;
>>>  	unsigned int nr_ptes;
>>> +	const unsigned long nr_pages = 1UL << order;
>>
>> Move it further to the top.
>>
>>>  
>>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>>> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>>>  	     address += nr_ptes * PAGE_SIZE) {
>>>  		nr_ptes = 1;
>>>  		pteval = ptep_get(_pte);
>>> @@ -743,13 +751,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>  }
>>>  
>>>  static void __collapse_huge_page_copy_failed(pte_t *pte,
>>> -					     pmd_t *pmd,
>>> -					     pmd_t orig_pmd,
>>> -					     struct vm_area_struct *vma,
>>> -					     struct list_head *compound_pagelist)
>>> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>>> +		unsigned int order, struct list_head *compound_pagelist)
>>>  {
>>>  	spinlock_t *pmd_ptl;
>>> -
>>> +	const unsigned long nr_pages = 1UL << order;
>>>  	/*
>>>  	 * Re-establish the PMD to point to the original page table
>>>  	 * entry. Restoring PMD needs to be done prior to releasing
>>> @@ -763,7 +769,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>>  	 * Release both raw and compound pages isolated
>>>  	 * in __collapse_huge_page_isolate.
>>>  	 */
>>> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>>> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>>>  }
>>>  
>>>  /*
>>> @@ -783,16 +789,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>>   */
>>>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>>> -		unsigned long address, spinlock_t *ptl,
>>> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>>>  		struct list_head *compound_pagelist)
>>>  {
>>>  	unsigned int i;
>>>  	enum scan_result result = SCAN_SUCCEED;
>>> -
>>> +	const unsigned long nr_pages = 1UL << order;
>>
>> Same here, all the way to the top.
>>
>>>  	/*
>>>  	 * Copying pages' contents is subject to memory poison at any iteration.
>>>  	 */
>>> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
>>> +	for (i = 0; i < nr_pages; i++) {
>>>  		pte_t pteval = ptep_get(pte + i);
>>>  		struct page *page = folio_page(folio, i);
>>>  		unsigned long src_addr = address + i * PAGE_SIZE;
>>> @@ -811,10 +817,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>>>  
>>>  	if (likely(result == SCAN_SUCCEED))
>>>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>>> -						    compound_pagelist);
>>> +						    order, compound_pagelist);
>>>  	else
>>>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>>> -						 compound_pagelist);
>>> +						 order, compound_pagelist);
>>>  
>>>  	return result;
>>>  }
>>> @@ -985,12 +991,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>>>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>>>   */
>>>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>>> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
>>> -		int referenced)
>>> +		struct vm_area_struct *vma, unsigned long start_addr,
>>> +		pmd_t *pmd, int referenced, unsigned int order)
>>>  {
>>>  	int swapped_in = 0;
>>>  	vm_fault_t ret = 0;
>>> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
>>> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>>>  	enum scan_result result;
>>>  	pte_t *pte = NULL;
>>>  	spinlock_t *ptl;
>>> @@ -1022,6 +1028,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>>>  		    pte_present(vmf.orig_pte))
>>>  			continue;
>>>  
>>> +		/*
>>> +		 * TODO: Support swapin without leading to further mTHP
>>> +		 * collapses. Currently bringing in new pages via swapin may
>>> +		 * cause a future higher order collapse on a rescan of the same
>>> +		 * range.
>>> +		 */
>>> +		if (!is_pmd_order(order)) {
>>> +			pte_unmap(pte);
>>> +			mmap_read_unlock(mm);
>>> +			result = SCAN_EXCEED_SWAP_PTE;
>>> +			goto out;
>>> +		}
>>> +
>>
>> Interesting, we just swap in everything we find :)
>>
>> But do we really need this check here? I mean, we just found it to be present.
>>
>> In the rare event that there was a race, do we really care? It was just
>> present, now it's swapped. Bad luck. Just swap it in.
>>
> 
> Okay, now I am confused. Why are you not taking care of
> collapse_scan_pmd() in the same context?
> 
> Because if you make sure that we properly check against a max_ptes_swap
> similar as in the style above, we'd rule out swapin right from the start?
> 
> Also, I would expect that all other parameters in there are similarly
> handled?
> 

Okay, I think you should add the following:

From 17bce81ab93f3b16e044ac2f4f62be19aac38180 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Thu, 12 Mar 2026 21:54:22 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/khugepaged.c | 89 +++++++++++++++++++++++++++++--------------------
 1 file changed, 53 insertions(+), 36 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7b4680d27ab..6a3773bfa0a2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -318,6 +318,34 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj,
 	return count;
 }
 
+static int collapse_max_ptes_none(struct collapse_control *cc,
+		struct vm_area_struct *vma)
+{
+	/* We don't mess with MISSING faults. */
+	if (vma && userfaultfd_armed(vma))
+		return 0;
+	/* MADV_COLLAPSE always collapses. */
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_none;
+}
+
+static int collapse_max_ptes_shared(struct collapse_control *cc)
+{
+	/* MADV_COLLAPSE always collapses. */
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_shared;
+}
+
+static int collapse_max_ptes_swap(struct collapse_control *cc)
+{
+	/* MADV_COLLAPSE always collapses. */
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_swap;
+}
+
 static struct kobj_attribute khugepaged_max_ptes_shared_attr =
 	__ATTR_RW(max_ptes_shared);
 
@@ -539,6 +567,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
 		struct list_head *compound_pagelist)
 {
+	const int max_ptes_none = collapse_max_ptes_none(cc, vma);
+	const int max_ptes_shared = collapse_max_ptes_shared(cc);
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr = start_addr;
@@ -550,16 +580,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out;
 			}
+			continue;
 		}
 		if (!pte_present(pteval)) {
 			result = SCAN_PTE_NON_PRESENT;
@@ -586,9 +612,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -1247,6 +1271,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		bool *mmap_locked, struct collapse_control *cc)
 {
+	const int max_ptes_none = collapse_max_ptes_none(cc, vma);
+	const int max_ptes_swap = collapse_max_ptes_swap(cc);
+	const int max_ptes_shared = collapse_max_ptes_shared(cc);
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
@@ -1280,36 +1307,28 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
+			continue;
 		}
 		if (!pte_present(pteval)) {
-			++unmapped;
-			if (!cc->is_khugepaged ||
-			    unmapped <= khugepaged_max_ptes_swap) {
-				/*
-				 * Always be strict with uffd-wp
-				 * enabled swap entries.  Please see
-				 * comment below for pte_uffd_wp().
-				 */
-				if (pte_swp_uffd_wp_any(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
-					goto out_unmap;
-				}
-				continue;
-			} else {
+			if (++unmapped > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
+			/*
+			 * Always be strict with uffd-wp enabled swap entries.
+			 * See the comment below for pte_uffd_wp().
+			 */
+			if (pte_swp_uffd_wp_any(pteval)) {
+				result = SCAN_PTE_UFFD_WP;
+				goto out_unmap;
+			}
+			continue;
 		}
 		if (pte_uffd_wp(pteval)) {
 			/*
@@ -1348,9 +1367,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * is shared.
 		 */
 		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out_unmap;
@@ -2305,6 +2322,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
 		unsigned long addr, struct file *file, pgoff_t start,
 		struct collapse_control *cc)
 {
+	const int max_ptes_none = collapse_max_ptes_none(cc, NULL);
+	const int max_ptes_swap = collapse_max_ptes_swap(cc);
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, start);
@@ -2323,8 +2342,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
 
 		if (xa_is_value(folio)) {
 			swap += 1 << xas_get_order(&xas);
-			if (cc->is_khugepaged &&
-			    swap > khugepaged_max_ptes_swap) {
+			if (swap > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2395,8 +2413,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
 		cc->progress += HPAGE_PMD_NR;
 
 	if (result == SCAN_SUCCEED) {
-		if (cc->is_khugepaged &&
-		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - max_ptes_none) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-- 
2.43.0


Then extend it by passing an order + return value check in this patch here. You can
directly squash changes from patch #4 in here then.

-- 
Cheers,

David


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders
  2026-02-26  3:24 ` [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2026-03-12 21:00   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:00 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:24, Nico Pache wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. Skip these cases until we have a way to check
> if it's ok to collapse to a smaller mTHP size (like in the case of a
> partially mapped folio).
> 
> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
> 
> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fb3ba8fe5a6c..c739f26dd61e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -638,6 +638,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  				goto out;
>  			}
>  		}
> +		/*
> +		 * TODO: In some cases of partially-mapped folios, we'd actually
> +		 * want to collapse.
> +		 */
> +		if (!is_pmd_order(order) && folio_order(folio) >= order) {
> +			result = SCAN_PTE_MAPPED_HUGEPAGE;
> +			goto out;
> +		}
>  
>  		if (folio_test_large(folio)) {
>  			struct folio *f;

Why aren't we doing the same in hpage_collapse_scan_pmd() ?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
  2026-02-26  3:25 ` [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2026-03-12 21:03   ` David Hildenbrand (Arm)
  2026-03-17 17:05   ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:03 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:25, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> 
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> 	PTEs
> 
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
> 
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
>   	PTEs
> 
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
> 
> As we currently don't support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 16 ++++++++++++---
>  4 files changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..eebb1f6bbc6c 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,30 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>  
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold. For mTHP collapse, currently only max_ptes_none
> +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
> +       emit a warning and no mTHP collapse will be attempted. khugepaged will
> +       try to collapse to the largest enabled (m)THP size; if it fails, it will
> +       try the next lower enabled mTHP size. This counter records the number of
> +       times a collapse attempt was skipped for exceeding the max_ptes_none
> +       threshold, and khugepaged will move on to the next available mTHP size.
> +
> +collapse_exceed_swap_pte
> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> +       to containing at least one swap PTE. Currently khugepaged does not
> +       support collapsing mTHP regions that contain a swap PTE. This counter can
> +       be used to monitor the number of khugepaged mTHP collapses that failed
> +       due to the presence of a swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> +       to containing at least one shared PTE. Currently khugepaged does not
> +       support collapsing mTHP PTE ranges that contain a shared PTE. This
> +       counter can be used to monitor the number of khugepaged mTHP collapses
> +       that failed due to the presence of a shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9941fc6d7bd8..e8777bb2347d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 228f35e962b9..1049a207a257 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +
>  
>  static struct attribute *anon_stats_attrs[] = {
>  	&anon_fault_alloc_attr.attr,
> @@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
>  	&split_deferred_attr.attr,
>  	&nr_anon_attr.attr,
>  	&nr_anon_partially_mapped_attr.attr,
> +	&collapse_exceed_swap_pte_attr.attr,
> +	&collapse_exceed_none_pte_attr.attr,
> +	&collapse_exceed_shared_pte_attr.attr,
>  	NULL,
>  };
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c739f26dd61e..a6cf90e09e4a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out;
>  			}
>  		}
> @@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			 * shared may cause a future higher order collapse on a
>  			 * rescan of the same range.
>  			 */
> -			if (!is_pmd_order(order) || (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared)) {
> +			if (!is_pmd_order(order)) {
> +				result = SCAN_EXCEED_SHARED_PTE;
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +				goto out;
> +			}
> +
> +			if (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;

With the suggested earlier rework, this should hopefully become simply

if (++shared > max_ptes_shared) {
	result = SCAN_EXCEED_SHARED_PTE;
	if (is_pmd_order(order))
		count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
	count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
}

With that (no code duplication) LGTM.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 08/13] mm/khugepaged: improve tracepoints for mTHP orders
  2026-02-26  3:25 ` [PATCH mm-unstable v15 08/13] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2026-03-12 21:05   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:05 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:25, Nico Pache wrote:
> Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
> give better insight into what order is being operated on.
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-02-26  3:25 ` [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2026-03-12 21:09   ` David Hildenbrand (Arm)
  2026-03-17 17:08   ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:09 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:25, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
> 
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2e66d660ee8e..2fdfb6d42cf9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -486,12 +486,22 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
>  	return -EINVAL;
>  }
>  
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +			vm_flags_t vm_flags, bool is_khugepaged)

Nit: one tab too much

> +{
> +	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -2637,7 +2647,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>  			progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/true)) {

I'm not sure if converting from a perfectly readable enum to a boolean
is an improvement?

I would just keep the TVA_KHUGEPAGED / TVA_FORCED_COLLAPSE here.

If you want to catch callers passing in something else, you could likely
use a BUILD_BUG_ON in there.


-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support
  2026-02-26  3:26 ` [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support Nico Pache
@ 2026-03-12 21:16   ` David Hildenbrand (Arm)
  2026-03-17 21:36   ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:16 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:26, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
> 
> Prior to this patch, PMD collapse had three main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an allocation phase (mmap unlocked), and finally a heavier
> collapse phase (mmap_write_lock).
> 
> To enable mTHP collapse, we make the following changes:
> 
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
> 
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. The
> algorithm recursively splits the bitmap into smaller chunks to find the
> highest order mTHPs that satisfy the collapse criteria. We start by
> attempting the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
> 
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        satisfies the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat from step (2) until the stack is empty.
> 
> Below is a diagram representing the algorithm and stack items:
> 
>                            offset       mid_offset
>                             |         |
>                             |         |
>                             v         v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
>                             <-------><------->
>                              order-1  order-1
> 
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
> 
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
> 
> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> attempts. There should be no behavior change for PMD collapse.
> 
> Once we determine which mTHP sizes fit best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> 
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order (m)THP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, which on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
> 
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
> 
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---


[...]


>  /**
> @@ -1361,17 +1392,138 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>  
> +static void mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				   u16 offset, u8 order)

Nit: indentation. Same for other functions.

Wondering if you'd want to call these functions

collapse_mthp_*

> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc,
> +						 u16 offset, unsigned long nr_pte_entries)

s/pte_entries/ptes/ ?

> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	unsigned int max_ptes_none, nr_occupied_ptes;
> +	struct mthp_range range;
> +	unsigned long collapse_address;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long nr_pte_entries;

"nr_ptes" ? Any reason for that to be an unsigned long?

> +	u16 offset;
> +	u8 order;
> +
> +	mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size > 0) {
> +		range = mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_pte_entries = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		if (cc->is_khugepaged)
> +			max_ptes_none = collapse_max_ptes_none(order);
> +		else
> +			max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
> +
> +		if (max_ptes_none == -EINVAL)
> +			return collapsed;

With the previous suggested rework, you could likely make this

	max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
	if (max_ptes_none < 0)
		return collapsed;

> +
> +		nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries);
> +
> +		if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, mmap_locked,
> +						 order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_pte_entries;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_pte_entries / 2);
> +
> +			mthp_stack_push(cc, &stack_size, mid_offset, next_order);
> +			mthp_stack_push(cc, &stack_size, offset, next_order);
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>  		unsigned int *cur_progress, struct collapse_control *cc)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
> +	unsigned int max_ptes_none;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>  
> @@ -1384,8 +1536,21 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> +	bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER);
> +	else
> +		max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
> +

I assume that code needs to change as well. If you need help figuring out
how to make it work, please shout.

[...]

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-02-26  3:26 ` [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
  2026-02-26 16:26   ` Usama Arif
@ 2026-03-12 21:19   ` David Hildenbrand (Arm)
  2026-03-17 10:35   ` Lorenzo Stoakes (Oracle)
  2 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:19 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:26, Nico Pache wrote:
> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1c3711ed4513..388d3f2537e2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>  			ret = collapse_huge_page(mm, collapse_address, referenced,
>  						 unmapped, cc, mmap_locked,
>  						 order);
> -			if (ret == SCAN_SUCCEED) {
> +
> +			switch (ret) {
> +			/* Cases where we continue to next collapse candidate */
> +			case SCAN_SUCCEED:
>  				collapsed += nr_pte_entries;
> +				fallthrough;
> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>  				continue;
> +			/* Cases where lower orders might still succeed */
> +			case SCAN_LACK_REFERENCED_PAGE:
> +			case SCAN_EXCEED_NONE_PTE:
> +			case SCAN_EXCEED_SWAP_PTE:
> +			case SCAN_EXCEED_SHARED_PTE:
> +			case SCAN_PAGE_LOCK:
> +			case SCAN_PAGE_COUNT:
> +			case SCAN_PAGE_LRU:
> +			case SCAN_PAGE_NULL:
> +			case SCAN_DEL_PAGE_LRU:
> +			case SCAN_PTE_NON_PRESENT:
> +			case SCAN_PTE_UFFD_WP:
> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +				goto next_order;
> +			/* Cases where no further collapse is possible */
> +			case SCAN_CGROUP_CHARGE_FAIL:
> +			case SCAN_COPY_MC:
> +			case SCAN_ADDRESS_RANGE:
> +			case SCAN_NO_PTE_TABLE:
> +			case SCAN_ANY_PROCESS:
> +			case SCAN_VMA_NULL:
> +			case SCAN_VMA_CHECK:
> +			case SCAN_SCAN_ABORT:
> +			case SCAN_PAGE_ANON:
> +			case SCAN_PMD_MAPPED:
> +			case SCAN_FAIL:
> +			default:
> +				return collapsed;
>  			}
>  		}
>  

LGTM, but I do wonder, given that you have the "default" case, why spell
out the ones that fall into the "default" category? I'd strip those

/* For all other cases no further collapse is possible */
default:
	return collapsed;


Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
  2026-02-26 15:53   ` Usama Arif
@ 2026-03-12 21:22   ` David Hildenbrand (Arm)
  2026-03-17 10:58   ` Lorenzo Stoakes (Oracle)
  2026-03-17 11:36   ` Lance Yang
  3 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-12 21:22 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On 2/26/26 04:26, Nico Pache wrote:
> From: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> If any order (m)THP is enabled we should allow running khugepaged to
> attempt scanning and collapsing mTHPs. In order for khugepaged to operate
> when only mTHP sizes are specified in sysfs, we must modify the predicate
> function that determines whether it ought to run.
> 
> This function is currently called hugepage_pmd_enabled(); this patch
> renames it to hugepage_enabled() and updates the logic to determine
> whether any valid orders may exist which would justify
> khugepaged running.
> 
> We must also update collapse_allowable_orders() to check all orders if
> the vma is anonymous and the collapse is khugepaged.
> 
> After this patch khugepaged mTHP collapse is fully enabled.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

Nothing jumped out at me

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-02-26  3:26 ` [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
  2026-02-26 16:26   ` Usama Arif
  2026-03-12 21:19   ` David Hildenbrand (Arm)
@ 2026-03-17 10:35   ` Lorenzo Stoakes (Oracle)
  2026-03-18 18:59     ` Nico Pache
  2 siblings, 1 reply; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 10:35 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Feb 25, 2026 at 08:26:31PM -0700, Nico Pache wrote:
> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
>
> Signed-off-by: Nico Pache <npache@redhat.com>

With David's concern addressed:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1c3711ed4513..388d3f2537e2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>  			ret = collapse_huge_page(mm, collapse_address, referenced,
>  						 unmapped, cc, mmap_locked,
>  						 order);
> -			if (ret == SCAN_SUCCEED) {
> +
> +			switch (ret) {
> +			/* Cases where we continue to next collapse candidate */
> +			case SCAN_SUCCEED:
>  				collapsed += nr_pte_entries;
> +				fallthrough;
> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>  				continue;
> +			/* Cases where lower orders might still succeed */
> +			case SCAN_LACK_REFERENCED_PAGE:
> +			case SCAN_EXCEED_NONE_PTE:
> +			case SCAN_EXCEED_SWAP_PTE:
> +			case SCAN_EXCEED_SHARED_PTE:
> +			case SCAN_PAGE_LOCK:
> +			case SCAN_PAGE_COUNT:
> +			case SCAN_PAGE_LRU:
> +			case SCAN_PAGE_NULL:
> +			case SCAN_DEL_PAGE_LRU:
> +			case SCAN_PTE_NON_PRESENT:
> +			case SCAN_PTE_UFFD_WP:
> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +				goto next_order;
> +			/* Cases where no further collapse is possible */
> +			case SCAN_CGROUP_CHARGE_FAIL:
> +			case SCAN_COPY_MC:
> +			case SCAN_ADDRESS_RANGE:
> +			case SCAN_NO_PTE_TABLE:
> +			case SCAN_ANY_PROCESS:
> +			case SCAN_VMA_NULL:
> +			case SCAN_VMA_CHECK:
> +			case SCAN_SCAN_ABORT:
> +			case SCAN_PAGE_ANON:
> +			case SCAN_PMD_MAPPED:
> +			case SCAN_FAIL:
> +			default:

Agree with david, let's spell them out please :)

> +				return collapsed;
>  			}
>  		}
>
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
  2026-02-26 15:53   ` Usama Arif
  2026-03-12 21:22   ` David Hildenbrand (Arm)
@ 2026-03-17 10:58   ` Lorenzo Stoakes (Oracle)
  2026-03-18 19:02     ` Nico Pache
  2026-03-17 11:36   ` Lance Yang
  3 siblings, 1 reply; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 10:58 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Feb 25, 2026 at 08:26:50PM -0700, Nico Pache wrote:
> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> If any order (m)THP is enabled we should allow running khugepaged to
> attempt scanning and collapsing mTHPs. In order for khugepaged to operate
> when only mTHP sizes are specified in sysfs, we must modify the predicate
> function that determines whether it ought to run.
>
> This function is currently called hugepage_pmd_enabled(); this patch
> renames it to hugepage_enabled() and updates the logic to determine
> whether any valid orders may exist which would justify
> khugepaged running.
>
> We must also update collapse_allowable_orders() to check all orders if
> the vma is anonymous and the collapse is khugepaged.
>
> After this patch khugepaged mTHP collapse is fully enabled.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

This looks good to me, so:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 388d3f2537e2..e8bfcc1d0c9a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>  		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>  }
>
> -static bool hugepage_pmd_enabled(void)
> +static bool hugepage_enabled(void)
>  {
>  	/*
>  	 * We cover the anon, shmem and the file-backed case here; file-backed
>  	 * hugepages, when configured in, are determined by the global control.
> -	 * Anon pmd-sized hugepages are determined by the pmd-size control.
> +	 * Anon hugepages are determined by their per-size mTHP controls.

Well also PMD right? I mean this terminology sucks because in a sense mTHP
includes PMD... :)

>  	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
>  	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
>  	 */
>  	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>  	    hugepage_global_enabled())
>  		return true;
> -	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
> +	if (READ_ONCE(huge_anon_orders_always))
>  		return true;
> -	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
> +	if (READ_ONCE(huge_anon_orders_madvise))
>  		return true;
> -	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
> +	if (READ_ONCE(huge_anon_orders_inherit) &&
>  	    hugepage_global_enabled())
>  		return true;
>  	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
> @@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
>  static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
>  			vm_flags_t vm_flags, bool is_khugepaged)
>  {
> +	unsigned long orders;
>  	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> -	unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> +	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
> +	if (is_khugepaged && vma_is_anonymous(vma))
> +		orders = THP_ORDERS_ALL_ANON;
> +	else
> +		orders = BIT(HPAGE_PMD_ORDER);
>
>  	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>  }
> @@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> -	    hugepage_pmd_enabled()) {
> +	    hugepage_enabled()) {
>  		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
> @@ -2929,7 +2935,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>
>  static int khugepaged_has_work(void)
>  {
> -	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
> +	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
>  }
>
>  static int khugepaged_wait_event(void)
> @@ -3002,7 +3008,7 @@ static void khugepaged_wait_work(void)
>  		return;
>  	}
>
> -	if (hugepage_pmd_enabled())
> +	if (hugepage_enabled())
>  		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
>  }
>
> @@ -3033,7 +3039,7 @@ static void set_recommended_min_free_kbytes(void)
>  	int nr_zones = 0;
>  	unsigned long recommended_min;
>
> -	if (!hugepage_pmd_enabled()) {
> +	if (!hugepage_enabled()) {
>  		calculate_min_free_kbytes();
>  		goto update_wmarks;
>  	}
> @@ -3083,7 +3089,7 @@ int start_stop_khugepaged(void)
>  	int err = 0;
>
>  	mutex_lock(&khugepaged_mutex);
> -	if (hugepage_pmd_enabled()) {
> +	if (hugepage_enabled()) {
>  		if (!khugepaged_thread)
>  			khugepaged_thread = kthread_run(khugepaged, NULL,
>  							"khugepaged");
> @@ -3109,7 +3115,7 @@ int start_stop_khugepaged(void)
>  void khugepaged_min_free_kbytes_update(void)
>  {
>  	mutex_lock(&khugepaged_mutex);
> -	if (hugepage_pmd_enabled() && khugepaged_thread)
> +	if (hugepage_enabled() && khugepaged_thread)
>  		set_recommended_min_free_kbytes();
>  	mutex_unlock(&khugepaged_mutex);
>  }
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse
  2026-02-26  3:27 ` [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2026-03-17 11:02   ` Lorenzo Stoakes (Oracle)
  2026-03-18 19:08     ` Nico Pache
  0 siblings, 1 reply; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 11:02 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya

On Wed, Feb 25, 2026 at 08:27:06PM -0700, Nico Pache wrote:
> Now that we can collapse to mTHPs, let's update the admin guide to
> reflect these changes and provide proper guidance on how to utilize it.
>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

LGTM, but maybe we should mention somewhere about mTHP's max_ptes_none
behaviour?

Anyway with that addressed:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  Documentation/admin-guide/mm/transhuge.rst | 48 +++++++++++++---------
>  1 file changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index eebb1f6bbc6c..67836c683e8d 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,8 @@ often.
>  THP can be enabled system wide or restricted to certain tasks or even
>  memory ranges inside task's address space. Unless THP is completely
>  disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages of either PMD size
> +or mTHP sizes, if the system is configured to do so.
>
>  The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
>  interface and using madvise(2) and prctl(2) system calls.
> @@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
>  	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>  	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>
> -khugepaged will be automatically started when PMD-sized THP is enabled
> +khugepaged will be automatically started when any THP size is enabled
>  (either of the per-size anon control or the top-level control are set
>  to "always" or "madvise"), and it'll be automatically shutdown when
> -PMD-sized THP is disabled (when both the per-size anon control and the
> +all THP sizes are disabled (when both the per-size anon control and the
>  top-level control are "never")
>
>  process THP controls
> @@ -264,11 +265,6 @@ support the following arguments::
>  Khugepaged controls
>  -------------------
>
> -.. note::
> -   khugepaged currently only searches for opportunities to collapse to
> -   PMD-sized THP and no attempt is made to collapse to other THP
> -   sizes.
> -
>  khugepaged runs usually at low frequency so while one may not want to
>  invoke defrag algorithms synchronously during the page faults, it
>  should be worth invoking defrag at least in khugepaged. However it's
> @@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
>  The khugepaged progress can be seen in the number of pages collapsed (note
>  that this counter may not be an exact count of the number of pages
>  collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
> -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
> -one 2M hugepage. Each may happen independently, or together, depending on
> -the type of memory and the failures that occur. As such, this value should
> -be interpreted roughly as a sign of progress, and counters in /proc/vmstat
> -consulted for more accurate accounting)::
> +being replaced by a PMD mapping, or (2) physical pages replaced by one
> +hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
> +or together, depending on the type of memory and the failures that occur.
> +As such, this value should be interpreted roughly as a sign of progress,
> +and counters in /proc/vmstat consulted for more accurate accounting)::
>
>  	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
>
> @@ -308,16 +304,19 @@ for each pass::
>
>  	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
>
> -``max_ptes_none`` specifies how many extra small pages (that are
> -not already mapped) can be allocated when collapsing a group
> -of small pages into one large page::
> +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
> +when collapsing a group of small pages into one large page::
>
>  	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>
> -A higher value leads to use additional memory for programs.
> -A lower value leads to gain less thp performance. Value of
> -max_ptes_none can waste cpu time very little, you can
> -ignore it.
> +For PMD-sized THP collapse, this directly limits the number of empty pages
> +allowed in the 2MB region. For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1)
> +are supported. Any other value will emit a warning and no mTHP collapse
> +will be attempted.
> +
> +A higher value allows more empty pages, potentially leading to more memory
> +usage but better THP performance. A lower value is more conservative and
> +may result in fewer THP collapses.
>
>  ``max_ptes_swap`` specifies how many pages can be brought in from
>  swap when collapsing a group of pages into a transparent huge page::
> @@ -337,6 +336,15 @@ that THP is shared. Exceeding the number would block the collapse::
>
>  A higher value may increase memory footprint for some workloads.
>
> +.. note::
> +   For mTHP collapse, khugepaged does not support collapsing regions that
> +   contain shared or swapped out pages, as this could lead to continuous
> +   promotion to higher orders. The collapse will fail if any shared or
> +   swapped PTEs are encountered during the scan.
> +
> +   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> +   and does not attempt mTHP collapses.
> +
>  Boot parameters
>  ===============
>
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
                     ` (2 preceding siblings ...)
  2026-03-17 10:58   ` Lorenzo Stoakes (Oracle)
@ 2026-03-17 11:36   ` Lance Yang
  2026-03-18 19:07     ` Nico Pache
  3 siblings, 1 reply; 45+ messages in thread
From: Lance Yang @ 2026-03-17 11:36 UTC (permalink / raw)
  To: baolin.wang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe


On Wed, Feb 25, 2026 at 08:26:50PM -0700, Nico Pache wrote:
>From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
>If any order (m)THP is enabled we should allow running khugepaged to
>attempt scanning and collapsing mTHPs. In order for khugepaged to operate
>when only mTHP sizes are specified in sysfs, we must modify the predicate
>function that determines whether it ought to run.
>
>This function is currently called hugepage_pmd_enabled(), this patch
>renames it to hugepage_enabled() and updates the logic to check to
>determine whether any valid orders may exist which would justify
>khugepaged running.
>
>We must also update collapse_allowable_orders() to check all orders if
>the vma is anonymous and the collapse is khugepaged.
>
>After this patch khugepaged mTHP collapse is fully enabled.
>
>Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 388d3f2537e2..e8bfcc1d0c9a 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
> 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
> }
> 
>-static bool hugepage_pmd_enabled(void)
>+static bool hugepage_enabled(void)
> {
> 	/*
> 	 * We cover the anon, shmem and the file-backed case here; file-backed
> 	 * hugepages, when configured in, are determined by the global control.
>-	 * Anon pmd-sized hugepages are determined by the pmd-size control.
>+	 * Anon hugepages are determined by its per-size mTHP control.
> 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
> 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> 	 */
> 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> 	    hugepage_global_enabled())
> 		return true;
>-	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>+	if (READ_ONCE(huge_anon_orders_always))
> 		return true;
>-	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
>+	if (READ_ONCE(huge_anon_orders_madvise))
> 		return true;
>-	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>+	if (READ_ONCE(huge_anon_orders_inherit) &&
> 	    hugepage_global_enabled())
> 		return true;
> 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
>@@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
> static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> 			vm_flags_t vm_flags, bool is_khugepaged)
> {
>+	unsigned long orders;
> 	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>-	unsigned long orders = BIT(HPAGE_PMD_ORDER);
>+
>+	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
>+	if (is_khugepaged && vma_is_anonymous(vma))
>+		orders = THP_ORDERS_ALL_ANON;
>+	else
>+		orders = BIT(HPAGE_PMD_ORDER);
> 
> 	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> }

IIUC, an anonymous VMA can pass collapse_allowable_orders() even if it
is smaller than 2MB ...

But collapse_scan_mm_slot() still scans only full PMD-sized windows:

		hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
		hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
		if (khugepaged_scan.address > hend) {
			cc->progress++;
			continue;
		}

and hugepage_vma_revalidate() still requires PMD suitability:

	/* Always check the PMD order to ensure its not shared by another VMA */
	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
		return SCAN_ADDRESS_RANGE;


>@@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> 			  vm_flags_t vm_flags)
> {
> 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>-	    hugepage_pmd_enabled()) {
>+	    hugepage_enabled()) {
> 		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
> 			__khugepaged_enter(vma->vm_mm);

I wonder if we should also require at least one PMD-sized scan window
here? Not a big deal, just might be good to tighten the gate a bit :)

Apart from that, LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-02-26  3:24 ` [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2026-03-17 16:51   ` Lorenzo Stoakes (Oracle)
  2026-03-17 17:16     ` Randy Dunlap
  0 siblings, 1 reply; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 16:51 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Feb 25, 2026 at 08:24:27PM -0700, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in

The '--' seems weird here :) maybe meant to be ' - '?

> the mTHP case this is not true, and we must keep the lock to prevent
> changes to the VMA from occurring.

You mean changes to the page tables right? rmap won't alter VMA parameters
without a VMA lock. Better to be specific.

>
> Also convert these BUG_ON's to WARN_ON_ONCE's as these conditions, while
> unexpected, should not bring down the system.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 102 +++++++++++++++++++++++++++++-------------------
>  1 file changed, 62 insertions(+), 40 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 99f78f0e44c6..fb3ba8fe5a6c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1150,44 +1150,53 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  	return SCAN_SUCCEED;
>  }
>
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -		int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		bool *mmap_locked, unsigned int order)

This is getting horrible, could we maybe look at passing through a helper
struct or something?

>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL;
>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	enum scan_result result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
> +	const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;

We have start_addr and pmd_address, let's make our mind up and call both
either addr or address please.

>
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);

You just masked this with HPAGE_PMD_MASK then check & ~HPAGE_PMD_MASK? :)

Can we just drop it? :)

>
>  	/*
>  	 * Before allocating the hugepage, release the mmap_lock read lock.
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>  	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}

If you use a helper struct you can write a function that'll do both of
these at once, E.g.:

static void scan_mmap_unlock(struct scan_state *scan)
{
	if (!scan->mmap_locked)
		return;

	mmap_read_unlock(scan->mm);
	scan->mmap_locked = false;
}

	...

	scan_mmap_unlock(scan_state);

>
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);

Be nice to add a /*expect_anon=*/true, here so we can read what parameter
that is at a glance.

>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
> +		*mmap_locked = false;
>  		goto out_nolock;
>  	}
>
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
> +		*mmap_locked = false;
>  		goto out_nolock;
>  	}
>
> @@ -1197,13 +1206,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> -		if (result != SCAN_SUCCEED)
> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> +						     referenced, order);
> +		if (result != SCAN_SUCCEED) {
> +			*mmap_locked = false;
>  			goto out_nolock;
> +		}
>  	}
>
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>  	/*
>  	 * Prevent all access to pagetables with the exception of
>  	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1213,20 +1225,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);

Hmm you take an mmap... write lock here then don't set *mmap_locked =
true... It's inconsistent and bug-prone.

I'm also seriously not a fan of switching between mmap read and write lock
here but keeping an *mmap_locked parameter here which is begging for a bug.

In general though, you seem to always make sure in the (fairly hideous
honestly) error goto labels to have the mmap lock dropped, so what is the
point in keeping the *mmap_locked parameter updated throughout this anyway?

Are we ever exiting with it set? If not why not drop the parameter/helper
struct field and just have the caller understand that it's dropped on exit
(and document that).

Since you're just dropping the lock on entry, why not have the caller do
that and document that you have to enter unlocked anyway?


> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_address, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;

Again with a helper struct you can abstract this and avoid more noise.

E.g. scan_anon_vma_lock_write(scan);

>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				start_addr + (PAGE_SIZE << order));

I hate this open-coded 'start_addr + (PAGE_SIZE << order)' construct.

If you use a helper struct (theme here :) you could have a macro that
generates it set an end param to this.


>  	mmu_notifier_invalidate_range_start(&range);
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1238,24 +1250,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);

Will this work correctly with the non-PMD aligned start_addr?

>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);
>  		spin_lock(pmd_ptl);
>  		BUG_ON(!pmd_none(*pmd));

Can we downgrade to WARN_ON_ONCE() as we pass by any BUG_ON()'s please?
Since we're churning here anyway it's worth doing :)

>  		/*
> @@ -1265,21 +1274,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse we must hold the lock

This is really unclear. What does 'can't run anymore' mean? Why must we
hold the lock for mTHP?

I realise the previous comment was equally as unclear but let's make this
make sense please :)

>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (is_pmd_order(order)) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, start_addr, pte_ptl,
> +					   order, &compound_pagelist);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>
> @@ -1289,20 +1298,34 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> +	if (is_pmd_order(order)) { /* PMD collapse */

At this point we still hold the pte lock, is that intended? Are we sure
there won't be any issues leaving it held during the operations that now
happen before you release it?

> +		pgtable = pmd_pgtable(_pmd);
>
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +		spin_lock(pmd_ptl);
> +		WARN_ON_ONCE(!pmd_none(*pmd));
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);

If we're PMD order start_addr == pmd_address right?

> +	} else { /* mTHP collapse */
> +		spin_lock(pmd_ptl);
> +		WARN_ON_ONCE(!pmd_none(*pmd));

You duplicate both of these lines in both branches, pull them out?

> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */

It'd be much nicer to call pmd_install() :)

Or maybe even to separate out the unlocked bit from pmd_install(), put that
in e.g. __pmd_install(), then use that after lock acquired?

> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +	}
>  	spin_unlock(pmd_ptl);
>
>  	folio = NULL;

Not your code but... why? I guess to avoid the folio_put() below but
gross. Anyway this function needs refactoring, can be a follow up.

>
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);

Again can be helped with helper struct :)

>  	mmap_write_unlock(mm);
> +	*mmap_locked = false;

And this... I also hate the break from if (*mmap_locked) ... etc.

>  out_nolock:
> +	WARN_ON_ONCE(*mmap_locked);

Should be a VM_WARN_ON_ONCE() if we keep it.

>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> @@ -1483,9 +1506,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> -		/* collapse_huge_page will return with the mmap_lock released */

Hm except this is true :) We also should probably just unlock before
entering as mentioned before.

> -		*mmap_locked = false;
> +					    unmapped, cc, mmap_locked,
> +					    HPAGE_PMD_ORDER);
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.53.0
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
  2026-02-26  3:25 ` [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
  2026-03-12 21:03   ` David Hildenbrand (Arm)
@ 2026-03-17 17:05   ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 17:05 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Feb 25, 2026 at 08:25:04PM -0700, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> 	PTEs
>
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
>
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
>   	PTEs
>
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
>
> As we currently don't support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 16 ++++++++++++---
>  4 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..eebb1f6bbc6c 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,30 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold. For mTHP collapse, currently only max_ptes_none
> +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
> +       emit a warning and no mTHP collapse will be attempted. khugepaged will

It's weird to document this here but not elsewhere in the document? I mean I
made this comment on the documentation patch also.

Not sure if I missed you adding it to another bit of the docs? :)

> +       try to collapse to the largest enabled (m)THP size; if it fails, it will
> +       try the next lower enabled mTHP size. This counter records the number of
> +       times a collapse attempt was skipped for exceeding the max_ptes_none
> +       threshold, and khugepaged will move on to the next available mTHP size.
> +
> +collapse_exceed_swap_pte
> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> +       to containing at least one swap PTE. Currently khugepaged does not
> +       support collapsing mTHP regions that contain a swap PTE. This counter can
> +       be used to monitor the number of khugepaged mTHP collapses that failed
> +       due to the presence of a swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> +       to containing at least one shared PTE. Currently khugepaged does not
> +       support collapsing mTHP PTE ranges that contain a shared PTE. This
> +       counter can be used to monitor the number of khugepaged mTHP collapses
> +       that failed due to the presence of a shared PTE.

All of these talk about 'ranges' that could be of any size. Are these useful
metrics? Counting a bunch of failures and not knowing if they are 256 KB
failures or 16 KB failures or whatever is maybe not so useful information?

Also, from the code, aren't you treating PMD events the same as mTHP ones from
the point of view of these counters? Maybe worth documenting that?

> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9941fc6d7bd8..e8777bb2347d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 228f35e962b9..1049a207a257 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);

Is there a reason there's such a difference between the names and the actual
enum names?

> +
>
>  static struct attribute *anon_stats_attrs[] = {
>  	&anon_fault_alloc_attr.attr,
> @@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
>  	&split_deferred_attr.attr,
>  	&nr_anon_attr.attr,
>  	&nr_anon_partially_mapped_attr.attr,
> +	&collapse_exceed_swap_pte_attr.attr,
> +	&collapse_exceed_none_pte_attr.attr,
> +	&collapse_exceed_shared_pte_attr.attr,
>  	NULL,
>  };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c739f26dd61e..a6cf90e09e4a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);

It's a bit gross to have separate stats for both thp and mthp but maybe
unavoidable from a legacy stand point.

Why are we dropping the _PTE suffix?

>  				goto out;
>  			}
>  		}
> @@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			 * shared may cause a future higher order collapse on a
>  			 * rescan of the same range.
>  			 */
> -			if (!is_pmd_order(order) || (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared)) {

OK losing track here :) as the series sadly doesn't currently apply so I can't
browse the file as-is.

In the code I'm looking at, there's also a ++shared here that I guess another
patch removed?

Is this in the folio_maybe_mapped_shared() branch?

> +			if (!is_pmd_order(order)) {
> +				result = SCAN_EXCEED_SHARED_PTE;
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +				goto out;
> +			}
> +
> +			if (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;

Anyway I'm a bit lost on this logic until a respin but this looks like a LOT of
code duplication. I see David alluded to a refactoring so maybe what he suggests
will help (not had a chance to check what it is specifically :P)

>  			}
>  		}
> @@ -1081,6 +1090,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);

Hmm I thought we were incrementing mthp stats for pmd sized also?

>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> --
> 2.53.0
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-02-26  3:25 ` [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
  2026-03-12 21:09   ` David Hildenbrand (Arm)
@ 2026-03-17 17:08   ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 17:08 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Feb 25, 2026 at 08:25:42PM -0700, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
>
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2e66d660ee8e..2fdfb6d42cf9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -486,12 +486,22 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
>  	return -EINVAL;
>  }
>
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +			vm_flags_t vm_flags, bool is_khugepaged)

You're always passing vma->vm_flags, maybe just pass vma and you can grab
vma->vm_flags?

Really it would be better for it to be &vma->flags, but probably best to wait
for me to do a follow up VMA flags series for that.

> +{
> +	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);

Const?

Also not sure if we decided BIT() was right here or not :P Fine by me though.

> +
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))

I agree with David, let's pass through the enum value please :)

>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -2637,7 +2647,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>  			progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/true)) {
>  			progress++;
>  			continue;
>  		}
> @@ -2949,7 +2959,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	BUG_ON(vma->vm_start > start);
>  	BUG_ON(vma->vm_end < end);
>
> -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> +	if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/false))
>  		return -EINVAL;
>
>  	cc = kmalloc_obj(*cc);
> --
> 2.53.0
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-03-17 16:51   ` Lorenzo Stoakes (Oracle)
@ 2026-03-17 17:16     ` Randy Dunlap
  0 siblings, 0 replies; 45+ messages in thread
From: Randy Dunlap @ 2026-03-17 17:16 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, richard.weiyang, rientjes, rostedt, rppt, ryan.roberts,
	shivankg, sunnanyong, surenb, thomas.hellstrom, tiwai,
	usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will, willy,
	yang, ying.huang, ziy, zokeefe



On 3/17/26 9:51 AM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Feb 25, 2026 at 08:24:27PM -0700, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> The '--' seems weird here 🙂 maybe meant to be ' - '?

"--" is common typewriter(!) style for "dash".
Single "-" is a hyphen.

-- 
~Randy



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support
  2026-02-26  3:26 ` [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support Nico Pache
  2026-03-12 21:16   ` David Hildenbrand (Arm)
@ 2026-03-17 21:36   ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 21:36 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Feb 25, 2026 at 08:26:05PM -0700, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Previous to this patch, PMD collapse had 3 main phases, a light weight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, a alloc phase (mmap unlocked), then finally heavier collapse

-> an alloc phase

> phase (mmap_write_lock).
>
> To enabled mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.

Right, but you reinstate it later, right? :) Though now it's simpler as we have
only two modes.

>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. The

Maybe worth saying due to limited kernel stack size.

> algorithm recursively splits the bitmap into smaller chunks to find the
> highest order mTHPs that satisfy the collapse criteria. We start by
> attempting the PMD order, then moved on the consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of

Probably worth saying the stack is kept in the collapse_control now?

Having nitted all this, thanks for writing this up much appreciated :)

> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        statisfy the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
>                            offset       mid_offset
>                             |         |
>                             |         |
>                             v         v
>           __________________^_________________ <-- I think better as -
>          |          PTE Page.Table            |
>          -------------------.------------------
> 			    <-.-----><------->
>                           ^ .order-1  order- <-- not sure
                            . .            if I accidentally
                            . .       deleted or missing :P
                            . .
			    . .. Doesn't line up with..|
			    .this......................|
                                       ^
                                       |
                                       |
That's nice, but as a connoisseur of the ASCII diagram, a few nits :)

I also wonder if the offset, mid-offset is correct to put there?

Because you start off with:

(0, HPAGE_PMD_ORDER)

       offset
          |
          |
          v
          |------------------------------------|
          |                 PTE                |
          |------------------------------------|
	  <------------------------------------>
	                1 << order

Right? Trying to get the PMD sized one, then:

(offset=0, order=HPAGE_PMD_ORDER -1)
(mid_offset=HPAGE_PMD_NR >> 1, order=HPAGE_PMD_ORDER -1)

       offset           mid_offset
          |                  |
          |                  |
          v                  v
          |------------------------------------|
          |                 PTE                |
          |------------------------------------|
	  <------------------><---------------->
	       1 << order          1 << order

And etc.

So probably worth making that clear.
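To double-check my own reading of the algorithm, it can be sketched as plain
userspace C with toy sizes (a "PMD" of 16 PTEs, minimum order 2); all names
here are illustrative rather than the kernel's, max_ptes_none is taken as 0 so
only fully occupied ranges collapse, and "collapse" always succeeds:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: 16-PTE "PMD" (order 4), minimum collapse order 2. */
#define TOY_PMD_ORDER	4
#define TOY_NR_PTES	(1 << TOY_PMD_ORDER)
#define TOY_MIN_ORDER	2

struct toy_range { uint16_t offset; uint8_t order; };

/* Count occupied PTEs in [offset, offset + nr) of a 16-bit bitmap. */
static int toy_count_present(uint16_t bitmap, unsigned int offset,
			     unsigned int nr)
{
	unsigned int mask = (nr >= 16 ? 0xFFFFu : (1u << nr) - 1) << offset;

	return __builtin_popcount(bitmap & mask);
}

/*
 * Consume the bitmap: returns the number of PTEs "collapsed" and records
 * each collapsed (offset, order) pair in out[]; *nr_out must be 0 on entry.
 */
static int toy_scan_bitmap(uint16_t bitmap, struct toy_range *out, int *nr_out)
{
	struct toy_range stack[1 << (TOY_PMD_ORDER - TOY_MIN_ORDER)];
	int top = 0, collapsed = 0;

	/* Step 1: push (0, PMD order). */
	stack[top++] = (struct toy_range){ 0, TOY_PMD_ORDER };
	while (top) {
		/* Step 2: pop. */
		struct toy_range r = stack[--top];
		unsigned int nr = 1u << r.order;

		/* Steps 3-4: fully occupied -> collapse at this order. */
		if (toy_count_present(bitmap, r.offset, nr) == (int)nr) {
			out[(*nr_out)++] = r;
			collapsed += nr;
			continue;
		}
		/* Step 5: split into left/right halves at the next order. */
		if (r.order > TOY_MIN_ORDER) {
			uint8_t next = r.order - 1;

			/* Right half pushed first so the left pops first. */
			stack[top++] = (struct toy_range){ r.offset + nr / 2, next };
			stack[top++] = (struct toy_range){ r.offset, next };
		}
	}
	return collapsed;
}
```

(This only models the control flow; the real code of course also has to handle
collapse failures, locking, and revalidation.)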

>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order

Probably worth slightly expanding on what introducing 'new empty pages'
entails in practice here.

>
> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> attempts. There should be no behavior change for PMD collapse.

Maybe worth saying we are doing this for the time being as it avoids issues
with the algorithm tending towards PMD collapse etc.?

>
> Once we determine what mTHP sizes fits best in that PMD range a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order (m)THP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will

I think it's confusing mentioning this here; probably better to move it
above.

> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.

Overall a great commit message and THANK YOU so much for that excellent
description of the algorithm, much appreciated!

>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 189 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 180 insertions(+), 9 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2fdfb6d42cf9..1c3711ed4513 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,32 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * The maximum number of mTHP ranges that can be stored on the stack.
> + * This is calculated based on the number of PTE entries in a PTE page table
> + * and the minimum mTHP order.
> + *
> + * ilog2(MAX_PTRS_PER_PTE) is log2 of the maximum number of PTE entries.

I think this line is superfluous and can be removed :)

> + * This gives you the PMD_ORDER, and is needed in place of HPAGE_PMD_ORDER due
> + * to restrictions of some architectures (ie ppc64le).

Hm this is vague, why exactly?

> + *
> + * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges

Maybe worth moving this around and saying 'the absolute maximum number of mTHP
ranges we can encounter is 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER),
but due to some architectures restricting the number of pointers per PTE,
we have to derive the actual maximum from MAX_PTRS_PER_PTE'

And then add a paragraph on why arches do this (if it's just PPC, just
replace arches with PPC) etc.

> + */
> +#define MTHP_STACK_SIZE	(1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))

I think it'd be nice to have a separate define for ilog2(MAX_PTRS_PER_PTE),
maybe MAX_ORDER_PER_PTE?

> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for (m)THP collapse.

Probably we can drop the parens :)

> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>
> @@ -107,6 +133,11 @@ struct collapse_control {
>
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* bitmap used for mTHP collapse */

This is still super vague. Also you have 2 bitmaps here :)

Something like:

	/* Each bit set represents a present PTE entry  */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
	/* A mask of the current range being considered for mTHP collapse */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];

This time no heart attack as I know this is not on the stack (despite a
misunderstanding on a series where somebody did seem to want to try to do
that recently to reintroduce a coronary event :P)

>  };
>
>  /**
> @@ -1361,17 +1392,138 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>
> +static void mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				   u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}


I know I love helper structs, but I wonder if we shouldn't have a broader
stack object like:

struct mthp_stack_state {
	struct mthp_range *arr; // assigned to cc->mthp_bitmap-stack
	int size;
};

And just make these functions like:

static void mthp_stack_push(struct mthp_stack_state *stack, u16 offset, u8 order)
{
	struct mthp_range *range;

	VM_WARN_ON_ONCE(stack->size >= MTHP_STACK_SIZE);
	range = &stack->arr[stack->size++];
	range->order = order;
	range->offset = offset;
}

static struct mthp_range mthp_stack_pop(struct mthp_stack_state *stack)
{
	VM_WARN_ON_ONCE(stack->size <= 0);

	return stack->arr[--stack->size];
}

I also folded some other cleanups into that; I think e.g. doing (*stack_size)--,
then indexing into the array with size - 1 (where size was copied beforehand),
is overwrought.

> +
> +static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc,

This name is a bit overwrought. pte_count_present()? I think we maybe don't
even need a mthp_ prefix given the params should give context?

> +						 u16 offset, unsigned long nr_pte_entries)

This line is super long, can we just put the 2nd line 2 tabs indented?

> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);

Why are we zeroing HPAGE_PMD_NR bits, but mthp_bitmap_mask has
MAX_PTRS_PER_PTE entries?

I thought the comment above was saying how MAX_PTRS_PER_PTE can be <
HPAGE_PMD_NR right? So couldn't this be problematic on ppc?

We should be zeroing the same number of bits as is defined for the bitmap.

> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> +}

We could pass in a stack_state pointer if we add pointers to
cc->mthp_bitmap_mask and cc->mthp_bitmap to the struct:

static unsigned int pte_count_present_top(struct stack_state *stack)
{
	struct mthp_range *range = &stack->arr[stack->size - 1];

	VM_WARN_ON_ONCE(!stack->size);

	/* Set up mask for the range [offset, offset + (1 << order)). */
	bitmap_zero(stack->mask, MAX_PTRS_PER_PTE);
	bitmap_set(stack->mask, range->offset, 1U << range->order);

	/* Hamming weight of mask & bitmap = count of PTE entries in range. */
	return bitmap_weight_and(stack->bitmap, stack->mask, MAX_PTRS_PER_PTE);
}

Which could also simplify the calling code a lot, which then could pop
afterwards, leaving this to read the top of the stack.

Then again, you use order, nr_pte_entries elsewhere so could be:

static unsigned int pte_count_present(struct stack_state *stack,
		struct mthp_range *range) { ... }

Instead?
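As a sanity check on the mask-and-weight trick, here's a minimal userspace
analogue of the bitmap_zero() + bitmap_set() + bitmap_weight_and() sequence,
using a toy single-word bitmap (names illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many bits in [offset, offset + nr) are set in the "present"
 * bitmap: build a mask covering the range, AND it with the bitmap, and
 * take the Hamming weight (popcount) of the result.
 */
static unsigned int count_present_in_range(uint64_t present,
					   unsigned int offset, unsigned int nr)
{
	/* Equivalent of bitmap_zero() + bitmap_set(mask, offset, nr). */
	uint64_t mask = (nr >= 64 ? ~0ULL : (1ULL << nr) - 1) << offset;

	/* Equivalent of bitmap_weight_and(present, mask, nbits). */
	return (unsigned int)__builtin_popcountll(present & mask);
}
```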

> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.

Probably worth dropping the (!none/zero) bit for clarity.

> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> +
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.

So this is pretty much the same as the commit message and obviously my
comments are the same here as for there :)

But again thanks for doing this!

> + */
> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	unsigned int max_ptes_none, nr_occupied_ptes;
> +	struct mthp_range range;
> +	unsigned long collapse_address;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long nr_pte_entries;
> +	u16 offset;
> +	u8 order;

I hate to now be the one saying it since somebody got me OCD about it
before, and it doesn't really matter and is a nit, but like reverse xmas
tree would be nice :P

> +
> +	mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size > 0) {

I think we can be kernel-y and make this:

while (stack_size)

As going < 0 would be a bug anyway right?

> +		range = mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_pte_entries = 1UL << order;

See above idea for using stack state type and just reading off top of stack
in calculation.

> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		if (cc->is_khugepaged)
> +			max_ptes_none = collapse_max_ptes_none(order);
> +		else
> +			max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;

Hm, should we even be executing this loop at all for MADV_COLLAPSE? Could
we just separate that out as its own thing that just does a PMD-sized entry
and simplify this?

But then hmm you use order, nr_pte_entries elsewhere so maybe could just
pop range and pass that in with stack_state ptr?

> +
> +		if (max_ptes_none == -EINVAL)

Shouldn't we rather do something like IS_ERR_VALUE(max_ptes_none)?

> +			return collapsed;

> +
> +		nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries);
> +
> +		if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) {

Be nicer to have nr_pte_entries - max_ptes_none pre-calculated as a value,
e.g. min_occupied_ptes?

> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, mmap_locked,
> +						 order);

My kingdom for a helper struct here :)

> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_pte_entries;
> +				continue;
> +			}

I guess we don't care about which flavour of scan failure happens here?

> +		}
> +
> +next_order:
> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_pte_entries / 2);
> +
> +			mthp_stack_push(cc, &stack_size, mid_offset, next_order);
> +			mthp_stack_push(cc, &stack_size, offset, next_order);

All this could be a helper function, like:

static void push_next_order_range(struct stack_state *stack, u8 order,
		u16 offset)
{
	const u8 next_order = order - 1;

	/* Push the right half first so the left half is popped first. */
	mthp_stack_push(stack, offset + (1 << next_order), next_order);
	mthp_stack_push(stack, offset, next_order);
}

> +		}
> +	}
> +	return collapsed;
> +}

Overall MUCH more understandable thanks for that!

> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>  		unsigned int *cur_progress, struct collapse_control *cc)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
> +	unsigned int max_ptes_none;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;

Kinda hate how much state we're putting here throughout function scope, but we
can address that in follow-up cleanups I guess.

>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;

>
> @@ -1384,8 +1536,21 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> +	bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);

Again, shouldn't this be MAX_PTRS_PER_PTE?

Be nicer to separate into a helper function esp if you have a stack_state
object, to initialise it separately rather than having a random open-coded
bitmap_zero() here.

>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);

This line is too long :) please keep to a max of 80 columns, with some small
exceptions going over by like 1 or 2 chars.

> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))

Isn't BIT(HPAGE_PMD_ORDER) the same as HPAGE_PMD_NR?

> +		max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER);
> +	else
> +		max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
> +

Could we separate this out into a helper function please, rather than
piling on more open-coded stuff?

>  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>  	if (!pte) {
>  		if (cur_progress)
> @@ -1394,17 +1559,18 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;

(Still hate this underscore b.s. but it's legacy stuff we should deal with
separately)

> +		addr = start_addr + i * PAGE_SIZE;
> +		pte_t pteval = ptep_get(_pte);

Err, you're declaring a variable below two plain assignments? That's not kernel
style :)

Should be:

<type decls>
newline
<everything else>

> +
>  		if (cur_progress)
>  			*cur_progress += 1;
>
> -		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			++none_or_zero;
>  			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||

Why are we dropping this?

> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			    none_or_zero <= max_ptes_none) {
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1478,6 +1644,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>
> +		/* Set bit for occupied pages */
> +		bitmap_set(cc->mthp_bitmap, i, 1);

Again, let's use a helper for this kind of thing rather than open
coding. The stack_state object will help.

>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to cc->node_load[].
> @@ -1534,9 +1702,12 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, mmap_locked,
> -					    HPAGE_PMD_ORDER);
> +		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> +					      cc, mmap_locked, enabled_orders);
> +		if (nr_collapsed > 0)

if (nr_collapsed) is more kernelly :)

> +			result = SCAN_SUCCEED;
> +		else
> +			result = SCAN_FAIL;

I mean maybe we can just use a ?: assignment here like:

result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;

?

>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.53.0
>

Overall we're much, much closer now to having this series in. Sorry for the
delays etc. in review, but now I've (finally) got to review everything, and
David has had a look too at both series, so we are homing in on completion!

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-03-17 10:35   ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 18:59     ` Nico Pache
  2026-03-18 19:48       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-03-18 18:59 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), david
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe



On 3/17/26 4:35 AM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Feb 25, 2026 at 08:26:31PM -0700, Nico Pache wrote:
>> There are cases where, if an attempted collapse fails, all subsequent
>> orders are guaranteed to also fail. Avoid these collapse attempts by
>> bailing out early.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> With David's concern addressed:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> 
>> ---
>>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>>  1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 1c3711ed4513..388d3f2537e2 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>>  			ret = collapse_huge_page(mm, collapse_address, referenced,
>>  						 unmapped, cc, mmap_locked,
>>  						 order);
>> -			if (ret == SCAN_SUCCEED) {
>> +
>> +			switch (ret) {
>> +			/* Cases were we continue to next collapse candidate */
>> +			case SCAN_SUCCEED:
>>  				collapsed += nr_pte_entries;
>> +				fallthrough;
>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>>  				continue;
>> +			/* Cases were lower orders might still succeed */
>> +			case SCAN_LACK_REFERENCED_PAGE:
>> +			case SCAN_EXCEED_NONE_PTE:
>> +			case SCAN_EXCEED_SWAP_PTE:
>> +			case SCAN_EXCEED_SHARED_PTE:
>> +			case SCAN_PAGE_LOCK:
>> +			case SCAN_PAGE_COUNT:
>> +			case SCAN_PAGE_LRU:
>> +			case SCAN_PAGE_NULL:
>> +			case SCAN_DEL_PAGE_LRU:
>> +			case SCAN_PTE_NON_PRESENT:
>> +			case SCAN_PTE_UFFD_WP:
>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
>> +				goto next_order;
>> +			/* Cases were no further collapse is possible */
>> +			case SCAN_CGROUP_CHARGE_FAIL:
>> +			case SCAN_COPY_MC:
>> +			case SCAN_ADDRESS_RANGE:
>> +			case SCAN_NO_PTE_TABLE:
>> +			case SCAN_ANY_PROCESS:
>> +			case SCAN_VMA_NULL:
>> +			case SCAN_VMA_CHECK:
>> +			case SCAN_SCAN_ABORT:
>> +			case SCAN_PAGE_ANON:
>> +			case SCAN_PMD_MAPPED:
>> +			case SCAN_FAIL:
>> +			default:
> 
> Agree with david, let's spell them out please :)

I believe David is arguing for the opposite: to drop all these spelt-out cases
and just leave the default case.

@david is that correct or did I misunderstand that.

-- Nico

> 
>> +				return collapsed;
>>  			}
>>  		}
>>
>> --
>> 2.53.0
>>
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-03-17 10:58   ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 19:02     ` Nico Pache
  0 siblings, 0 replies; 45+ messages in thread
From: Nico Pache @ 2026-03-18 19:02 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe



On 3/17/26 4:58 AM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Feb 25, 2026 at 08:26:50PM -0700, Nico Pache wrote:
>> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> If any order (m)THP is enabled we should allow running khugepaged to
>> attempt scanning and collapsing mTHPs. In order for khugepaged to operate
>> when only mTHP sizes are specified in sysfs, we must modify the predicate
>> function that determines whether it ought to run to do so.
>>
>> This function is currently called hugepage_pmd_enabled(), this patch
>> renames it to hugepage_enabled() and updates the logic to check to
>> determine whether any valid orders may exist which would justify
>> khugepaged running.
>>
>> We must also update collapse_allowable_orders() to check all orders if
>> the vma is anonymous and the collapse is khugepaged.
>>
>> After this patch khugepaged mTHP collapse is fully enabled.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> This looks good to me, so:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thanks!

> 
>> ---
>>  mm/khugepaged.c | 30 ++++++++++++++++++------------
>>  1 file changed, 18 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 388d3f2537e2..e8bfcc1d0c9a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>>  		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>>  }
>>
>> -static bool hugepage_pmd_enabled(void)
>> +static bool hugepage_enabled(void)
>>  {
>>  	/*
>>  	 * We cover the anon, shmem and the file-backed case here; file-backed
>>  	 * hugepages, when configured in, are determined by the global control.
>> -	 * Anon pmd-sized hugepages are determined by the pmd-size control.
>> +	 * Anon hugepages are determined by their per-size mTHP controls.
> 
> Well also PMD right? I mean this terminology sucks because in a sense mTHP
> includes PMD... :)

Yeah, it's kinda hard with our verbiage being so broad and overlapping sometimes.

> 
>>  	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
>>  	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
>>  	 */
>>  	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>>  	    hugepage_global_enabled())
>>  		return true;
>> -	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>> +	if (READ_ONCE(huge_anon_orders_always))
>>  		return true;
>> -	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
>> +	if (READ_ONCE(huge_anon_orders_madvise))
>>  		return true;
>> -	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>> +	if (READ_ONCE(huge_anon_orders_inherit) &&
>>  	    hugepage_global_enabled())
>>  		return true;
>>  	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
>> @@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
>>  static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
>>  			vm_flags_t vm_flags, bool is_khugepaged)
>>  {
>> +	unsigned long orders;
>>  	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>> -	unsigned long orders = BIT(HPAGE_PMD_ORDER);
>> +
>> +	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
>> +	if (is_khugepaged && vma_is_anonymous(vma))
>> +		orders = THP_ORDERS_ALL_ANON;
>> +	else
>> +		orders = BIT(HPAGE_PMD_ORDER);
>>
>>  	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>>  }
>> @@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>>  			  vm_flags_t vm_flags)
>>  {
>>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>> -	    hugepage_pmd_enabled()) {
>> +	    hugepage_enabled()) {
>>  		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
>>  			__khugepaged_enter(vma->vm_mm);
>>  	}
>> @@ -2929,7 +2935,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>>
>>  static int khugepaged_has_work(void)
>>  {
>> -	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
>> +	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
>>  }
>>
>>  static int khugepaged_wait_event(void)
>> @@ -3002,7 +3008,7 @@ static void khugepaged_wait_work(void)
>>  		return;
>>  	}
>>
>> -	if (hugepage_pmd_enabled())
>> +	if (hugepage_enabled())
>>  		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
>>  }
>>
>> @@ -3033,7 +3039,7 @@ static void set_recommended_min_free_kbytes(void)
>>  	int nr_zones = 0;
>>  	unsigned long recommended_min;
>>
>> -	if (!hugepage_pmd_enabled()) {
>> +	if (!hugepage_enabled()) {
>>  		calculate_min_free_kbytes();
>>  		goto update_wmarks;
>>  	}
>> @@ -3083,7 +3089,7 @@ int start_stop_khugepaged(void)
>>  	int err = 0;
>>
>>  	mutex_lock(&khugepaged_mutex);
>> -	if (hugepage_pmd_enabled()) {
>> +	if (hugepage_enabled()) {
>>  		if (!khugepaged_thread)
>>  			khugepaged_thread = kthread_run(khugepaged, NULL,
>>  							"khugepaged");
>> @@ -3109,7 +3115,7 @@ int start_stop_khugepaged(void)
>>  void khugepaged_min_free_kbytes_update(void)
>>  {
>>  	mutex_lock(&khugepaged_mutex);
>> -	if (hugepage_pmd_enabled() && khugepaged_thread)
>> +	if (hugepage_enabled() && khugepaged_thread)
>>  		set_recommended_min_free_kbytes();
>>  	mutex_unlock(&khugepaged_mutex);
>>  }
>> --
>> 2.53.0
>>
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
  2026-03-17 11:36   ` Lance Yang
@ 2026-03-18 19:07     ` Nico Pache
  0 siblings, 0 replies; 45+ messages in thread
From: Nico Pache @ 2026-03-18 19:07 UTC (permalink / raw)
  To: Lance Yang, baolin.wang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 3/17/26 5:36 AM, Lance Yang wrote:
> 
> On Wed, Feb 25, 2026 at 08:26:50PM -0700, Nico Pache wrote:
>> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> If any (m)THP order is enabled, we should allow khugepaged to run and
>> attempt scanning and collapsing mTHPs. For khugepaged to operate when
>> only mTHP sizes are specified in sysfs, we must modify the predicate
>> function that determines whether it ought to run.
>>
>> This function is currently called hugepage_pmd_enabled(); this patch
>> renames it to hugepage_enabled() and updates the logic to determine
>> whether any valid orders exist that would justify khugepaged running.
>>
>> We must also update collapse_allowable_orders() to check all orders if
>> the vma is anonymous and the collapse is initiated by khugepaged.
>>
>> After this patch khugepaged mTHP collapse is fully enabled.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 30 ++++++++++++++++++------------
>> 1 file changed, 18 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 388d3f2537e2..e8bfcc1d0c9a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>> 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>> }
>>
>> -static bool hugepage_pmd_enabled(void)
>> +static bool hugepage_enabled(void)
>> {
>> 	/*
>> 	 * We cover the anon, shmem and the file-backed case here; file-backed
>> 	 * hugepages, when configured in, are determined by the global control.
>> -	 * Anon pmd-sized hugepages are determined by the pmd-size control.
>> +	 * Anon hugepages are determined by their per-size mTHP controls.
>> 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
>> 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
>> 	 */
>> 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>> 	    hugepage_global_enabled())
>> 		return true;
>> -	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>> +	if (READ_ONCE(huge_anon_orders_always))
>> 		return true;
>> -	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
>> +	if (READ_ONCE(huge_anon_orders_madvise))
>> 		return true;
>> -	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>> +	if (READ_ONCE(huge_anon_orders_inherit) &&
>> 	    hugepage_global_enabled())
>> 		return true;
>> 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
>> @@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
>> static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
>> 			vm_flags_t vm_flags, bool is_khugepaged)
>> {
>> +	unsigned long orders;
>> 	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>> -	unsigned long orders = BIT(HPAGE_PMD_ORDER);
>> +
>> +	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
>> +	if (is_khugepaged && vma_is_anonymous(vma))
>> +		orders = THP_ORDERS_ALL_ANON;
>> +	else
>> +		orders = BIT(HPAGE_PMD_ORDER);
>>
>> 	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>> }
> 
> IIUC, an anonymous VMA can pass collapse_allowable_orders() even if it
> is smaller than 2MB ...
> 
> But collapse_scan_mm_slot() still scans only full PMD-sized windows:
> 
> 		hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
> 		hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
> 		if (khugepaged_scan.address > hend) {
> 			cc->progress++;
> 			continue;
> 		}
> 
> and hugepage_vma_revalidate() still requires PMD suitability:
> 
> 	/* Always check the PMD order to ensure its not shared by another VMA */
> 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> 		return SCAN_ADDRESS_RANGE;
> 
> 
>> @@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>> 			  vm_flags_t vm_flags)
>> {
>> 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>> -	    hugepage_pmd_enabled()) {
>> +	    hugepage_enabled()) {
>> 		if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
>> 			__khugepaged_enter(vma->vm_mm);
> 
> I wonder if we should also require at least one PMD-sized scan window
> here? Not a big deal, just might be good to tighten the gate a bit :)

IIUC, you are worried that we could operate on VMAs smaller than a PMD?
thp_vma_allowable_orders() should guard against that via thp_vma_suitable_order().
The revalidation also checks this in hugepage_vma_revalidate(), which is the
reason we must leave the suitable_order check in revalidate() checking
PMD_ORDER rather than the attempted collapse order.

lmk if that clears things up!

Thanks
-- Nico

> 
> Apart from that, LGTM!
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> 




* Re: [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse
  2026-03-17 11:02   ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 19:08     ` Nico Pache
  2026-03-18 19:49       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 45+ messages in thread
From: Nico Pache @ 2026-03-18 19:08 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), david
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya



On 3/17/26 5:02 AM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Feb 25, 2026 at 08:27:06PM -0700, Nico Pache wrote:
>> Now that we can collapse to mTHPs lets update the admin guide to
>> reflect these changes and provide proper guidance on how to utilize it.
>>
>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> LGTM, but maybe we should mention somewhere about mTHP's max_ptes_none
> behaviour?

IIRC we decided to strictly leave that out of the manual. I used to have it in
here. @david?

> 
> Anyway with that addressed:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> 
>> ---
>>  Documentation/admin-guide/mm/transhuge.rst | 48 +++++++++++++---------
>>  1 file changed, 28 insertions(+), 20 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index eebb1f6bbc6c..67836c683e8d 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -63,7 +63,8 @@ often.
>>  THP can be enabled system wide or restricted to certain tasks or even
>>  memory ranges inside task's address space. Unless THP is completely
>>  disabled, there is ``khugepaged`` daemon that scans memory and
>> -collapses sequences of basic pages into PMD-sized huge pages.
>> +collapses sequences of basic pages into huge pages of either PMD size
>> +or mTHP sizes, if the system is configured to do so.
>>
>>  The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
>>  interface and using madvise(2) and prctl(2) system calls.
>> @@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
>>  	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>>  	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>>
>> -khugepaged will be automatically started when PMD-sized THP is enabled
>> +khugepaged will be automatically started when any THP size is enabled
>>  (either of the per-size anon control or the top-level control are set
>>  to "always" or "madvise"), and it'll be automatically shutdown when
>> -PMD-sized THP is disabled (when both the per-size anon control and the
>> +all THP sizes are disabled (when both the per-size anon control and the
>>  top-level control are "never")
>>
>>  process THP controls
>> @@ -264,11 +265,6 @@ support the following arguments::
>>  Khugepaged controls
>>  -------------------
>>
>> -.. note::
>> -   khugepaged currently only searches for opportunities to collapse to
>> -   PMD-sized THP and no attempt is made to collapse to other THP
>> -   sizes.
>> -
>>  khugepaged runs usually at low frequency so while one may not want to
>>  invoke defrag algorithms synchronously during the page faults, it
>>  should be worth invoking defrag at least in khugepaged. However it's
>> @@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
>>  The khugepaged progress can be seen in the number of pages collapsed (note
>>  that this counter may not be an exact count of the number of pages
>>  collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
>> -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
>> -one 2M hugepage. Each may happen independently, or together, depending on
>> -the type of memory and the failures that occur. As such, this value should
>> -be interpreted roughly as a sign of progress, and counters in /proc/vmstat
>> -consulted for more accurate accounting)::
>> +being replaced by a PMD mapping, or (2) physical pages replaced by one
>> +hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
>> +or together, depending on the type of memory and the failures that occur.
>> +As such, this value should be interpreted roughly as a sign of progress,
>> +and counters in /proc/vmstat consulted for more accurate accounting)::
>>
>>  	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
>>
>> @@ -308,16 +304,19 @@ for each pass::
>>
>>  	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
>>
>> -``max_ptes_none`` specifies how many extra small pages (that are
>> -not already mapped) can be allocated when collapsing a group
>> -of small pages into one large page::
>> +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
>> +when collapsing a group of small pages into one large page::
>>
>>  	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>
>> -A higher value leads to use additional memory for programs.
>> -A lower value leads to gain less thp performance. Value of
>> -max_ptes_none can waste cpu time very little, you can
>> -ignore it.
>> +For PMD-sized THP collapse, this directly limits the number of empty pages
>> +allowed in the 2MB region. For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1)
>> +are supported. Any other value will emit a warning and no mTHP collapse
>> +will be attempted.
>> +
>> +A higher value allows more empty pages, potentially leading to more memory
>> +usage but better THP performance. A lower value is more conservative and
>> +may result in fewer THP collapses.
>>
>>  ``max_ptes_swap`` specifies how many pages can be brought in from
>>  swap when collapsing a group of pages into a transparent huge page::
>> @@ -337,6 +336,15 @@ that THP is shared. Exceeding the number would block the collapse::
>>
>>  A higher value may increase memory footprint for some workloads.
>>
>> +.. note::
>> +   For mTHP collapse, khugepaged does not support collapsing regions that
>> +   contain shared or swapped out pages, as this could lead to continuous
>> +   promotion to higher orders. The collapse will fail if any shared or
>> +   swapped PTEs are encountered during the scan.
>> +
>> +   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
>> +   and does not attempt mTHP collapses.
>> +
>>  Boot parameters
>>  ===============
>>
>> --
>> 2.53.0
>>
> 




* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-03-18 18:59     ` Nico Pache
@ 2026-03-18 19:48       ` David Hildenbrand (Arm)
  2026-03-19 15:59         ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18 19:48 UTC (permalink / raw)
  To: Nico Pache, Lorenzo Stoakes (Oracle)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 3/18/26 19:59, Nico Pache wrote:
> 
> 
> On 3/17/26 4:35 AM, Lorenzo Stoakes (Oracle) wrote:
>> On Wed, Feb 25, 2026 at 08:26:31PM -0700, Nico Pache wrote:
>>> There are cases where, if an attempted collapse fails, all subsequent
>>> orders are guaranteed to also fail. Avoid these collapse attempts by
>>> bailing out early.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>
>> With David's concern addressed:
>>
>> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>>
>>> ---
>>>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>>>  1 file changed, 34 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 1c3711ed4513..388d3f2537e2 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>>>  			ret = collapse_huge_page(mm, collapse_address, referenced,
>>>  						 unmapped, cc, mmap_locked,
>>>  						 order);
>>> -			if (ret == SCAN_SUCCEED) {
>>> +
>>> +			switch (ret) {
>>> +			/* Cases where we continue to the next collapse candidate */
>>> +			case SCAN_SUCCEED:
>>>  				collapsed += nr_pte_entries;
>>> +				fallthrough;
>>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>>>  				continue;
>>> +			/* Cases where lower orders might still succeed */
>>> +			case SCAN_LACK_REFERENCED_PAGE:
>>> +			case SCAN_EXCEED_NONE_PTE:
>>> +			case SCAN_EXCEED_SWAP_PTE:
>>> +			case SCAN_EXCEED_SHARED_PTE:
>>> +			case SCAN_PAGE_LOCK:
>>> +			case SCAN_PAGE_COUNT:
>>> +			case SCAN_PAGE_LRU:
>>> +			case SCAN_PAGE_NULL:
>>> +			case SCAN_DEL_PAGE_LRU:
>>> +			case SCAN_PTE_NON_PRESENT:
>>> +			case SCAN_PTE_UFFD_WP:
>>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
>>> +				goto next_order;
>>> +			/* Cases where no further collapse is possible */
>>> +			case SCAN_CGROUP_CHARGE_FAIL:
>>> +			case SCAN_COPY_MC:
>>> +			case SCAN_ADDRESS_RANGE:
>>> +			case SCAN_NO_PTE_TABLE:
>>> +			case SCAN_ANY_PROCESS:
>>> +			case SCAN_VMA_NULL:
>>> +			case SCAN_VMA_CHECK:
>>> +			case SCAN_SCAN_ABORT:
>>> +			case SCAN_PAGE_ANON:
>>> +			case SCAN_PMD_MAPPED:
>>> +			case SCAN_FAIL:
>>> +			default:
>>
>> Agree with david, let's spell them out please :)
> 
> I believe David is arguing for the opposite: to drop all these spelt-out
> cases and just leave the default case.
> 
> @david, is that correct, or did I misunderstand?

Either spell all out (no default) OR add a default.

I prefer to just ... use the default :)

-- 
Cheers,

David



* Re: [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse
  2026-03-18 19:08     ` Nico Pache
@ 2026-03-18 19:49       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18 19:49 UTC (permalink / raw)
  To: Nico Pache, Lorenzo Stoakes (Oracle)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya

On 3/18/26 20:08, Nico Pache wrote:
> 
> 
> On 3/17/26 5:02 AM, Lorenzo Stoakes (Oracle) wrote:
>> On Wed, Feb 25, 2026 at 08:27:06PM -0700, Nico Pache wrote:
>>> Now that we can collapse to mTHPs lets update the admin guide to
>>> reflect these changes and provide proper guidance on how to utilize it.
>>>
>>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>
>> LGTM, but maybe we should mention somewhere about mTHP's max_ptes_none
>> behaviour?
> 
> IIRC we decided to strictly leave that out of the manual. I used to have it in
> here. @david?

I think we argued in the past that we didn't want to document the weird
scaling part.

Documenting that only two values are currently supported (no scaling)
makes sense to me.

-- 
Cheers,

David



* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-03-18 19:48       ` David Hildenbrand (Arm)
@ 2026-03-19 15:59         ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 45+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:59 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On Wed, Mar 18, 2026 at 08:48:30PM +0100, David Hildenbrand (Arm) wrote:
> On 3/18/26 19:59, Nico Pache wrote:
> >
> >
> > On 3/17/26 4:35 AM, Lorenzo Stoakes (Oracle) wrote:
> >> On Wed, Feb 25, 2026 at 08:26:31PM -0700, Nico Pache wrote:
> >>> There are cases where, if an attempted collapse fails, all subsequent
> >>> orders are guaranteed to also fail. Avoid these collapse attempts by
> >>> bailing out early.
> >>>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>
> >> With David's concern addressed:
> >>
> >> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> >>
> >>> ---
> >>>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> >>>  1 file changed, 34 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 1c3711ed4513..388d3f2537e2 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >>>  			ret = collapse_huge_page(mm, collapse_address, referenced,
> >>>  						 unmapped, cc, mmap_locked,
> >>>  						 order);
> >>> -			if (ret == SCAN_SUCCEED) {
> >>> +
> >>> +			switch (ret) {
> >>> +			/* Cases where we continue to the next collapse candidate */
> >>> +			case SCAN_SUCCEED:
> >>>  				collapsed += nr_pte_entries;
> >>> +				fallthrough;
> >>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
> >>>  				continue;
> >>> +			/* Cases where lower orders might still succeed */
> >>> +			case SCAN_LACK_REFERENCED_PAGE:
> >>> +			case SCAN_EXCEED_NONE_PTE:
> >>> +			case SCAN_EXCEED_SWAP_PTE:
> >>> +			case SCAN_EXCEED_SHARED_PTE:
> >>> +			case SCAN_PAGE_LOCK:
> >>> +			case SCAN_PAGE_COUNT:
> >>> +			case SCAN_PAGE_LRU:
> >>> +			case SCAN_PAGE_NULL:
> >>> +			case SCAN_DEL_PAGE_LRU:
> >>> +			case SCAN_PTE_NON_PRESENT:
> >>> +			case SCAN_PTE_UFFD_WP:
> >>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> >>> +				goto next_order;
> >>> +			/* Cases where no further collapse is possible */
> >>> +			case SCAN_CGROUP_CHARGE_FAIL:
> >>> +			case SCAN_COPY_MC:
> >>> +			case SCAN_ADDRESS_RANGE:
> >>> +			case SCAN_NO_PTE_TABLE:
> >>> +			case SCAN_ANY_PROCESS:
> >>> +			case SCAN_VMA_NULL:
> >>> +			case SCAN_VMA_CHECK:
> >>> +			case SCAN_SCAN_ABORT:
> >>> +			case SCAN_PAGE_ANON:
> >>> +			case SCAN_PMD_MAPPED:
> >>> +			case SCAN_FAIL:
> >>> +			default:
> >>
> >> Agree with david, let's spell them out please :)
> >
> > I believe David is arguing for the opposite: to drop all these spelt-out
> > cases and just leave the default case.
> >
> > @david, is that correct, or did I misunderstand?
>
> Either spell all out (no default) OR add a default.
>
> I prefer to just ... use the default :)

I mean yup that's fine too I guess, all or nothing, something in between is
weird!

>
> --
> Cheers,
>
> David

Cheers, Lorenzo



end of thread, other threads: [~2026-03-19 15:59 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-26  3:17 [PATCH mm-unstable v15 00/13] khugepaged: mTHP support Nico Pache
2026-02-26  3:22 ` [PATCH mm-unstable v15 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for " Nico Pache
2026-03-12 20:00   ` David Hildenbrand (Arm)
2026-02-26  3:23 ` [PATCH mm-unstable v15 02/13] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
2026-03-12 20:05   ` David Hildenbrand (Arm)
2026-02-26  3:23 ` [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2026-03-12 20:32   ` David Hildenbrand (Arm)
2026-03-12 20:36     ` David Hildenbrand (Arm)
2026-03-12 20:56       ` David Hildenbrand (Arm)
2026-02-26  3:24 ` [PATCH mm-unstable v15 04/13] mm/khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
2026-02-26  3:24 ` [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2026-03-17 16:51   ` Lorenzo Stoakes (Oracle)
2026-03-17 17:16     ` Randy Dunlap
2026-02-26  3:24 ` [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2026-03-12 21:00   ` David Hildenbrand (Arm)
2026-02-26  3:25 ` [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2026-03-12 21:03   ` David Hildenbrand (Arm)
2026-03-17 17:05   ` Lorenzo Stoakes (Oracle)
2026-02-26  3:25 ` [PATCH mm-unstable v15 08/13] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
2026-03-12 21:05   ` David Hildenbrand (Arm)
2026-02-26  3:25 ` [PATCH mm-unstable v15 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
2026-03-12 21:09   ` David Hildenbrand (Arm)
2026-03-17 17:08   ` Lorenzo Stoakes (Oracle)
2026-02-26  3:26 ` [PATCH mm-unstable v15 10/13] mm/khugepaged: Introduce mTHP collapse support Nico Pache
2026-03-12 21:16   ` David Hildenbrand (Arm)
2026-03-17 21:36   ` Lorenzo Stoakes (Oracle)
2026-02-26  3:26 ` [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2026-02-26 16:26   ` Usama Arif
2026-02-26 20:47     ` Nico Pache
2026-03-12 21:19   ` David Hildenbrand (Arm)
2026-03-17 10:35   ` Lorenzo Stoakes (Oracle)
2026-03-18 18:59     ` Nico Pache
2026-03-18 19:48       ` David Hildenbrand (Arm)
2026-03-19 15:59         ` Lorenzo Stoakes (Oracle)
2026-02-26  3:26 ` [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders Nico Pache
2026-02-26 15:53   ` Usama Arif
2026-03-12 21:22   ` David Hildenbrand (Arm)
2026-03-17 10:58   ` Lorenzo Stoakes (Oracle)
2026-03-18 19:02     ` Nico Pache
2026-03-17 11:36   ` Lance Yang
2026-03-18 19:07     ` Nico Pache
2026-02-26  3:27 ` [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2026-03-17 11:02   ` Lorenzo Stoakes (Oracle)
2026-03-18 19:08     ` Nico Pache
2026-03-18 19:49       ` David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox