[PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support

Linux Trace Kernel

* [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
@ 2026-05-22 14:59 Nico Pache
  2026-05-22 14:59 ` [PATCH mm-unstable v18 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
                   ` (17 more replies)
  0 siblings, 18 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 14:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
pages that are occupied (!none/zero). After the PMD scan is done, we use
the bitmap to find the optimal mTHP sizes for the PMD range. The
restriction on max_ptes_none is removed during the scan, to make sure we
account for the whole PMD range in the bitmap. When no mTHP size is
enabled, the legacy behavior of khugepaged is maintained.
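
As an illustration only (this is not code from the series), the bitmap idea
above can be modelled in userspace C. The names pte_bitmap, count_occupied()
and pick_order() are hypothetical, and the rule that scales max_ptes_none by
the order is an assumption made for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_ORDER 9			/* assumption: 4K base pages, 2M PMD */
#define PMD_NR    (1u << PMD_ORDER)	/* 512 PTEs per PMD range */

/* One bit per PTE in the PMD range; a set bit means occupied (!none/zero). */
struct pte_bitmap {
	uint64_t words[PMD_NR / 64];
};

static void bitmap_set(struct pte_bitmap *bm, unsigned int idx)
{
	assert(idx < PMD_NR);
	bm->words[idx / 64] |= UINT64_C(1) << (idx % 64);
}

static bool bitmap_test(const struct pte_bitmap *bm, unsigned int idx)
{
	return (bm->words[idx / 64] >> (idx % 64)) & 1;
}

/* Count occupied PTEs in [start, start + nr). */
static unsigned int count_occupied(const struct pte_bitmap *bm,
				   unsigned int start, unsigned int nr)
{
	unsigned int i, weight = 0;

	for (i = start; i < start + nr; i++)
		weight += bitmap_test(bm, i);
	return weight;
}

/*
 * Hypothetical helper: return the largest enabled order whose naturally
 * aligned chunk starting at PTE index @start has few enough empty PTEs,
 * or -1 if no enabled order qualifies. The per-order limit is assumed to
 * be max_ptes_none scaled down by the order difference to the PMD order.
 */
static int pick_order(const struct pte_bitmap *bm, unsigned int start,
		      unsigned long enabled_orders, unsigned int max_ptes_none)
{
	int order;

	assert(start < PMD_NR);
	for (order = PMD_ORDER; order >= 0; order--) {
		unsigned int nr = 1u << order;
		unsigned int empty, max_none;

		if (!(enabled_orders & (1ul << order)))
			continue;
		if (start & (nr - 1))	/* chunk must be naturally aligned */
			continue;
		empty = nr - count_occupied(bm, start, nr);
		max_none = max_ptes_none >> (PMD_ORDER - order);
		if (empty <= max_none)
			return order;
	}
	return -1;
}
```

In this model, a range with only its first 16 PTEs occupied and orders 4 and
9 enabled selects order 4 when max_ptes_none is 0, and order 9 when it is
511.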

It is now also possible to collapse to mTHPs without requiring the PMD THP
size to be enabled.

We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
(i.e. 511). If any other value is specified, the kernel will emit a warning
and mTHP collapse will default to max_ptes_none=0. If an mTHP collapse is
attempted but the range contains swapped-out or shared pages, we don't
perform the collapse. These limitations are to prevent collapse "creep"
behavior, that is, constantly promoting mTHPs to the next available size,
which would occur because a collapse introduces more non-zero pages that
would satisfy the promotion condition on subsequent scans.
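
To make the creep problem concrete, here is a toy userspace model (not code
from this series). It assumes that the per-order limit is max_ptes_none
shifted right by (PMD_ORDER - order), that the smallest anonymous mTHP order
is 2, and that only the first PTEs of the range are populated. The function
name simulate_scans() is hypothetical:

```c
#include <assert.h>

#define PMD_ORDER 9	/* assumption: 4K base pages, 2M PMD */
#define MIN_ORDER 2	/* assumption: smallest anonymous mTHP order */

/*
 * Toy model: the first @occupied PTEs of a PMD range are populated. Each
 * iteration of the outer loop is one scan that tries the orders from
 * largest to smallest and collapses the first chunk that passes the scaled
 * limit. A collapse populates every PTE of the chunk, which may let a
 * larger order pass on the next scan.
 *
 * Returns the number of collapses performed; the final number of populated
 * PTEs is stored in *final_occupied.
 */
static unsigned int simulate_scans(unsigned int max_ptes_none,
				   unsigned int occupied,
				   unsigned int *final_occupied)
{
	unsigned int collapses = 0;
	int cur_order = 0;	/* order of the folio currently at offset 0 */
	int order;

	assert(occupied >= 1 && occupied <= (1u << PMD_ORDER));
	for (;;) {
		int target = -1;

		for (order = PMD_ORDER;
		     order >= MIN_ORDER && order > cur_order; order--) {
			unsigned int nr = 1u << order;
			unsigned int present = occupied < nr ? occupied : nr;
			unsigned int max_none =
				max_ptes_none >> (PMD_ORDER - order);

			if (nr - present <= max_none) {
				target = order;
				break;
			}
		}
		if (target < 0)
			break;
		/* A collapse fills every PTE of the chunk. */
		if (occupied < (1u << target))
			occupied = 1u << target;
		cur_order = target;
		collapses++;
	}
	*final_occupied = occupied;
	return collapses;
}
```

In this model, an intermediate value such as max_ptes_none=300 with only 3
populated PTEs is promoted one order per scan until the whole PMD range is
populated, while max_ptes_none=0 never grows the populated set and
max_ptes_none=511 reaches the PMD order in a single collapse.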

Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
             for arbitrary orders.
Patch 3:     Rework max_ptes_* handling into helper functions
Patch 4:     Generalize __collapse_huge_page_* for mTHP support
Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
Patch 6:     Generalize collapse_huge_page for mTHP collapse
Patch 7:     Skip collapsing mTHP to smaller orders
Patch 8-9:   Add per-order mTHP statistics and tracepoints
Patch 10:    Introduce collapse_allowable_orders helper function
Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
Patch 14:    Documentation

Testing:
- Built for x86_64, aarch64, ppc64le, and s390x
- Ran all arches on test suites provided by the kernel-tests project
- Internal testing suites: functional testing and performance testing
- selftests mm
- I created a test script that I used to push khugepaged to its limits
  while monitoring a number of stats and tracepoints. The code is
  available here[1] (Run in legacy mode for these changes and set mthp
  sizes to inherit)
  In summary, this test showed no significant regression. In some cases
  my changes had better collapse latencies and were able to scan more
  pages in the same amount of time/work, but for the most part the
  results were consistent.
- Redis testing. I did some testing with these changes along with my defer
  changes (see the follow-up post [2] for more details). We've decided to
  get the mTHP changes merged first before attempting the defer series.
- Some basic testing on 64k page size.
- Lots of general use.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

V18 Changes:
- Added RBs/Acks
- [patch 02] Guard count_memcg_folio_events with is_pmd_order() to keep
  THP_COLLAPSE_ALLOC PMD-only (Usama, Lance)
- [patch 03] Convert C++ comments to C-style; fix "none-page" terminology
  to "empty PTEs or PTEs mapping the shared zeropage"; drop unnecessary
  userfaultfd comment; add const to local max_ptes_* variables; fix
  "repect" typo (Lance, David)
- [patch 04] collapse_max_ptes_none() now returns 0 instead of -EINVAL for
  unsupported values; remove SCAN_INVALID_PTES_NONE; change return type
  from int to unsigned int and propagate to all callers; add comment above
  __collapse_huge_page_swapin explaining mTHP swap bail-out (David,
  Lorenzo, Lance, Wei Yang, Usama)
- [patch 05] Rewrite collapse_huge_page lock comment to David's suggested
  wording (David)
- [patch 11] Propagate unsigned int return type for max_ptes_none; remove
  the now-unnecessary negative return check (consequence of patch 04);
  Add optimization to the next_order goto that will prevent unnecessary
  iterations if there are no lower orders enabled (Vernon); update locking
  comment; pass VMA to mthp_collapse to improve uffd-armed detection, and
  prevent unnecessary work. (Wei)
- [patch 14] Update documentation to reflect fallback-to-0 behavior

V17: https://lore.kernel.org/all/20260511185817.686831-1-npache@redhat.com
V16: https://lore.kernel.org/all/20260419185750.260784-1-npache@redhat.com
V15: https://lore.kernel.org/all/20260226031741.230674-1-npache@redhat.com
V14: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com
V13: https://lore.kernel.org/all/20251201174627.23295-1-npache@redhat.com
V12: https://lore.kernel.org/all/20251022183717.70829-1-npache@redhat.com
V11: https://lore.kernel.org/all/20250912032810.197475-1-npache@redhat.com
V10: https://lore.kernel.org/all/20250819134205.622806-1-npache@redhat.com
V9 : https://lore.kernel.org/all/20250714003207.113275-1-npache@redhat.com
V8 : https://lore.kernel.org/all/20250702055742.102808-1-npache@redhat.com
V7 : https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com
V6 : https://lore.kernel.org/all/20250515030312.125567-1-npache@redhat.com
V5 : https://lore.kernel.org/all/20250428181218.85925-1-npache@redhat.com
V4 : https://lore.kernel.org/all/20250417000238.74567-1-npache@redhat.com
V3 : https://lore.kernel.org/all/20250414220557.35388-1-npache@redhat.com
V2 : https://lore.kernel.org/all/20250211003028.213461-1-npache@redhat.com
V1 : https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com

Baolin Wang (1):
  mm/khugepaged: run khugepaged for all orders

Dev Jain (1):
  mm/khugepaged: generalize alloc_charge_folio()

Nico Pache (12):
  mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
  mm/khugepaged: rework max_ptes_* handling with helper functions
  mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  mm/khugepaged: require collapse_huge_page to enter/exit with the lock
    dropped
  mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  mm/khugepaged: skip collapsing mTHP to smaller orders
  mm/khugepaged: add per-order mTHP collapse failure statistics
  mm/khugepaged: improve tracepoints for mTHP orders
  mm/khugepaged: introduce collapse_allowable_orders helper function
  mm/khugepaged: Introduce mTHP collapse support
  mm/khugepaged: avoid unnecessary mTHP collapse attempts
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  72 ++-
 include/linux/huge_mm.h                    |   5 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 634 ++++++++++++++++-----
 5 files changed, 584 insertions(+), 172 deletions(-)


base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346
-- 
2.54.0



* [PATCH mm-unstable v18 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
@ 2026-05-22 14:59 ` Nico Pache
  2026-05-22 14:59 ` [PATCH mm-unstable v18 02/14] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 14:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

For khugepaged to support different mTHP orders, we must generalize
hugepage_vma_revalidate() to check that the PMD is not shared by another
VMA and that the requested order is enabled.

No functional change in this patch. Also correct a comment about the
functionality of the revalidation and fix a double-space issue.
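
For illustration, the two checks can be modelled in userspace C as below.
This is a simplified sketch, not the kernel code: struct vma_model,
range_fits() and revalidate_model() are hypothetical stand-ins, and the
model ignores the anon/file and mm-state checks the real function performs.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096ul	/* assumption: 4K base pages */
#define PMD_ORDER 9
#define BIT(n) (1ul << (n))

enum result { SUCCEED, ADDRESS_RANGE, VMA_CHECK };

/* Hypothetical stand-in for the few VMA properties the model needs. */
struct vma_model {
	unsigned long start, end;	/* [start, end) in bytes */
	unsigned long enabled_orders;	/* bitmask of allowed orders */
};

/* Does the naturally aligned 2^order-page range around @addr fit the VMA? */
static bool range_fits(const struct vma_model *vma, unsigned long addr,
		       unsigned int order)
{
	unsigned long size = MODEL_PAGE_SIZE << order;
	unsigned long haddr = addr & ~(size - 1);

	return haddr >= vma->start && haddr + size <= vma->end;
}

static enum result revalidate_model(const struct vma_model *vma,
				    unsigned long addr, unsigned int order)
{
	assert(order <= PMD_ORDER);
	/* Always check the PMD-sized range, whatever order is requested. */
	if (!range_fits(vma, addr, PMD_ORDER))
		return ADDRESS_RANGE;
	/* Only the requested order needs to be enabled. */
	if (!(vma->enabled_orders & BIT(order)))
		return VMA_CHECK;
	return SUCCEED;
}
```

The point of the sketch is that the range check stays at PMD_ORDER while
the enablement check uses a single-bit mask for the requested order.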

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..53e7e4be172d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -902,12 +902,13 @@ static int collapse_find_target_node(struct collapse_control *cc)
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
- * before taking mmap_lock.
+ * after taking the mmap_lock again.
  * Returns enum scan_result value.
  */
 
 static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-		bool expect_anon, struct vm_area_struct **vmap, struct collapse_control *cc)
+		bool expect_anon, struct vm_area_struct **vmap,
+		struct collapse_control *cc, unsigned int order)
 {
 	struct vm_area_struct *vma;
 	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
@@ -920,15 +921,16 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l
 	if (!vma)
 		return SCAN_VMA_NULL;
 
+	/* Always check the PMD order to ensure it's not shared by another VMA */
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+	if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order)))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
 	 *
-	 * thp_vma_allowable_order may return true for qualified file
+	 * thp_vma_allowable_orders may return true for qualified file
 	 * vmas.
 	 */
 	if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
@@ -1121,7 +1123,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1155,7 +1158,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2857,8 +2861,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_unlocked = false;
 			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
-			if (result  != SCAN_SUCCEED) {
+							 cc, HPAGE_PMD_ORDER);
+			if (result != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
 			}
-- 
2.54.0



* [PATCH mm-unstable v18 02/14] mm/khugepaged: generalize alloc_charge_folio()
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
  2026-05-22 14:59 ` [PATCH mm-unstable v18 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2026-05-22 14:59 ` Nico Pache
  2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 14:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

From: Dev Jain <dev.jain@arm.com>

Pass order to alloc_charge_folio() and update mTHP statistics.
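
As a usage sketch, the new per-order counters could be read from userspace
roughly as follows. This assumes they appear next to the existing per-size
mTHP stats under hugepages-<size>kB/stats/; the helper names
mthp_stat_path() and mthp_stat_read() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of a per-order mTHP stat. The directory layout is
 * assumed to match the existing per-size stats (e.g. anon_fault_alloc).
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int mthp_stat_path(char *buf, size_t len, unsigned int order,
			  unsigned long page_size, const char *stat)
{
	unsigned long size_kb = (page_size << order) / 1024;
	int n;

	assert(buf && stat);
	n = snprintf(buf, len,
		     "/sys/kernel/mm/transparent_hugepage/hugepages-%lukB/stats/%s",
		     size_kb, stat);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one counter; returns -1 if the file is missing or unparsable. */
static long long mthp_stat_read(unsigned int order, unsigned long page_size,
				const char *stat)
{
	char path[128];
	long long val = -1;
	FILE *f;

	if (mthp_stat_path(path, sizeof(path), order, page_size, stat))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

For example, mthp_stat_read(4, 4096, "collapse_alloc") would look at the
64kB size on a 4K-page system and return -1 on kernels without this patch.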

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
 include/linux/huge_mm.h                    |  2 ++
 mm/huge_memory.c                           |  4 ++++
 mm/khugepaged.c                            | 20 +++++++++++++-------
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5fbc3d89bb07..c51932e6275d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -639,6 +639,14 @@ anon_fault_fallback_charge
 	instead falls back to using huge pages with lower orders or
 	small pages even though the allocation was successful.
 
+collapse_alloc
+	is incremented every time a huge page is successfully allocated for a
+	khugepaged collapse.
+
+collapse_alloc_failed
+	is incremented every time a huge page allocation fails during a
+	khugepaged collapse.
+
 zswpout
 	is incremented every time a huge page is swapped out to zswap in one
 	piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..ba7ae6808544 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,6 +128,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..345c54133c83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -685,6 +685,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -750,6 +752,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 53e7e4be172d..13d82993755f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1068,28 +1068,34 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-		struct collapse_control *cc)
+		struct collapse_control *cc, unsigned int order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (is_pmd_order(order))
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}
 
-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (is_pmd_order(order))
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
 		return SCAN_CGROUP_CHARGE_FAIL;
 	}
 
-	count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
+	if (is_pmd_order(order))
+		count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
 
 	*foliop = folio;
 	return SCAN_SUCCEED;
@@ -1118,7 +1124,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1899,7 +1905,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.54.0



* [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
  2026-05-22 14:59 ` [PATCH mm-unstable v18 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
  2026-05-22 14:59 ` [PATCH mm-unstable v18 02/14] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2026-05-22 14:59 ` Nico Pache
  2026-05-22 21:16   ` David Hildenbrand (Arm)
                     ` (2 more replies)
  2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
                   ` (14 subsequent siblings)
  17 siblings, 3 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 14:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

The following cleanup reworks all the max_ptes_* handling into helper
functions. This improves code readability and prepares for the mTHP
handling of these variables.

With these changes we abstract all the madvise_collapse() special casing
(which does not respect the sysfs tunables) away from the functions that
use them. The helpers will be used later in this series to cleanly
restrict the mTHP collapse behavior.

No functional change is intended; however, we are now only reading the
sysfs variables once per scan, whereas before these variables were being
read on each loop iteration.
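
As a sanity check of the "no functional change" claim for the none/zero
case, the old open-coded condition and the new helper-based condition can be
compared exhaustively in a small userspace model. This is only a sketch: the
function names are hypothetical and the booleans stand in for
cc->is_khugepaged and userfaultfd_armed(vma).

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_NR 512u	/* assumption: HPAGE_PMD_NR with 4K base pages */

/* Model of the helper: resolve the limit once, before the PTE loop. */
static unsigned int max_ptes_none_limit(bool is_khugepaged, bool uffd_armed,
					unsigned int sysfs_max)
{
	if (uffd_armed)
		return 0;
	if (!is_khugepaged)
		return PMD_NR;
	return sysfs_max;
}

/* Old open-coded test: may the scan continue after @n none/zero PTEs? */
static bool old_allows(bool is_khugepaged, bool uffd_armed,
		       unsigned int sysfs_max, unsigned int n)
{
	return !uffd_armed && (!is_khugepaged || n <= sysfs_max);
}

/* New test: compare the running count against the precomputed limit. */
static bool new_allows(bool is_khugepaged, bool uffd_armed,
		       unsigned int sysfs_max, unsigned int n)
{
	return n <= max_ptes_none_limit(is_khugepaged, uffd_armed, sysfs_max);
}

/* Compare both forms for every count a PMD-sized scan can reach. */
static bool conditions_agree(void)
{
	unsigned int khugepaged, uffd, max, n;

	for (khugepaged = 0; khugepaged < 2; khugepaged++)
		for (uffd = 0; uffd < 2; uffd++)
			for (max = 0; max < PMD_NR; max++)
				for (n = 1; n <= PMD_NR; n++)
					if (old_allows(khugepaged, uffd, max, n) !=
					    new_allows(khugepaged, uffd, max, n))
						return false;
	return true;
}
```

The count starts at 1 because the check runs after the counter has been
incremented for the current PTE.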

Reviewed-by: Lance Yang <lance.yang@linux.dev>
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 120 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 84 insertions(+), 36 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 13d82993755f..116f39518948 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -348,6 +348,64 @@ static bool pte_none_or_zero(pte_t pte)
 	return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
 }
 
+/**
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs or PTEs mapping
+ * the shared zeropage for the given collapse operation.
+ * @cc: The collapse control struct
+ * @vma: The vma to check for userfaultfd
+ *
+ * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
+ */
+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
+		struct vm_area_struct *vma)
+{
+	if (vma && userfaultfd_armed(vma))
+		return 0;
+	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	/* For all other cases respect the user defined maximum */
+	return khugepaged_max_ptes_none;
+}
+
+/**
+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
+ * anonymous pages for the given collapse operation.
+ * @cc: The collapse control struct
+ *
+ * Return: Maximum number of PTEs that map shared anonymous pages for the
+ * collapse operation
+ */
+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
+{
+	/*
+	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
+	 * anonymous pages.
+	 */
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_shared;
+}
+
+/**
+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
+ * maximum allowed non-present pagecache entries for the given collapse operation.
+ * @cc: The collapse control struct
+ *
+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
+ * pagecache entries for the collapse operation.
+ */
+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
+{
+	/*
+	 * For MADV_COLLAPSE, do not restrict the number of PTEs or
+	 * pagecache entries that are non-present.
+	 */
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_swap;
+}
+
 int hugepage_madvise(struct vm_area_struct *vma,
 		     vm_flags_t *vm_flags, int advice)
 {
@@ -540,6 +598,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
 		struct list_head *compound_pagelist)
 {
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
+	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr = start_addr;
@@ -551,16 +611,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out;
 			}
+			continue;
 		}
 		if (!pte_present(pteval)) {
 			result = SCAN_PTE_NON_PRESENT;
@@ -591,9 +647,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -1262,6 +1316,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		bool *lock_dropped, struct collapse_control *cc)
 {
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
+	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
+	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
@@ -1295,36 +1352,29 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
+			continue;
 		}
 		if (!pte_present(pteval)) {
-			++unmapped;
-			if (!cc->is_khugepaged ||
-			    unmapped <= khugepaged_max_ptes_swap) {
-				/*
-				 * Always be strict with uffd-wp
-				 * enabled swap entries.  Please see
-				 * comment below for pte_uffd_wp().
-				 */
-				if (pte_swp_uffd_wp_any(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
-					goto out_unmap;
-				}
-				continue;
-			} else {
+			if (++unmapped > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
+			/*
+			 * Always be strict with uffd-wp
+			 * enabled swap entries.  Please see
+			 * comment below for pte_uffd_wp().
+			 */
+			if (pte_swp_uffd_wp_any(pteval)) {
+				result = SCAN_PTE_UFFD_WP;
+				goto out_unmap;
+			}
+			continue;
 		}
 		if (pte_uffd_wp(pteval)) {
 			/*
@@ -1367,9 +1417,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		 * is shared.
 		 */
 		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out_unmap;
@@ -2324,6 +2372,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 		unsigned long addr, struct file *file, pgoff_t start,
 		struct collapse_control *cc)
 {
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
+	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, start);
@@ -2342,8 +2392,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 
 		if (xa_is_value(folio)) {
 			swap += 1 << xas_get_order(&xas);
-			if (cc->is_khugepaged &&
-			    swap > khugepaged_max_ptes_swap) {
+			if (swap > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2414,8 +2463,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 		cc->progress += HPAGE_PMD_NR;
 
 	if (result == SCAN_SUCCEED) {
-		if (cc->is_khugepaged &&
-		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - max_ptes_none) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-- 
2.54.0



* Re: [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
  2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
@ 2026-05-22 21:16   ` David Hildenbrand (Arm)
  2026-06-01 13:26   ` Lorenzo Stoakes
  2026-06-05 16:04   ` Zi Yan
  2 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 21:16 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Usama Arif


>  int hugepage_madvise(struct vm_area_struct *vma,
>  		     vm_flags_t *vm_flags, int advice)
>  {
> @@ -540,6 +598,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);


Yeah, it's good that these are all const now.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
  2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
  2026-05-22 21:16   ` David Hildenbrand (Arm)
@ 2026-06-01 13:26   ` Lorenzo Stoakes
  2026-06-05 16:04   ` Zi Yan
  2 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-01 13:26 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Fri, May 22, 2026 at 08:59:58AM -0600, Nico Pache wrote:
> The following cleanup reworks all the max_ptes_* handling into helper
> functions. This increases the code readability and will later be used to
> implement the mTHP handling of these variables.
>
> With these changes we abstract all the madvise_collapse() special casing
> (do not respect the sysctls) away from the functions that utilize them.
> And will be used later in this series to cleanly restrict the mTHP
> collapse behavior.
>
> No functional change is intended; however, we are now only reading the
> sysfs variables once per scan, whereas before these variables were being
> read on each loop iteration.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>

Had a read through, this is nice, all LGTM, so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 120 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 84 insertions(+), 36 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 13d82993755f..116f39518948 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -348,6 +348,64 @@ static bool pte_none_or_zero(pte_t pte)
>  	return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
>  }
>
> +/**
> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs or PTEs mapping
> + * the shared zeropage for the given collapse operation.
> + * @cc: The collapse control struct
> + * @vma: The vma to check for userfaultfd
> + *
> + * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
> + */
> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> +		struct vm_area_struct *vma)
> +{
> +	if (vma && userfaultfd_armed(vma))
> +		return 0;
> +	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	/* For all other cases respect the user defined maximum */
> +	return khugepaged_max_ptes_none;
> +}
> +
> +/**
> + * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> + * anonymous pages for the given collapse operation.
> + * @cc: The collapse control struct
> + *
> + * Return: Maximum number of PTEs that map shared anonymous pages for the
> + * collapse operation
> + */
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +{
> +	/*
> +	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> +	 * anonymous pages.
> +	 */
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	return khugepaged_max_ptes_shared;
> +}
> +
> +/**
> + * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> + * maximum allowed non-present pagecache entries for the given collapse operation.
> + * @cc: The collapse control struct
> + *
> + * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> + * pagecache entries for the collapse operation.
> + */
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +{
> +	/*
> +	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
> +	 * pagecache entries that are non-present.
> +	 */
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	return khugepaged_max_ptes_swap;
> +}
> +
>  int hugepage_madvise(struct vm_area_struct *vma,
>  		     vm_flags_t *vm_flags, int advice)
>  {
> @@ -540,6 +598,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
> @@ -551,16 +611,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out;
>  			}
> +			continue;
>  		}
>  		if (!pte_present(pteval)) {
>  			result = SCAN_PTE_NON_PRESENT;
> @@ -591,9 +647,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>  				goto out;
> @@ -1262,6 +1316,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -1295,36 +1352,29 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out_unmap;
>  			}
> +			continue;
>  		}
>  		if (!pte_present(pteval)) {
> -			++unmapped;
> -			if (!cc->is_khugepaged ||
> -			    unmapped <= khugepaged_max_ptes_swap) {
> -				/*
> -				 * Always be strict with uffd-wp
> -				 * enabled swap entries.  Please see
> -				 * comment below for pte_uffd_wp().
> -				 */
> -				if (pte_swp_uffd_wp_any(pteval)) {
> -					result = SCAN_PTE_UFFD_WP;
> -					goto out_unmap;
> -				}
> -				continue;
> -			} else {
> +			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>  				goto out_unmap;
>  			}
> +			/*
> +			 * Always be strict with uffd-wp
> +			 * enabled swap entries.  Please see
> +			 * comment below for pte_uffd_wp().
> +			 */
> +			if (pte_swp_uffd_wp_any(pteval)) {
> +				result = SCAN_PTE_UFFD_WP;
> +				goto out_unmap;
> +			}
> +			continue;
>  		}
>  		if (pte_uffd_wp(pteval)) {
>  			/*
> @@ -1367,9 +1417,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		 * is shared.
>  		 */
>  		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>  				goto out_unmap;
> @@ -2324,6 +2372,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		unsigned long addr, struct file *file, pgoff_t start,
>  		struct collapse_control *cc)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	XA_STATE(xas, &mapping->i_pages, start);
> @@ -2342,8 +2392,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>
>  		if (xa_is_value(folio)) {
>  			swap += 1 << xas_get_order(&xas);
> -			if (cc->is_khugepaged &&
> -			    swap > khugepaged_max_ptes_swap) {
> +			if (swap > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>  				break;
> @@ -2414,8 +2463,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		cc->progress += HPAGE_PMD_NR;
>
>  	if (result == SCAN_SUCCEED) {
> -		if (cc->is_khugepaged &&
> -		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> +		if (present < HPAGE_PMD_NR - max_ptes_none) {
>  			result = SCAN_EXCEED_NONE_PTE;
>  			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  		} else {
> --
> 2.54.0
>

* Re: [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
  2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
  2026-05-22 21:16   ` David Hildenbrand (Arm)
  2026-06-01 13:26   ` Lorenzo Stoakes
@ 2026-06-05 16:04   ` Zi Yan
  2 siblings, 0 replies; 114+ messages in thread
From: Zi Yan @ 2026-06-05 16:04 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang,
	zokeefe, Usama Arif

On 22 May 2026, at 10:59, Nico Pache wrote:

> The following cleanup reworks all the max_ptes_* handling into helper
> functions. This increases the code readability and will later be used to
> implement the mTHP handling of these variables.
>
> With these changes we abstract all the madvise_collapse() special casing
> (do not respect the sysctls) away from the functions that utilize them.
> And will be used later in this series to cleanly restrict the mTHP
> collapse behavior.
>
> No functional change is intended; however, we are now only reading the
> sysfs variables once per scan, whereas before these variables were being
> read on each loop iteration.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 120 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 84 insertions(+), 36 deletions(-)
>

userfaultfd_armed() and cc->is_khugepaged check results are now folded
into collapse_max_ptes_*() return values, using 0 and HPAGE_PMD_NR.
It simplifies the caller code. LGTM.
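
As a rough illustration of that folding, here is a minimal user-space sketch
(not kernel code). It models only the none/zero PTE check; struct scan_ctx
and the function names are hypothetical stand-ins, and HPAGE_PMD_NR is
assumed to be 512 (4K base pages). It compares the pre-patch per-PTE
condition against the post-patch "limit computed once" form.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512u /* assumption: 4K base pages, 2M PMD */

/* Hypothetical stand-in for the inputs the kernel helper consults. */
struct scan_ctx {
	bool is_khugepaged;                /* false models MADV_COLLAPSE */
	bool uffd_armed;                   /* models userfaultfd_armed(vma) */
	unsigned int sysctl_max_ptes_none; /* models khugepaged_max_ptes_none */
};

/* Pre-patch shape: every condition is re-evaluated for each PTE. */
static bool old_none_pte_allowed(const struct scan_ctx *c,
				 unsigned int none_or_zero)
{
	return !c->uffd_armed &&
	       (!c->is_khugepaged ||
		none_or_zero <= c->sysctl_max_ptes_none);
}

/* Post-patch shape: the conditions collapse into one limit. */
static unsigned int max_ptes_none_limit(const struct scan_ctx *c)
{
	if (c->uffd_armed)
		return 0;               /* any none/zero PTE fails the scan */
	if (!c->is_khugepaged)
		return HPAGE_PMD_NR;    /* the counter can never exceed this */
	return c->sysctl_max_ptes_none;
}

static bool new_none_pte_allowed(const struct scan_ctx *c,
				 unsigned int none_or_zero)
{
	return none_or_zero <= max_ptes_none_limit(c);
}

/* True if both shapes agree for every count a PMD-sized scan can reach. */
static bool checks_agree(const struct scan_ctx *c)
{
	for (unsigned int n = 1; n <= HPAGE_PMD_NR; n++)
		if (old_none_pte_allowed(c, n) != new_none_pte_allowed(c, n))
			return false;
	return true;
}
```

The two sentinels work because the counter is incremented before the
comparison: a limit of 0 rejects the first none/zero PTE, and a limit of
HPAGE_PMD_NR cannot be exceeded within a single PMD range.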

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

* [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (2 preceding siblings ...)
  2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
@ 2026-05-22 14:59 ` Nico Pache
  2026-05-22 21:24   ` David Hildenbrand (Arm)
                     ` (2 more replies)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Nico Pache
                   ` (13 subsequent siblings)
  17 siblings, 3 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 14:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

Generalize the order of the __collapse_huge_page_* and collapse_max_*
functions to support future mTHP collapse.

The current mechanism for determining collapse with the
khugepaged_max_ptes_none value is not designed with mTHP in mind. This
raises a key design issue: if we support user defined max_ptes_none values
(even those scaled by order), a collapse of a lower order can introduce
a feedback loop, or "creep", when max_ptes_none is set to a value greater
than HPAGE_PMD_NR / 2. [1]

With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.
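
The feedback loop can be sketched in user space as follows. This is an
illustration only, under two assumptions that are not taken from this
patch: the per-order limit is scaled as
max_ptes_none >> (HPAGE_PMD_ORDER - order), and HPAGE_PMD_ORDER is 9. It
models the worst case, where the half of the next-order range that was
just collapsed is fully populated and the other half is empty. All names
are hypothetical.

```c
#include <assert.h>

#define HPAGE_PMD_ORDER 9u /* assumption: 4K base pages, 2M PMD */

/* Assumed scaling rule for a user-defined max_ptes_none (illustrative). */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/*
 * Start from a range that was just collapsed to start_order and return
 * the order that repeated rescans would eventually reach.
 */
static unsigned int creep_final_order(unsigned int max_ptes_none,
				      unsigned int start_order)
{
	unsigned int order = start_order;

	while (order < HPAGE_PMD_ORDER) {
		const unsigned int next = order + 1;
		/* One half is populated, so the other half is all none. */
		const unsigned int none = 1u << order;

		if (none > scaled_max_ptes_none(max_ptes_none, next))
			break;
		order = next; /* the collapse populates the whole range */
	}
	return order;
}
```

Under these assumptions, max_ptes_none=300 lets an order-2 collapse grow
step by step to a full PMD, while max_ptes_none=255 stops at order 2.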

To fix this issue, introduce a helper function that will limit mTHP
collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
This effectively supports two modes: [2]

- max_ptes_none=0: Never collapse if an empty PTE or a PTE that maps the
  shared zeropage is encountered. Consequently, no memory bloat.
- max_ptes_none=511 (with 4K pages): Always collapse to the highest
  available mTHP order.

This removes the possibility of "creep", and a warning will be emitted if
any non-supported max_ptes_none value is configured with mTHP enabled.
Any intermediate value will default mTHP collapse to max_ptes_none=0.

mTHP collapse will not honor the khugepaged_max_ptes_shared or
khugepaged_max_ptes_swap parameters, and will fail if it encounters a
shared or swapped entry.

No functional changes in this patch; however it defines future behavior
for mTHP collapse.

[1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
[2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com

Reviewed-by: Lance Yang <lance.yang@linux.dev>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 88 insertions(+), 33 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 116f39518948..e98ba5b15163 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
  * the shared zeropage for the given collapse operation.
  * @cc: The collapse control struct
  * @vma: The vma to check for userfaultfd
+ * @order: The folio order being collapsed to
  *
  * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
  */
 static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *vma, unsigned int order)
 {
+	unsigned int max_ptes_none = khugepaged_max_ptes_none;
+
 	if (vma && userfaultfd_armed(vma))
 		return 0;
 	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
-	/* For all other cases respect the user defined maximum */
-	return khugepaged_max_ptes_none;
+	/* for PMD collapse, respect the user defined maximum */
+	if (is_pmd_order(order))
+		return max_ptes_none;
+	/*
+	 * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
+	 * scale the maximum number of PTEs to the order of the collapse.
+	 */
+	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
+		return (1 << order) - 1;
+	if (!max_ptes_none)
+		return 0;
+	/*
+	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
+	 * emit a warning and return 0.
+	 */
+	pr_warn_once("mTHP collapse does not support max_ptes_none values"
+		     " other than 0 or %u, defaulting to 0.\n",
+		     KHUGEPAGED_MAX_PTES_LIMIT);
+	return 0;
 }
 
 /**
  * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
  * anonymous pages for the given collapse operation.
  * @cc: The collapse control struct
+ * @order: The folio order being collapsed to
  *
  * Return: Maximum number of PTEs that map shared anonymous pages for the
  * collapse operation
  */
-static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
+		unsigned int order)
 {
 	/*
 	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
@@ -384,6 +406,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
 	 */
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
+	/*
+	 * for mTHP collapse do not allow collapsing anonymous memory pages that
+	 * are shared between processes.
+	 */
+	if (!is_pmd_order(order))
+		return 0;
+	/* for PMD collapse, respect the user defined maximum */
 	return khugepaged_max_ptes_shared;
 }
 
@@ -391,11 +420,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
  * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
  * maximum allowed non-present pagecache entries for the given collapse operation.
  * @cc: The collapse control struct
+ * @order: The folio order being collapsed to
  *
  * Return: Maximum number of non-present PTEs or the maximum allowed non-present
  * pagecache entries for the collapse operation.
  */
-static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
+		unsigned int order)
 {
 	/*
 	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
@@ -403,6 +434,10 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
 	 */
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
+	/* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
+	if (!is_pmd_order(order))
+		return 0;
+	/* for PMD collapse, respect the user defined maximum */
 	return khugepaged_max_ptes_swap;
 }
 
@@ -596,10 +631,11 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
 
 static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
-		struct list_head *compound_pagelist)
+		unsigned int order, struct list_head *compound_pagelist)
 {
-	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
-	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
+	const unsigned long nr_pages = 1UL << order;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr = start_addr;
@@ -607,7 +643,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	int none_or_zero = 0, shared = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
@@ -740,18 +776,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 }
 
 static void __collapse_huge_page_copy_succeeded(pte_t *pte,
-						struct vm_area_struct *vma,
-						unsigned long address,
-						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+		struct vm_area_struct *vma, unsigned long address,
+		spinlock_t *ptl, unsigned int order,
+		struct list_head *compound_pagelist)
 {
-	unsigned long end = address + HPAGE_PMD_SIZE;
+	const unsigned long nr_pages = 1UL << order;
+	unsigned long end = address + (PAGE_SIZE << order);
 	struct folio *src, *tmp;
 	pte_t pteval;
 	pte_t *_pte;
 	unsigned int nr_ptes;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
 	     address += nr_ptes * PAGE_SIZE) {
 		nr_ptes = 1;
 		pteval = ptep_get(_pte);
@@ -804,11 +840,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 }
 
 static void __collapse_huge_page_copy_failed(pte_t *pte,
-					     pmd_t *pmd,
-					     pmd_t orig_pmd,
-					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
+		unsigned int order, struct list_head *compound_pagelist)
 {
+	const unsigned long nr_pages = 1UL << order;
 	spinlock_t *pmd_ptl;
 
 	/*
@@ -824,7 +859,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
 }
 
 /*
@@ -844,16 +879,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
  */
 static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
-		unsigned long address, spinlock_t *ptl,
+		unsigned long address, spinlock_t *ptl, unsigned int order,
 		struct list_head *compound_pagelist)
 {
+	const unsigned long nr_pages = 1UL << order;
 	unsigned int i;
 	enum scan_result result = SCAN_SUCCEED;
 
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -872,10 +908,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    order, compound_pagelist);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 order, compound_pagelist);
 
 	return result;
 }
@@ -1042,16 +1078,20 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
  *
+ * For mTHP orders the function bails on the first swap entry, because
+ * faulting pages back in during collapse could re-populate PTEs that
+ * push a later scan over the threshold for a higher-order collapse.
+ *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
 static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
-		int referenced)
+		struct vm_area_struct *vma, unsigned long start_addr,
+		pmd_t *pmd, int referenced, unsigned int order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
 	enum scan_result result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1083,6 +1123,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 		    pte_present(vmf.orig_pte))
 			continue;
 
+		/*
+		 * TODO: Support swapin without leading to further mTHP
+		 * collapses. Currently bringing in new pages via swapin may
+		 * cause a future higher order collapse on a rescan of the same
+		 * range.
+		 */
+		if (!is_pmd_order(order)) {
+			pte_unmap(pte);
+			mmap_read_unlock(mm);
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1203,7 +1256,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+						     referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1251,6 +1304,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      HPAGE_PMD_ORDER,
 						      &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
@@ -1281,6 +1335,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
+					   HPAGE_PMD_ORDER,
 					   &compound_pagelist);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
@@ -1316,9 +1371,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		bool *lock_dropped, struct collapse_control *cc)
 {
-	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
-	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
-	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
+	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
+	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
@@ -2372,8 +2427,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 		unsigned long addr, struct file *file, pgoff_t start,
 		struct collapse_control *cc)
 {
-	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
-	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
+	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, start);
-- 
2.54.0


* Re: [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2026-05-22 21:24   ` David Hildenbrand (Arm)
  2026-05-26 14:39   ` Nico Pache
  2026-06-01 14:04   ` Lorenzo Stoakes
  2 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 21:24 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 5/22/26 16:59, Nico Pache wrote:
> Generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
> 
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_ptes_none values
> (even those scaled by order), a collapse of a lower order can introduce
> a feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
> 
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
> 
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
> 
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>   that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>   available mTHP order.
> 
> This removes the possibility of "creep", and a warning will be emitted if
> any non-supported max_ptes_none value is configured with mTHP enabled.
> Any intermediate value will default mTHP collapse to max_ptes_none=0.
> 
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
> 
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
> 
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> 
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 88 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 116f39518948..e98ba5b15163 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
>   * the shared zeropage for the given collapse operation.
>   * @cc: The collapse control struct
>   * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
>   */
>  static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> -		struct vm_area_struct *vma)
> +		struct vm_area_struct *vma, unsigned int order)
>  {
> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;

Can be const, right?

> +
>  	if (vma && userfaultfd_armed(vma))
>  		return 0;
>  	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> -	/* For all other cases respect the user defined maximum */
> -	return khugepaged_max_ptes_none;
> +	/* for PMD collapse, respect the user defined maximum */
> +	if (is_pmd_order(order))
> +		return max_ptes_none;
> +	/*
> +	 * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * scale the maximum number of PTEs to the order of the collapse.
> +	 */
> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> +		return (1 << order) - 1;
> +	if (!max_ptes_none)
> +		return 0;
> +	/*
> +	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * emit a warning and return 0.
> +	 */
> +	pr_warn_once("mTHP collapse does not support max_ptes_none values"
> +		     " other than 0 or %u, defaulting to 0.\n",
> +		     KHUGEPAGED_MAX_PTES_LIMIT);
> +	return 0;

This might read slightly clearer as

/*
 * For mTHP ...
 */
if (max_ptes_none)
	pr_warn_once(...);
return 0;

IOW, have a single "return 0" statement here and only special-case when to warn.
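
For reference, a compilable user-space sketch of the helper with that
single exit might look like the block below. It is a model, not the
kernel function: the boolean parameters stand in for cc->is_khugepaged
and userfaultfd_armed(vma), warn_once_unsupported() stands in for
pr_warn_once(), warn_count exists only so the warn path can be observed,
and KHUGEPAGED_MAX_PTES_LIMIT is assumed to equal HPAGE_PMD_NR - 1 with
HPAGE_PMD_ORDER assumed to be 9.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER 9u                            /* assumption */
#define HPAGE_PMD_NR (1u << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MAX_PTES_LIMIT (HPAGE_PMD_NR - 1)  /* assumption */

static unsigned int warn_count; /* hypothetical: observes the warn path */

/* Stand-in for pr_warn_once(): prints and counts at most once. */
static void warn_once_unsupported(void)
{
	static bool warned;

	if (warned)
		return;
	warned = true;
	warn_count++;
	fprintf(stderr,
		"mTHP collapse does not support max_ptes_none values other than 0 or %u, defaulting to 0.\n",
		KHUGEPAGED_MAX_PTES_LIMIT);
}

static unsigned int model_max_ptes_none(bool is_khugepaged, bool uffd_armed,
					unsigned int sysctl_max_ptes_none,
					unsigned int order)
{
	const unsigned int max_ptes_none = sysctl_max_ptes_none;

	if (uffd_armed)
		return 0;
	/* MADV_COLLAPSE: allow any empty/shared zeropage PTEs */
	if (!is_khugepaged)
		return HPAGE_PMD_NR;
	/* PMD collapse: respect the user defined maximum */
	if (order == HPAGE_PMD_ORDER)
		return max_ptes_none;
	/* mTHP collapse at the limit: scale to the collapse order */
	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
		return (1u << order) - 1;
	/* Everything else maps to 0; only intermediate values warn. */
	if (max_ptes_none)
		warn_once_unsupported();
	return 0;
}
```

With this shape the explicit "if (!max_ptes_none) return 0;" branch goes
away, since 0 falls through to the shared return without warning.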

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

* Re: [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
  2026-05-22 21:24   ` David Hildenbrand (Arm)
@ 2026-05-26 14:39   ` Nico Pache
  2026-06-01 14:04   ` Lorenzo Stoakes
  2 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-26 14:39 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe



On 5/22/26 8:59 AM, Nico Pache wrote:
> Generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
> 
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_ptes_none values
> (even those scaled by order), a collapse of a lower order can introduce
> a feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
> 
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
> 
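
To make the creep concrete, here is a simplified userspace model. It assumes
a hypothetical scaling rule, max_ptes_none >> (PMD order - order), which is
not what this patch implements, and it assumes each successful collapse
leaves the next order's range exactly half populated. All names are made up
for the sketch.

```c
/* Assumed PMD order for 4k pages; hypothetical stand-in for HPAGE_PMD_ORDER. */
#define MODEL_PMD_ORDER 9u

/*
 * Hypothetical scaling rule (an assumption, not this patch's behavior):
 * shrink the PMD-level max_ptes_none to the candidate order.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (MODEL_PMD_ORDER - order);
}

/*
 * Start with one fully populated, naturally aligned block of start_order
 * in an otherwise empty PMD range, then rescan until no larger order
 * qualifies. Returns the order finally reached.
 */
static unsigned int creep_final_order(unsigned int max_ptes_none,
				      unsigned int start_order)
{
	unsigned int order = start_order;

	while (order < MODEL_PMD_ORDER) {
		/* The next order's range is half populated, half empty. */
		unsigned int none = 1u << order;

		if (none > scaled_max_ptes_none(max_ptes_none, order + 1))
			break;
		order++;	/* collapse succeeded and filled the range */
	}
	return order;
}
```

In this model a value above HPAGE_PMD_NR / 2, such as 384, lets an order-2
block creep all the way to the PMD order over successive scans, while a
value such as 128 stops at the starting order.
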
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
> 
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>   that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>   available mTHP order.
> 
> This removes the possibility of "creep", and a warning will be emitted if
> any non-supported max_ptes_none value is configured with mTHP enabled.
> Any intermediate value will default mTHP collapse to max_ptes_none=0.
> 
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
> 
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.

Hi Andrew,

Can you please append the following fixup to this commit?

Changes are described below but are very minor nits!

Thank you!

commit e3985903daa4fa77a27a632c1c2fa4c23aac9ad5
Author: Nico Pache <npache@redhat.com>
Date:   Tue May 26 05:40:03 2026 -0600

    fixup: cleanup collapse_max_ptes_none
    
    make max_ptes_none a const and cleanup the pr_warn_once
    
    Acked-by: David Hildenbrand (arm) <david@kernel.org>
    Signed-off-by: Nico Pache <npache@redhat.com>

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e98ba5b15163..4c7e404b0f8d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -360,7 +360,7 @@ static bool pte_none_or_zero(pte_t pte)
 static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
 		struct vm_area_struct *vma, unsigned int order)
 {
-	unsigned int max_ptes_none = khugepaged_max_ptes_none;
+	const unsigned int max_ptes_none = khugepaged_max_ptes_none;
 
 	if (vma && userfaultfd_armed(vma))
 		return 0;
@@ -376,14 +376,13 @@ static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
 	 */
 	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
 		return (1 << order) - 1;
-	if (!max_ptes_none)
-		return 0;
 	/*
 	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
 	 * emit a warning and return 0.
 	 */
-	pr_warn_once("mTHP collapse does not support max_ptes_none values"
-		     " other than 0 or %u, defaulting to 0.\n",
+	if (max_ptes_none)
+		pr_warn_once("mTHP collapse does not support max_ptes_none"
+		     " values other than 0 or %u, defaulting to 0.\n",
 		     KHUGEPAGED_MAX_PTES_LIMIT);
 	return 0;
 }


> 
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> 
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 88 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 116f39518948..e98ba5b15163 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
>   * the shared zeropage for the given collapse operation.
>   * @cc: The collapse control struct
>   * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
>   */
>  static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> -		struct vm_area_struct *vma)
> +		struct vm_area_struct *vma, unsigned int order)
>  {
> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
> +
>  	if (vma && userfaultfd_armed(vma))
>  		return 0;
>  	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> -	/* For all other cases respect the user defined maximum */
> -	return khugepaged_max_ptes_none;
> +	/* for PMD collapse, respect the user defined maximum */
> +	if (is_pmd_order(order))
> +		return max_ptes_none;
> +	/*
> +	 * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * scale the maximum number of PTEs to the order of the collapse.
> +	 */
> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> +		return (1 << order) - 1;
> +	if (!max_ptes_none)
> +		return 0;
> +	/*
> +	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * emit a warning and return 0.
> +	 */
> +	pr_warn_once("mTHP collapse does not support max_ptes_none values"
> +		     " other than 0 or %u, defaulting to 0.\n",
> +		     KHUGEPAGED_MAX_PTES_LIMIT);
> +	return 0;
>  }
>  
>  /**
>   * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>   * anonymous pages for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of PTEs that map shared anonymous pages for the
>   * collapse operation
>   */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	/*
>  	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> @@ -384,6 +406,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>  	 */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	/*
> +	 * for mTHP collapse do not allow collapsing anonymous memory pages that
> +	 * are shared between processes.
> +	 */
> +	if (!is_pmd_order(order))
> +		return 0;
> +	/* for PMD collapse, respect the user defined maximum */
>  	return khugepaged_max_ptes_shared;
>  }
>  
> @@ -391,11 +420,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>   * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>   * maximum allowed non-present pagecache entries for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>   * pagecache entries for the collapse operation.
>   */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	/*
>  	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
> @@ -403,6 +434,10 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>  	 */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	/* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
> +	if (!is_pmd_order(order))
> +		return 0;
> +	/* for PMD collapse, respect the user defined maximum */
>  	return khugepaged_max_ptes_swap;
>  }
>  
> @@ -596,10 +631,11 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  
>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> -		struct list_head *compound_pagelist)
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> +	const unsigned long nr_pages = 1UL << order;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
> @@ -607,7 +643,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	int none_or_zero = 0, shared = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + nr_pages;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> @@ -740,18 +776,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  }
>  
>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> -						struct vm_area_struct *vma,
> -						unsigned long address,
> -						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +		struct vm_area_struct *vma, unsigned long address,
> +		spinlock_t *ptl, unsigned int order,
> +		struct list_head *compound_pagelist)
>  {
> -	unsigned long end = address + HPAGE_PMD_SIZE;
> +	const unsigned long nr_pages = 1UL << order;
> +	unsigned long end = address + (PAGE_SIZE << order);
>  	struct folio *src, *tmp;
>  	pte_t pteval;
>  	pte_t *_pte;
>  	unsigned int nr_ptes;
>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>  	     address += nr_ptes * PAGE_SIZE) {
>  		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
> @@ -804,11 +840,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  }
>  
>  static void __collapse_huge_page_copy_failed(pte_t *pte,
> -					     pmd_t *pmd,
> -					     pmd_t orig_pmd,
> -					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	spinlock_t *pmd_ptl;
>  
>  	/*
> @@ -824,7 +859,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>  }
>  
>  /*
> @@ -844,16 +879,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   */
>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> -		unsigned long address, spinlock_t *ptl,
> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	unsigned int i;
>  	enum scan_result result = SCAN_SUCCEED;
>  
>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -872,10 +908,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>  
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    order, compound_pagelist);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 order, compound_pagelist);
>  
>  	return result;
>  }
> @@ -1042,16 +1078,20 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
>   *
> + * For mTHP orders the function bails on the first swap entry, because
> + * faulting pages back in during collapse could re-populate PTEs that
> + * push a later scan over the threshold for a higher-order collapse.
> + *
>   * Called and returns without pte mapped or spinlocks held.
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>   */
>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> -		int referenced)
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		pmd_t *pmd, int referenced, unsigned int order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>  	enum scan_result result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1083,6 +1123,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		    pte_present(vmf.orig_pte))
>  			continue;
>  
> +		/*
> +		 * TODO: Support swapin without leading to further mTHP
> +		 * collapses. Currently bringing in new pages via swapin may
> +		 * cause a future higher order collapse on a rescan of the same
> +		 * range.
> +		 */
> +		if (!is_pmd_order(order)) {
> +			pte_unmap(pte);
> +			mmap_read_unlock(mm);
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}
> +
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
>  		ret = do_swap_page(&vmf);
> @@ -1203,7 +1256,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
>  		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced);
> +						     referenced, HPAGE_PMD_ORDER);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1251,6 +1304,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> +						      HPAGE_PMD_ORDER,
>  						      &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
> @@ -1281,6 +1335,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>  					   vma, address, pte_ptl,
> +					   HPAGE_PMD_ORDER,
>  					   &compound_pagelist);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
> @@ -1316,9 +1371,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -2372,8 +2427,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		unsigned long addr, struct file *file, pgoff_t start,
>  		struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	XA_STATE(xas, &mapping->i_pages, start);



* Re: [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
  2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
  2026-05-22 21:24   ` David Hildenbrand (Arm)
  2026-05-26 14:39   ` Nico Pache
@ 2026-06-01 14:04   ` Lorenzo Stoakes
  2 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-01 14:04 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Fri, May 22, 2026 at 08:59:59AM -0600, Nico Pache wrote:
> Generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
>
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_ptes_none values
> (even those scaled by order), a collapse of a lower order can introduce
> a feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
>
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>   that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>   available mTHP order.
>
> This removes the possibility of "creep", and a warning will be emitted if
> any non-supported max_ptes_none value is configured with mTHP enabled.
> Any intermediate value will default mTHP collapse to max_ptes_none=0.
>
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
>
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
>
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

One small nit below, and with the fixpatch assumed applied, LGTM so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 88 insertions(+), 33 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 116f39518948..e98ba5b15163 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
>   * the shared zeropage for the given collapse operation.
>   * @cc: The collapse control struct
>   * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
>   */
>  static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> -		struct vm_area_struct *vma)
> +		struct vm_area_struct *vma, unsigned int order)
>  {
> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
> +
>  	if (vma && userfaultfd_armed(vma))
>  		return 0;
>  	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> -	/* For all other cases respect the user defined maximum */
> -	return khugepaged_max_ptes_none;
> +	/* for PMD collapse, respect the user defined maximum */
> +	if (is_pmd_order(order))
> +		return max_ptes_none;
> +	/*
> +	 * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * scale the maximum number of PTEs to the order of the collapse.
> +	 */
> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> +		return (1 << order) - 1;
> +	if (!max_ptes_none)
> +		return 0;
> +	/*
> +	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * emit a warning and return 0.
> +	 */
> +	pr_warn_once("mTHP collapse does not support max_ptes_none values"
> +		     " other than 0 or %u, defaulting to 0.\n",
> +		     KHUGEPAGED_MAX_PTES_LIMIT);
> +	return 0;
>  }
>
>  /**
>   * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>   * anonymous pages for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of PTEs that map shared anonymous pages for the
>   * collapse operation
>   */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	/*
>  	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> @@ -384,6 +406,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>  	 */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	/*
> +	 * for mTHP collapse do not allow collapsing anonymous memory pages that
> +	 * are shared between processes.
> +	 */
> +	if (!is_pmd_order(order))
> +		return 0;
> +	/* for PMD collapse, respect the user defined maximum */
>  	return khugepaged_max_ptes_shared;
>  }
>
> @@ -391,11 +420,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>   * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>   * maximum allowed non-present pagecache entries for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>   * pagecache entries for the collapse operation.
>   */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	/*
>  	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
> @@ -403,6 +434,10 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>  	 */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	/* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
> +	if (!is_pmd_order(order))
> +		return 0;
> +	/* for PMD collapse, respect the user defined maximum */
>  	return khugepaged_max_ptes_swap;
>  }
>
> @@ -596,10 +631,11 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>
>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> -		struct list_head *compound_pagelist)
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> +	const unsigned long nr_pages = 1UL << order;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
> @@ -607,7 +643,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	int none_or_zero = 0, shared = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + nr_pages;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> @@ -740,18 +776,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  }
>
>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> -						struct vm_area_struct *vma,
> -						unsigned long address,
> -						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +		struct vm_area_struct *vma, unsigned long address,
> +		spinlock_t *ptl, unsigned int order,
> +		struct list_head *compound_pagelist)
>  {
> -	unsigned long end = address + HPAGE_PMD_SIZE;
> +	const unsigned long nr_pages = 1UL << order;
> +	unsigned long end = address + (PAGE_SIZE << order);

Nit: this could be address + PAGE_SIZE * nr_pages, which would be a little
nicer as you already did the << order operation above.
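
A quick userspace sketch of the two spellings, just to show they compute the
same end address. MODEL_PAGE_SIZE and both helper names are hypothetical, and
a 4k page size is assumed.

```c
/* Assumed 4k page size; hypothetical stand-in for PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096UL

/* End address as written in the patch: shift the page size by the order. */
static unsigned long end_by_shift(unsigned long address, unsigned int order)
{
	return address + (MODEL_PAGE_SIZE << order);
}

/* End address as suggested: reuse the already computed nr_pages. */
static unsigned long end_by_nr_pages(unsigned long address, unsigned int order)
{
	const unsigned long nr_pages = 1UL << order;

	return address + MODEL_PAGE_SIZE * nr_pages;
}
```

Both forms are equivalent for any order that does not overflow unsigned long;
the second just avoids repeating the shift.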

>  	struct folio *src, *tmp;
>  	pte_t pteval;
>  	pte_t *_pte;
>  	unsigned int nr_ptes;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>  	     address += nr_ptes * PAGE_SIZE) {
>  		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
> @@ -804,11 +840,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  }
>
>  static void __collapse_huge_page_copy_failed(pte_t *pte,
> -					     pmd_t *pmd,
> -					     pmd_t orig_pmd,
> -					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	spinlock_t *pmd_ptl;
>
>  	/*
> @@ -824,7 +859,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>  }
>
>  /*
> @@ -844,16 +879,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   */
>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> -		unsigned long address, spinlock_t *ptl,
> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	unsigned int i;
>  	enum scan_result result = SCAN_SUCCEED;
>
>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -872,10 +908,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    order, compound_pagelist);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 order, compound_pagelist);
>
>  	return result;
>  }
> @@ -1042,16 +1078,20 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
>   *
> + * For mTHP orders the function bails on the first swap entry, because
> + * faulting pages back in during collapse could re-populate PTEs that
> + * push a later scan over the threshold for a higher-order collapse.
> + *
>   * Called and returns without pte mapped or spinlocks held.
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>   */
>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> -		int referenced)
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		pmd_t *pmd, int referenced, unsigned int order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>  	enum scan_result result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1083,6 +1123,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		    pte_present(vmf.orig_pte))
>  			continue;
>
> +		/*
> +		 * TODO: Support swapin without leading to further mTHP
> +		 * collapses. Currently bringing in new pages via swapin may
> +		 * cause a future higher order collapse on a rescan of the same
> +		 * range.
> +		 */
> +		if (!is_pmd_order(order)) {
> +			pte_unmap(pte);
> +			mmap_read_unlock(mm);
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}
> +
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
>  		ret = do_swap_page(&vmf);
> @@ -1203,7 +1256,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
>  		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced);
> +						     referenced, HPAGE_PMD_ORDER);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1251,6 +1304,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> +						      HPAGE_PMD_ORDER,
>  						      &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
> @@ -1281,6 +1335,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>  					   vma, address, pte_ptl,
> +					   HPAGE_PMD_ORDER,
>  					   &compound_pagelist);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
> @@ -1316,9 +1371,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -2372,8 +2427,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		unsigned long addr, struct file *file, pgoff_t start,
>  		struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	XA_STATE(xas, &mapping->i_pages, start);
> --
> 2.54.0
>

* [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (3 preceding siblings ...)
  2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-06-01 14:07   ` Lorenzo Stoakes
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

Currently, collapse_huge_page() must be entered with the mmap_lock held
for read, and it exits with the lock dropped. This patch moves the unlock
into the caller, so that collapse_huge_page() is always entered and exited
with the lock dropped.

Future patches need this contract: during mTHP collapse the lock may
already have been dropped, and we do not want to check for that
conditionally by passing the lock_dropped variable through.

No functional change is expected as one of the first things the
collapse_huge_page function does is drop this lock before allocating the
hugepage.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e98ba5b15163..fab35d318641 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
 	return SCAN_SUCCEED;
 }
 
+/*
+ * collapse_huge_page expects the mmap_lock to be unlocked before entering and
+ * will always return with the lock unlocked, to avoid holding the mmap_lock
+ * while allocating a THP, as that could trigger direct reclaim/compaction.
+ * Note that the VMA must be rechecked after grabbing the mmap_lock again.
+ */
 static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		int referenced, int unmapped, struct collapse_control *cc)
 {
@@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	/*
-	 * Before allocating the hugepage, release the mmap_lock read lock.
-	 * The allocation can take potentially a long time if it involves
-	 * sync compaction, and we do not need to hold the mmap_lock during
-	 * that. We will recheck the vma after taking it again in write mode.
-	 */
-	mmap_read_unlock(mm);
-
 	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
@@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
+		/* collapse_huge_page expects the lock to be dropped before calling */
+		mmap_read_unlock(mm);
 		result = collapse_huge_page(mm, start_addr, referenced,
 					    unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-- 
2.54.0
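
The locking contract described in this patch can be sketched in plain
userspace C. This is only an illustration under stated assumptions:
struct toy_mm, toy_read_lock(), toy_read_unlock(), toy_collapse() and
toy_scan() are hypothetical names, the "lock" is a bare holder counter
rather than a real rwsem, and the real caller only drops the lock when the
scan succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm->mmap_lock: counts read holders only. */
struct toy_mm {
	int read_holders;
};

static void toy_read_lock(struct toy_mm *mm)
{
	mm->read_holders++;
}

static void toy_read_unlock(struct toy_mm *mm)
{
	assert(mm->read_holders > 0);
	mm->read_holders--;
}

/* Callee: entered with the lock dropped, returns with the lock dropped. */
static int toy_collapse(struct toy_mm *mm)
{
	assert(mm->read_holders == 0);	/* entry contract */

	/* A slow allocation would happen here, with no lock held. */

	toy_read_lock(mm);		/* retake the lock and revalidate */
	toy_read_unlock(mm);

	assert(mm->read_holders == 0);	/* exit contract */
	return 0;
}

/* Caller: scans under the lock, then drops it before calling the callee. */
static int toy_scan(struct toy_mm *mm, bool *lock_dropped)
{
	int ret;

	toy_read_lock(mm);
	/* The scan itself would run here, under the lock. */
	toy_read_unlock(mm);		/* the unlock now lives in the caller */

	ret = toy_collapse(mm);
	*lock_dropped = true;		/* tell our own caller the lock is gone */
	return ret;
}
```

Because the callee no longer cares whether the lock was held on entry, a
caller that has already dropped the lock for another reason can call it
without passing any extra state.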


* Re: [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
  2026-05-22 15:00 ` [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Nico Pache
@ 2026-06-01 14:07   ` Lorenzo Stoakes
  2026-06-02 10:26     ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-01 14:07 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Fri, May 22, 2026 at 09:00:00AM -0600, Nico Pache wrote:
> Currently the collapse_huge_page function requires the mmap_read_lock to
> enter with it held, and exit with it dropped. This function moves the
> unlock into its parent caller, and changes this semantic to requiring it
> to enter/exit with it always unlocked.
>
> In future patches, we need this expectation, as for in mTHP collapse, we
> may have already have dropped the lock, and do not want to conditionally
> check for this by passing through the lock_dropped variable.
>
> No functional change is expected as one of the first things the
> collapse_huge_page function does is drop this lock before allocating the
> hugepage.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Nico Pache <npache@redhat.com>

One small nit below, otherwise LGTM, so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e98ba5b15163..fab35d318641 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  	return SCAN_SUCCEED;
>  }
>
> +/*
> + * collapse_huge_page expects the mmap_lock to be unlocked before entering and
> + * will always return with the lock unlocked, to avoid holding the mmap_lock
> + * while allocating a THP, as that could trigger direct reclaim/compaction.
> + * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> + */
>  static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		int referenced, int unmapped, struct collapse_control *cc)
>  {
> @@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -	/*
> -	 * Before allocating the hugepage, release the mmap_lock read lock.
> -	 * The allocation can take potentially a long time if it involves
> -	 * sync compaction, and we do not need to hold the mmap_lock during
> -	 * that. We will recheck the vma after taking it again in write mode.
> -	 */
> -	mmap_read_unlock(mm);
> -

NIT: Maybe worth an mmap_assert_locked()?

>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
> @@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> +		/* collapse_huge_page expects the lock to be dropped before calling */
> +		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
>  					    unmapped, cc);
>  		/* collapse_huge_page will return with the mmap_lock released */
> --
> 2.54.0
>

Cheers, Lorenzo

* Re: [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
  2026-06-01 14:07   ` Lorenzo Stoakes
@ 2026-06-02 10:26     ` Nico Pache
  0 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-02 10:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Mon, Jun 1, 2026 at 8:13 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:00AM -0600, Nico Pache wrote:
> > Currently the collapse_huge_page function requires the mmap_read_lock to
> > enter with it held, and exit with it dropped. This function moves the
> > unlock into its parent caller, and changes this semantic to requiring it
> > to enter/exit with it always unlocked.
> >
> > In future patches, we need this expectation, as for in mTHP collapse, we
> > may have already have dropped the lock, and do not want to conditionally
> > check for this by passing through the lock_dropped variable.
> >
> > No functional change is expected as one of the first things the
> > collapse_huge_page function does is drop this lock before allocating the
> > hugepage.
> >
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> One small nit below, otherwise LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Thank you for reviewing!

>
> > ---
> >  mm/khugepaged.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e98ba5b15163..fab35d318641 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >       return SCAN_SUCCEED;
> >  }
> >
> > +/*
> > + * collapse_huge_page expects the mmap_lock to be unlocked before entering and
> > + * will always return with the lock unlocked, to avoid holding the mmap_lock
> > + * while allocating a THP, as that could trigger direct reclaim/compaction.
> > + * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > + */
> >  static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               int referenced, int unmapped, struct collapse_control *cc)
> >  {
> > @@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -     /*
> > -      * Before allocating the hugepage, release the mmap_lock read lock.
> > -      * The allocation can take potentially a long time if it involves
> > -      * sync compaction, and we do not need to hold the mmap_lock during
> > -      * that. We will recheck the vma after taking it again in write mode.
> > -      */
> > -     mmap_read_unlock(mm);
> > -
>
> NIT: Maybe worth an mmap_assert_locked()?

But it will already be unlocked here. The contract is that we enter
unlocked and exit unlocked.

Cheers,
-- Nico

>
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> > @@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > +             /* collapse_huge_page expects the lock to be dropped before calling */
> > +             mmap_read_unlock(mm);
> >               result = collapse_huge_page(mm, start_addr, referenced,
> >                                           unmapped, cc);
> >               /* collapse_huge_page will return with the mmap_lock released */
> > --
> > 2.54.0
> >
>
> Cheers, Lorenzo
>


* [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (4 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-22 21:47   ` David Hildenbrand (Arm)
                     ` (4 more replies)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (11 subsequent siblings)
  17 siblings, 5 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

Pass an order and offset to collapse_huge_page to support collapsing anon
memory to arbitrary orders within a PMD. order indicates what mTHP size we
are attempting to collapse to, and offset indicates where in the PMD to
start the collapse attempt.

For non-PMD collapse we must leave the anon VMA write locked until after
we collapse the mTHP. In the PMD case all the pages are isolated, but in
the mTHP case this is not true, and we must keep the lock to prevent
access/changes to the page tables. Without it, rmap walkers could hit a
pmd_none while the PMD entry is unavailable due to being temporarily
removed during the collapse phase.

Acked-by: Usama Arif <usama.arif@linux.dev>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
 1 file changed, 55 insertions(+), 38 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fab35d318641..d64f42f66236 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
  * while allocating a THP, as that could trigger direct reclaim/compaction.
  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
  */
-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
-		int referenced, int unmapped, struct collapse_control *cc)
+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
+		int referenced, int unmapped, struct collapse_control *cc,
+		unsigned int order)
 {
+	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
+	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte = NULL;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	enum scan_result result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	bool anon_vma_locked = false;
 
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
+					 &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
+						     referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
+					 &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	vma_start_write(vma);
-	result = check_pmd_still_valid(mm, address, pmd);
+	result = check_pmd_still_valid(mm, pmd_addr, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
 	anon_vma_lock_write(vma->anon_vma);
+	anon_vma_locked = true;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
+				end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
@@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * Parallel GUP-fast is fine since GUP-fast will back off when
 	 * it detects PMD is changed.
 	 */
-	_pmd = pmdp_collapse_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      HPAGE_PMD_ORDER,
-						      &compound_pagelist);
+		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
+						      order, &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_NO_PTE_TABLE;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
-		BUG_ON(!pmd_none(*pmd));
+		WARN_ON_ONCE(!pmd_none(*pmd));
 		/*
 		 * We can only use set_pmd_at when establishing
 		 * hugepmds and never for establishing regular pmds that
@@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 */
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
-		anon_vma_unlock_write(vma->anon_vma);
 		goto out_up_write;
 	}
 
 	/*
-	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
+	 * For PMD collapse all pages are isolated and locked so anon_vma
+	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
+	 * removed and not all pages are isolated and locked, so we must hold
+	 * the lock to prevent neighboring folios from attempting to access
+	 * this PMD until it's reinstalled.
 	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	if (is_pmd_order(order)) {
+		anon_vma_unlock_write(vma->anon_vma);
+		anon_vma_locked = false;
+	}
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   HPAGE_PMD_ORDER,
-					   &compound_pagelist);
-	pte_unmap(pte);
+					   vma, start_addr, pte_ptl,
+					   order, &compound_pagelist);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
 
@@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
 	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
+	WARN_ON_ONCE(!pmd_none(*pmd));
+	if (is_pmd_order(order)) {
+		pgtable = pmd_pgtable(_pmd);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
+	} else {
+		/*
+		 * set_ptes is called in map_anon_folio_pte_nopf with the
+		 * pmd_ptl lock still held; this is safe as the PMD is expected
+		 * to be none. The pmd entry is then repopulated below.
+		 */
+		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+	}
 	spin_unlock(pmd_ptl);
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
 out_up_write:
+	if (anon_vma_locked)
+		anon_vma_unlock_write(vma->anon_vma);
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
 out_nolock:
 	if (folio)
@@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		/* collapse_huge_page expects the lock to be dropped before calling */
 		mmap_read_unlock(mm);
 		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc);
+					    unmapped, cc, HPAGE_PMD_ORDER);
 		/* collapse_huge_page will return with the mmap_lock released */
 		*lock_dropped = true;
 	}
-- 
2.54.0
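
The address arithmetic introduced at the top of collapse_huge_page() can be
worked through in userspace. The sketch below assumes 4 KiB base pages and
a 2 MiB PMD (order 9), as on x86-64; the TOY_* macros, struct toy_range and
both helpers are hypothetical names. toy_range_fits_pmd() is an added
sanity check for illustration and is not something this patch performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry for illustration: 4 KiB pages, PMD order 9 (2 MiB). */
#define TOY_PAGE_SIZE	(UINT64_C(1) << 12)
#define TOY_PMD_ORDER	9
#define TOY_PMD_SIZE	(TOY_PAGE_SIZE << TOY_PMD_ORDER)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

struct toy_range {
	uint64_t pmd_addr;	/* start of the PMD that contains the range */
	uint64_t start_addr;	/* first byte to collapse */
	uint64_t end_addr;	/* one past the last byte to collapse */
};

/* Mirrors the pmd_addr/end_addr computation added by the patch. */
static struct toy_range toy_collapse_range(uint64_t start_addr,
					   unsigned int order)
{
	struct toy_range r = {
		.pmd_addr = start_addr & TOY_PMD_MASK,
		.start_addr = start_addr,
		.end_addr = start_addr + (TOY_PAGE_SIZE << order),
	};

	return r;
}

/* Hypothetical check: naturally aligned and contained in a single PMD. */
static bool toy_range_fits_pmd(const struct toy_range *r, unsigned int order)
{
	uint64_t size = TOY_PAGE_SIZE << order;

	return (r->start_addr & (size - 1)) == 0 &&
	       r->end_addr <= r->pmd_addr + TOY_PMD_SIZE;
}
```

For example, an order-4 (64 KiB) collapse starting at 0x210000 yields
pmd_addr = 0x200000 and end_addr = 0x220000, while an order-9 collapse
starting at 0x200000 covers the whole PMD and ends at 0x400000. This is why
the VMA and PMD lookups use pmd_addr while the PTE-level work and the MMU
notifier range use start_addr and end_addr.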


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2026-05-22 21:47   ` David Hildenbrand (Arm)
  2026-05-26 14:42   ` Nico Pache
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 21:47 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Usama Arif

On 5/22/26 17:00, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
> 
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
> 
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

I guess we should add a comment here like:

/*
 * Only notify about the PTE range we will actually modify. While we
 * temporarily unmap the whole PTE table for mTHP collapse, we'll remap
 * it later, leaving other PTEs effectively unmodified. The locks we hold
 * prevent anybody from stumbling over such temporarily unmapped PTE tables.
 */

>  
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>  
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>  
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);
>  		spin_lock(pmd_ptl);
> -		BUG_ON(!pmd_none(*pmd));
> +		WARN_ON_ONCE(!pmd_none(*pmd));

Likely VM_WARN_ON_ONCE is sufficient.

>  		/*
>  		 * We can only use set_pmd_at when establishing
>  		 * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>  
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> +	 * removed and not all pages are isolated and locked, so we must hold
> +	 * the lock to prevent neighboring folios from attempting to access
> +	 * this PMD until its reinstalled.
>  	 */

That makes sense. I was wondering whether there was another reason for dropping
the anon_vma lock ... I guess it was just for latency purposes given that there
was no actual need for the lock anymore once all folios in the range were
isolated and locked.


With the two nits above addressed

Acked-by: David Hildenbrand (arm) <david@kernel.org>

-- 
Cheers,

David
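
For readers following the anon_vma discussion above, the unlock handling in
the patch can be modelled as a flag-tracked release on a shared exit path.
This is a userspace sketch under stated assumptions: struct toy_lock,
toy_lock_acquire(), toy_lock_release() and toy_collapse() are hypothetical
names, the lock is a bare depth counter, and isolate_ok stands in for the
result of the isolate step.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PMD_ORDER	9	/* assumed PMD order, as on x86-64 */

/* Hypothetical stand-in for the anon_vma lock: a bare depth counter. */
struct toy_lock {
	int depth;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->depth++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->depth > 0);
	l->depth--;
}

/*
 * Returns 0 on success, -1 on failure. *held_during_copy reports whether
 * the lock was still held when the copy/remap step would have run.
 */
static int toy_collapse(struct toy_lock *anon, unsigned int order,
			bool isolate_ok, bool *held_during_copy)
{
	bool locked = false;
	int ret = -1;

	*held_during_copy = false;

	toy_lock_acquire(anon);
	locked = true;

	if (!isolate_ok)
		goto out;	/* failure path relies on the shared exit */

	/* PMD order: everything is isolated, so drop the lock early. */
	if (order == TOY_PMD_ORDER) {
		toy_lock_release(anon);
		locked = false;
	}

	/* The copy and remap step would run here. */
	*held_during_copy = locked;
	ret = 0;
out:
	if (locked)
		toy_lock_release(anon);
	return ret;
}
```

Tracking the state in one flag keeps every exit path balanced: the PMD case
keeps its early release, while the mTHP case and the failure case both
release at the common label.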

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
  2026-05-22 21:47   ` David Hildenbrand (Arm)
@ 2026-05-26 14:42   ` Nico Pache
  2026-05-31  9:39   ` Lance Yang
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-26 14:42 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Usama Arif



On 5/22/26 9:00 AM, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
> 
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
> 
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

Hi Andrew, can you please append the following fixup to this commit?

The changes are described below.

Thank you :)

commit ed96f34ba40ffd2d6a5b54abe46fd6b480fc89af
Author: Nico Pache <npache@redhat.com>
Date:   Tue May 26 05:54:04 2026 -0600

    fixup: add a clarifying comment and change warn_on
    
    Add a clarifying comment describing how the locking/mmu notifier is
    handled and change the WARN_ON_ONCE to VM_WARN_ON_ONCE per David's
    suggestion.
    
    Acked-by: David Hildenbrand (arm) <david@kernel.org>
    Signed-off-by: Nico Pache <npache@redhat.com>

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9ab54397ae08..61d9494293b9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1283,6 +1283,13 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	anon_vma_lock_write(vma->anon_vma);
 	anon_vma_locked = true;
 
+	/*
+	 * Only notify about the PTE range we will actually modify. While we
+	 * temporarily unmap the whole PTE table for mTHP collapse, we'll remap
+	 * it later, leaving other PTEs effectively unmodified. The locks we
+	 * hold prevent anybody from stumbling over such temporarily unmapped
+	 * PTE tables.
+	 */
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
 				end_addr);
 	mmu_notifier_invalidate_range_start(&range);
@@ -1348,7 +1355,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	 */
 	__folio_mark_uptodate(folio);
 	spin_lock(pmd_ptl);
-	WARN_ON_ONCE(!pmd_none(*pmd));
+	VM_WARN_ON_ONCE(!pmd_none(*pmd));
 	if (is_pmd_order(order)) {
 		pgtable = pmd_pgtable(_pmd);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);


>  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 55 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>   * while allocating a THP, as that could trigger direct reclaim/compaction.
>   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>   */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -		int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned int order)
>  {
> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL;
>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	enum scan_result result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
>  
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
>  	}
>  
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> +						     referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;
>  
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>  
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>  
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);
>  		spin_lock(pmd_ptl);
> -		BUG_ON(!pmd_none(*pmd));
> +		WARN_ON_ONCE(!pmd_none(*pmd));
>  		/*
>  		 * We can only use set_pmd_at when establishing
>  		 * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>  
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> +	 * removed and not all pages are isolated and locked, so we must hold
> +	 * the lock to prevent neighboring folios from attempting to access
> +	 * this PMD until it's reinstalled.
>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (is_pmd_order(order)) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, start_addr, pte_ptl,
> +					   order, &compound_pagelist);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>  
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
>  	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +	WARN_ON_ONCE(!pmd_none(*pmd));
> +	if (is_pmd_order(order)) {
> +		pgtable = pmd_pgtable(_pmd);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> +	} else {
> +		/*
> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
> +		 * to be none. The pmd entry is then repopulated below.
> +		 */
> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +	}
>  	spin_unlock(pmd_ptl);
>  
>  	folio = NULL;
>  
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
>  	if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, HPAGE_PMD_ORDER);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
>  	}
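
The quoted patch moves the anon_vma unlock to a single exit label guarded by
the anon_vma_locked flag. A minimal userspace sketch of that pattern follows.
It is not kernel code: struct sketch_lock is a toy lock that only records
whether it is held, and sketch_collapse() with its fail_early/fail_late
parameters is a hypothetical stand-in for the control flow, not the real
function.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only tracks whether it is held (illustration only). */
struct sketch_lock {
	bool held;
};

static inline void sketch_lock_acquire(struct sketch_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static inline void sketch_lock_release(struct sketch_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Returns 0 on success, -1 on failure; the lock is never left held. */
static inline int sketch_collapse(struct sketch_lock *l, bool pmd_order,
				  bool fail_early, bool fail_late)
{
	bool locked = false;
	int ret = -1;

	if (fail_early)
		goto out;	/* lock not taken yet, nothing to undo */

	sketch_lock_acquire(l);
	locked = true;

	if (pmd_order) {
		/* PMD-sized case: the lock is dropped before the copy step. */
		sketch_lock_release(l);
		locked = false;
	}

	if (fail_late)
		goto out;	/* smaller orders still hold the lock here */

	ret = 0;
out:
	if (locked)
		sketch_lock_release(l);
	return ret;
}
```

The flag means each error path only needs a goto. Whether the lock is still
held is decided in one place, which is what lets the smaller orders keep the
lock across the copy step without duplicating unlock calls.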



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
  2026-05-22 21:47   ` David Hildenbrand (Arm)
  2026-05-26 14:42   ` Nico Pache
@ 2026-05-31  9:39   ` Lance Yang
  2026-05-31 20:00     ` David Hildenbrand (Arm)
  2026-06-04 10:21   ` Lorenzo Stoakes
  2026-06-04 11:38   ` Lorenzo Stoakes
  4 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-05-31  9:39 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif


On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>Pass an order and offset to collapse_huge_page to support collapsing anon
>memory to arbitrary orders within a PMD. order indicates what mTHP size we
>are attempting to collapse to, and offset indicates where in the PMD to
>start the collapse attempt.
>
>For non-PMD collapse we must leave the anon VMA write locked until after
>we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>the mTHP case this is not true, and we must keep the lock to prevent
>access/changes to the page tables. This can happen if the rmap walkers hit
>a pmd_none while the PMD entry is currently unavailable due to being
>temporarily removed during the collapse phase.
>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 55 insertions(+), 38 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index fab35d318641..d64f42f66236 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>  */
>-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>-		int referenced, int unmapped, struct collapse_control *cc)
>+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned int order)
> {
>+	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>+	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> 	LIST_HEAD(compound_pagelist);
> 	pmd_t *pmd, _pmd;
>-	pte_t *pte;
>+	pte_t *pte = NULL;
> 	pgtable_t pgtable;
> 	struct folio *folio;
> 	spinlock_t *pmd_ptl, *pte_ptl;
> 	enum scan_result result = SCAN_FAIL;
> 	struct vm_area_struct *vma;
> 	struct mmu_notifier_range range;
>+	bool anon_vma_locked = false;
> 
>-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>-
>-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>+	result = alloc_charge_folio(&folio, mm, cc, order);
> 	if (result != SCAN_SUCCEED)
> 		goto out_nolock;
> 
> 	mmap_read_lock(mm);
>-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>-					 HPAGE_PMD_ORDER);
>+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>+					 &vma, cc, order);
> 	if (result != SCAN_SUCCEED) {
> 		mmap_read_unlock(mm);
> 		goto out_nolock;
> 	}
> 
>-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>+	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> 	if (result != SCAN_SUCCEED) {
> 		mmap_read_unlock(mm);
> 		goto out_nolock;
>@@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 * released when it fails. So we jump out_nolock directly in
> 		 * that case.  Continuing to collapse causes inconsistency.
> 		 */
>-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>-						     referenced, HPAGE_PMD_ORDER);
>+		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>+						     referenced, order);
> 		if (result != SCAN_SUCCEED)
> 			goto out_nolock;
> 	}
>@@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * mmap_lock.
> 	 */
> 	mmap_write_lock(mm);
>-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>-					 HPAGE_PMD_ORDER);
>+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>+					 &vma, cc, order);
> 	if (result != SCAN_SUCCEED)
> 		goto out_up_write;
> 	/* check if the pmd is still valid */
> 	vma_start_write(vma);
>-	result = check_pmd_still_valid(mm, address, pmd);
>+	result = check_pmd_still_valid(mm, pmd_addr, pmd);
> 	if (result != SCAN_SUCCEED)
> 		goto out_up_write;
> 
> 	anon_vma_lock_write(vma->anon_vma);
>+	anon_vma_locked = true;
> 
>-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>-				address + HPAGE_PMD_SIZE);
>+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>+				end_addr);
> 	mmu_notifier_invalidate_range_start(&range);
> 
> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>@@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
> 	 * it detects PMD is changed.
> 	 */
>-	_pmd = pmdp_collapse_flush(vma, address, pmd);
>+	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> 	spin_unlock(pmd_ptl);
> 	mmu_notifier_invalidate_range_end(&range);
> 	tlb_remove_table_sync_one();
> 
>-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>+	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> 	if (pte) {
>-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>-						      HPAGE_PMD_ORDER,
>-						      &compound_pagelist);
>+		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>+						      order, &compound_pagelist);
> 		spin_unlock(pte_ptl);
> 	} else {
> 		result = SCAN_NO_PTE_TABLE;
> 	}
> 
> 	if (unlikely(result != SCAN_SUCCEED)) {
>-		if (pte)
>-			pte_unmap(pte);
> 		spin_lock(pmd_ptl);
>-		BUG_ON(!pmd_none(*pmd));
>+		WARN_ON_ONCE(!pmd_none(*pmd));
> 		/*
> 		 * We can only use set_pmd_at when establishing
> 		 * hugepmds and never for establishing regular pmds that
>@@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 */
> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> 		spin_unlock(pmd_ptl);
>-		anon_vma_unlock_write(vma->anon_vma);
> 		goto out_up_write;
> 	}
> 
> 	/*
>-	 * All pages are isolated and locked so anon_vma rmap
>-	 * can't run anymore.
>+	 * For PMD collapse all pages are isolated and locked so anon_vma
>+	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>+	 * removed and not all pages are isolated and locked, so we must hold
>+	 * the lock to prevent neighboring folios from attempting to access
>+	 * this PMD until it's reinstalled.
> 	 */
>-	anon_vma_unlock_write(vma->anon_vma);
>+	if (is_pmd_order(order)) {
>+		anon_vma_unlock_write(vma->anon_vma);
>+		anon_vma_locked = false;
>+	}
> 
> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>-					   vma, address, pte_ptl,
>-					   HPAGE_PMD_ORDER,
>-					   &compound_pagelist);
>-	pte_unmap(pte);
>+					   vma, start_addr, pte_ptl,
>+					   order, &compound_pagelist);
> 	if (unlikely(result != SCAN_SUCCEED))
> 		goto out_up_write;
> 
>@@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * write.
> 	 */
> 	__folio_mark_uptodate(folio);
>-	pgtable = pmd_pgtable(_pmd);
>-
> 	spin_lock(pmd_ptl);
>-	BUG_ON(!pmd_none(*pmd));
>-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>+	WARN_ON_ONCE(!pmd_none(*pmd));
>+	if (is_pmd_order(order)) {
>+		pgtable = pmd_pgtable(_pmd);
>+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>+	} else {
>+		/*
>+		 * set_ptes is called in map_anon_folio_pte_nopf with the
>+		 * pmd_ptl lock still held; this is safe as the PMD is expected
>+		 * to be none. The pmd entry is then repopulated below.
>+		 */
>+		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);

Emm ... is it safe to use map_anon_folio_pte_nopf() here?

At this point pmdp_collapse_flush() has cleared the PMD from the page
tables. The PTE table we are updating is only reachable through the saved
old PMD value, _pmd, until pmd_populate() below.

map_anon_folio_pte_nopf() does set_ptes() and then calls
update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
that hook as:

"
	At the end of every page fault, this routine is invoked to tell
	the architecture specific code that translations now exists
	in the software page tables for address space "vma->vm_mm"
	at virtual address "address" for "nr" consecutive pages.
"

But that does not seem true here yet, since the PTE table is not
reachable from vma->vm_mm when update_mmu_cache_range() is called.

Should we avoid calling update_mmu_cache_range() until after the PTE
table is reinstalled with pmd_populate()?

Cheers, Lance

>+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>+	}
> 	spin_unlock(pmd_ptl);
> 
> 	folio = NULL;
> 
> 	result = SCAN_SUCCEED;
> out_up_write:
>+	if (anon_vma_locked)
>+		anon_vma_unlock_write(vma->anon_vma);
>+	if (pte)
>+		pte_unmap(pte);
> 	mmap_write_unlock(mm);
> out_nolock:
> 	if (folio)
>@@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
> 		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc);
>+					    unmapped, cc, HPAGE_PMD_ORDER);
> 		/* collapse_huge_page will return with the mmap_lock released */
> 		*lock_dropped = true;
> 	}
>-- 
>2.54.0
>
>
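
The ordering in the non-PMD branch (fill the PTEs, smp_wmb(), then
pmd_populate()) is a publish pattern: the table contents are written first and
the pointer that makes the table reachable is written last. Below is a minimal
userspace analogy using C11 release/acquire atomics. It is not kernel code and
does not model update_mmu_cache_range(). All sketch_* names are hypothetical,
and a release store stands in for the smp_wmb() plus the pointer store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NR_ENTRIES 8

struct sketch_table {
	unsigned long entry[SKETCH_NR_ENTRIES];
};

/* Slot through which readers find the table; NULL means "not reachable". */
static _Atomic(struct sketch_table *) sketch_slot;

/* Writer: fill every entry, then make the table reachable. */
static inline void sketch_fill_and_publish(struct sketch_table *t,
					   unsigned long base)
{
	for (size_t i = 0; i < SKETCH_NR_ENTRIES; i++)
		t->entry[i] = base + i;

	/* Release: the entry stores above are ordered before this store. */
	atomic_store_explicit(&sketch_slot, t, memory_order_release);
}

/* Reader: sees either no table at all or a fully filled one. */
static inline const struct sketch_table *sketch_lookup(void)
{
	return atomic_load_explicit(&sketch_slot, memory_order_acquire);
}
```

A reader that observes a non-NULL pointer through sketch_lookup() also
observes the filled entries. Until the pointer is stored, the table is
invisible to readers, which corresponds to the window the question above is
about.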


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-31  9:39   ` Lance Yang
@ 2026-05-31 20:00     ` David Hildenbrand (Arm)
  2026-06-01  3:28       ` Lance Yang
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-31 20:00 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 5/31/26 11:39, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates where in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fab35d318641..d64f42f66236 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>  */
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -		int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +		int referenced, int unmapped, struct collapse_control *cc,
>> +		unsigned int order)
>> {
>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>> 	LIST_HEAD(compound_pagelist);
>> 	pmd_t *pmd, _pmd;
>> -	pte_t *pte;
>> +	pte_t *pte = NULL;
>> 	pgtable_t pgtable;
>> 	struct folio *folio;
>> 	spinlock_t *pmd_ptl, *pte_ptl;
>> 	enum scan_result result = SCAN_FAIL;
>> 	struct vm_area_struct *vma;
>> 	struct mmu_notifier_range range;
>> +	bool anon_vma_locked = false;
>>
>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_nolock;
>>
>> 	mmap_read_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> 	}
>>
>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 * released when it fails. So we jump out_nolock directly in
>> 		 * that case.  Continuing to collapse causes inconsistency.
>> 		 */
>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -						     referenced, HPAGE_PMD_ORDER);
>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>> +						     referenced, order);
>> 		if (result != SCAN_SUCCEED)
>> 			goto out_nolock;
>> 	}
>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * mmap_lock.
>> 	 */
>> 	mmap_write_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>> 	/* check if the pmd is still valid */
>> 	vma_start_write(vma);
>> -	result = check_pmd_still_valid(mm, address, pmd);
>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>>
>> 	anon_vma_lock_write(vma->anon_vma);
>> +	anon_vma_locked = true;
>>
>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -				address + HPAGE_PMD_SIZE);
>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>> +				end_addr);
>> 	mmu_notifier_invalidate_range_start(&range);
>>
>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>> 	 * it detects PMD is changed.
>> 	 */
>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>> 	spin_unlock(pmd_ptl);
>> 	mmu_notifier_invalidate_range_end(&range);
>> 	tlb_remove_table_sync_one();
>>
>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>> 	if (pte) {
>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -						      HPAGE_PMD_ORDER,
>> -						      &compound_pagelist);
>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>> +						      order, &compound_pagelist);
>> 		spin_unlock(pte_ptl);
>> 	} else {
>> 		result = SCAN_NO_PTE_TABLE;
>> 	}
>>
>> 	if (unlikely(result != SCAN_SUCCEED)) {
>> -		if (pte)
>> -			pte_unmap(pte);
>> 		spin_lock(pmd_ptl);
>> -		BUG_ON(!pmd_none(*pmd));
>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>> 		/*
>> 		 * We can only use set_pmd_at when establishing
>> 		 * hugepmds and never for establishing regular pmds that
>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 */
>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> 		spin_unlock(pmd_ptl);
>> -		anon_vma_unlock_write(vma->anon_vma);
>> 		goto out_up_write;
>> 	}
>>
>> 	/*
>> -	 * All pages are isolated and locked so anon_vma rmap
>> -	 * can't run anymore.
>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>> +	 * removed and not all pages are isolated and locked, so we must hold
>> +	 * the lock to prevent neighboring folios from attempting to access
>> +	 * this PMD until it's reinstalled.
>> 	 */
>> -	anon_vma_unlock_write(vma->anon_vma);
>> +	if (is_pmd_order(order)) {
>> +		anon_vma_unlock_write(vma->anon_vma);
>> +		anon_vma_locked = false;
>> +	}
>>
>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -					   vma, address, pte_ptl,
>> -					   HPAGE_PMD_ORDER,
>> -					   &compound_pagelist);
>> -	pte_unmap(pte);
>> +					   vma, start_addr, pte_ptl,
>> +					   order, &compound_pagelist);
>> 	if (unlikely(result != SCAN_SUCCEED))
>> 		goto out_up_write;
>>
>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * write.
>> 	 */
>> 	__folio_mark_uptodate(folio);
>> -	pgtable = pmd_pgtable(_pmd);
>> -
>> 	spin_lock(pmd_ptl);
>> -	BUG_ON(!pmd_none(*pmd));
>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>> +	if (is_pmd_order(order)) {
>> +		pgtable = pmd_pgtable(_pmd);
>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +	} else {
>> +		/*
>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>> +		 * to be none. The pmd entry is then repopulated below.
>> +		 */
>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> 
> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> 
> At this point pmdp_collapse_flush() has cleared the PMD from the page
> tables. The PTE table we are updating is only reachable through the saved
> old PMD value, _pmd, until pmd_populate() below.
> 
> map_anon_folio_pte_nopf() does set_ptes() and then calls
> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> that hook as:
> 
> "
> 	At the end of every page fault, this routine is invoked to tell
> 	the architecture specific code that translations now exists
> 	in the software page tables for address space "vma->vm_mm"
> 	at virtual address "address" for "nr" consecutive pages.
> "
> 
> But that does not seem true here yet, since the PTE table is not
> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> 
> Should we avoid calling update_mmu_cache_range() until after the PTE
> table is reinstalled with pmd_populate()?

I recall that update_mmu_cache* users mostly care about updating folio flags
for the folio derived from the PTE ... or flushing caches for the user address.

So intuitively I would say "the architecture code doesn't care that the PMD
table will only be visible to HW shortly after". The important thing should be
that it will definitely happen, and that nothing else is currently there or
can be there?



-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-31 20:00     ` David Hildenbrand (Arm)
@ 2026-06-01  3:28       ` Lance Yang
  2026-06-01  6:54         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-06-01  3:28 UTC (permalink / raw)
  To: david
  Cc: lance.yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif


On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>On 5/31/26 11:39, Lance Yang wrote:
>> 
>> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>>> Pass an order and offset to collapse_huge_page to support collapsing anon
>>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>>> are attempting to collapse to, and offset indicates where in the PMD to
>>> start the collapse attempt.
>>>
>>> For non-PMD collapse we must leave the anon VMA write locked until after
>>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>>> the mTHP case this is not true, and we must keep the lock to prevent
>>> access/changes to the page tables. This can happen if the rmap walkers hit
>>> a pmd_none while the PMD entry is currently unavailable due to being
>>> temporarily removed during the collapse phase.
>>>
>>> Acked-by: Usama Arif <usama.arif@linux.dev>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fab35d318641..d64f42f66236 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>>  */
>>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>> -		int referenced, int unmapped, struct collapse_control *cc)
>>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>>> +		int referenced, int unmapped, struct collapse_control *cc,
>>> +		unsigned int order)
>>> {
>>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>>> 	LIST_HEAD(compound_pagelist);
>>> 	pmd_t *pmd, _pmd;
>>> -	pte_t *pte;
>>> +	pte_t *pte = NULL;
>>> 	pgtable_t pgtable;
>>> 	struct folio *folio;
>>> 	spinlock_t *pmd_ptl, *pte_ptl;
>>> 	enum scan_result result = SCAN_FAIL;
>>> 	struct vm_area_struct *vma;
>>> 	struct mmu_notifier_range range;
>>> +	bool anon_vma_locked = false;
>>>
>>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>> -
>>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_nolock;
>>>
>>> 	mmap_read_lock(mm);
>>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>> -					 HPAGE_PMD_ORDER);
>>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>>> +					 &vma, cc, order);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		mmap_read_unlock(mm);
>>> 		goto out_nolock;
>>> 	}
>>>
>>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		mmap_read_unlock(mm);
>>> 		goto out_nolock;
>>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 		 * released when it fails. So we jump out_nolock directly in
>>> 		 * that case.  Continuing to collapse causes inconsistency.
>>> 		 */
>>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>>> -						     referenced, HPAGE_PMD_ORDER);
>>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>>> +						     referenced, order);
>>> 		if (result != SCAN_SUCCEED)
>>> 			goto out_nolock;
>>> 	}
>>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * mmap_lock.
>>> 	 */
>>> 	mmap_write_lock(mm);
>>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>> -					 HPAGE_PMD_ORDER);
>>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>>> +					 &vma, cc, order);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_up_write;
>>> 	/* check if the pmd is still valid */
>>> 	vma_start_write(vma);
>>> -	result = check_pmd_still_valid(mm, address, pmd);
>>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_up_write;
>>>
>>> 	anon_vma_lock_write(vma->anon_vma);
>>> +	anon_vma_locked = true;
>>>
>>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -				address + HPAGE_PMD_SIZE);
>>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>>> +				end_addr);
>>> 	mmu_notifier_invalidate_range_start(&range);
>>>
>>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>>> 	 * it detects PMD is changed.
>>> 	 */
>>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>>> 	spin_unlock(pmd_ptl);
>>> 	mmu_notifier_invalidate_range_end(&range);
>>> 	tlb_remove_table_sync_one();
>>>
>>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>>> 	if (pte) {
>>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>>> -						      HPAGE_PMD_ORDER,
>>> -						      &compound_pagelist);
>>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>>> +						      order, &compound_pagelist);
>>> 		spin_unlock(pte_ptl);
>>> 	} else {
>>> 		result = SCAN_NO_PTE_TABLE;
>>> 	}
>>>
>>> 	if (unlikely(result != SCAN_SUCCEED)) {
>>> -		if (pte)
>>> -			pte_unmap(pte);
>>> 		spin_lock(pmd_ptl);
>>> -		BUG_ON(!pmd_none(*pmd));
>>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>>> 		/*
>>> 		 * We can only use set_pmd_at when establishing
>>> 		 * hugepmds and never for establishing regular pmds that
>>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 		 */
>>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> 		spin_unlock(pmd_ptl);
>>> -		anon_vma_unlock_write(vma->anon_vma);
>>> 		goto out_up_write;
>>> 	}
>>>
>>> 	/*
>>> -	 * All pages are isolated and locked so anon_vma rmap
>>> -	 * can't run anymore.
>>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>>> +	 * removed and not all pages are isolated and locked, so we must hold
>>> +	 * the lock to prevent neighboring folios from attempting to access
>>> +	 * this PMD until it's reinstalled.
>>> 	 */
>>> -	anon_vma_unlock_write(vma->anon_vma);
>>> +	if (is_pmd_order(order)) {
>>> +		anon_vma_unlock_write(vma->anon_vma);
>>> +		anon_vma_locked = false;
>>> +	}
>>>
>>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>> -					   vma, address, pte_ptl,
>>> -					   HPAGE_PMD_ORDER,
>>> -					   &compound_pagelist);
>>> -	pte_unmap(pte);
>>> +					   vma, start_addr, pte_ptl,
>>> +					   order, &compound_pagelist);
>>> 	if (unlikely(result != SCAN_SUCCEED))
>>> 		goto out_up_write;
>>>
>>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * write.
>>> 	 */
>>> 	__folio_mark_uptodate(folio);
>>> -	pgtable = pmd_pgtable(_pmd);
>>> -
>>> 	spin_lock(pmd_ptl);
>>> -	BUG_ON(!pmd_none(*pmd));
>>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>>> +	if (is_pmd_order(order)) {
>>> +		pgtable = pmd_pgtable(_pmd);
>>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>>> +	} else {
>>> +		/*
>>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>>> +		 * to be none. The pmd entry is then repopulated below.
>>> +		 */
>>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> 
>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>> 
>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>> tables. The PTE table we are updating is only reachable through the saved
>> old PMD value, _pmd, until pmd_populate() below.
>> 
>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>> that hook as:
>> 
>> "
>> 	At the end of every page fault, this routine is invoked to tell
>> 	the architecture specific code that translations now exists
>> 	in the software page tables for address space "vma->vm_mm"
>> 	at virtual address "address" for "nr" consecutive pages.
>> "
>> 
>> But that does not seem true here yet, since the PTE table is not
>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>> 
>> Should we avoid calling update_mmu_cache_range() until after the PTE
>> table is reinstalled with pmd_populate()?
>
>I recall that update_mmu_cache* users mostly care about updating folio flags,
>for the folio derived from the PTE ... or flushing caches for the user address.
>
>So intuitively I would say "the architecture code doesn't care that the PMD
>table will only be visible to HW shortly after". The important thing should be
>that it will definitely happen, and that nothing else is currently there or can be
>there?

Ah, fair point.

I was mostly worried about arch hooks that walk vma->vm_mm again, rather
than only using the pte pointer passed in. For example, mips does:

  update_mmu_cache_range()
    -> __update_tlb()
      -> pgd_offset(vma->vm_mm, address)
      -> pte_offset_map(...)

and __update_tlb() has this assumption:

		/*
		 * update_mmu_cache() is called between pte_offset_map_lock()
		 * and pte_unmap_unlock(), so we can assume that ptep is not
		 * NULL here: and what should be done below if it were NULL?
		 */

So if khugepaged happens to run with current->active_mm == vma->vm_mm
here, could __update_tlb() hit the none PMD, get NULL from
pte_offset_map(), and then dereference it?
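
To make the concern concrete, here is a user-space toy model, not kernel
code: every toy_* name is hypothetical and only mimics the shape of the
problem. While the PMD slot is cleared, a fresh walk from the mm finds
nothing, whereas the saved old PMD value still reaches the PTE table.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTRS_PER_PTE 4

/* Hypothetical stand-ins for a PTE table and the PMD slot pointing at it. */
struct toy_pte_table {
	unsigned long entries[TOY_PTRS_PER_PTE];
};

struct toy_pmd {
	struct toy_pte_table *table;	/* NULL models a none PMD */
};

/* Models the effect of clearing the PMD: empty the slot, return the old value. */
static struct toy_pmd toy_collapse_flush(struct toy_pmd *pmd)
{
	struct toy_pmd old = *pmd;

	pmd->table = NULL;
	return old;
}

/* Models a re-walk starting from the mm: fails while the slot is none. */
static unsigned long *toy_rewalk(struct toy_pmd *pmd, unsigned int idx)
{
	if (!pmd->table)
		return NULL;
	return &pmd->table->entries[idx];
}

/* Models using the saved old PMD value: the table stays reachable. */
static unsigned long *toy_saved(struct toy_pmd saved, unsigned int idx)
{
	return &saved.table->entries[idx];
}
```

A hook that only uses the pointer it is handed behaves like toy_saved();
a hook that re-walks behaves like toy_rewalk() and has to cope with NULL.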

Just wanted to raise it since some arch code may still have assumptions
like this, and the always-enable-mTHP work is getting closer ...

Probably very very very hard to hit, though :)

Cheers, Lance

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  3:28       ` Lance Yang
@ 2026-06-01  6:54         ` David Hildenbrand (Arm)
  2026-06-01  7:49           ` Lance Yang
                             ` (2 more replies)
  0 siblings, 3 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01  6:54 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/1/26 05:28, Lance Yang wrote:
> 
> On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/31/26 11:39, Lance Yang wrote:
>>>
>>>
>>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>>>
>>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>>> tables. The PTE table we are updating is only reachable through the saved
>>> old PMD value, _pmd, until pmd_populate() below.
>>>
>>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>>> that hook as:
>>>
>>> "
>>> 	At the end of every page fault, this routine is invoked to tell
>>> 	the architecture specific code that translations now exists
>>> 	in the software page tables for address space "vma->vm_mm"
>>> 	at virtual address "address" for "nr" consecutive pages.
>>> "
>>>
>>> But that does not seem true here yet, since the PTE table is not
>>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>>>
>>> Should we avoid calling update_mmu_cache_range() until after the PTE
>>> table is reinstalled with pmd_populate()?
>>
>> I recall that update_mmu_cache* users mostly care about updating folio flags,
>> for the folio derived from the PTE ... or flushing caches for the user address.
>>
>> So intuitively I would say "the architecture code doesn't care that the PMD
>> table will only be visible to HW shortly after". The important thing should be
>> that it will definitely happen, and that nothing else is currently there or can be
>> there?
> 
> Ah, fair point.
> 
> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> than only using the pte pointer passed in. For example, mips does:

Right, a re-walk would be the real problem.

> 
>   update_mmu_cache_range()
>     -> __update_tlb()
>       -> pgd_offset(vma->vm_mm, address)
>       -> pte_offset_map(...)
> 
> and __update_tlb() has this assumption:
> 
> 		/*
> 		 * update_mmu_cache() is called between pte_offset_map_lock()
> 		 * and pte_unmap_unlock(), so we can assume that ptep is not
> 		 * NULL here: and what should be done below if it were NULL?
> 		 */
> 
> So if khugepaged happens to run with current->active_mm == vma->vm_mm
> here, could __update_tlb() hit the none PMD, get NULL from
> pte_offset_map(), and then dereference it?

Likely yes -- that MIPS code is horrible. And the comment in MIPS code
even spells that out. :(

Do you know about other code like that, or is MIPS the only one doing a
re-walk and crossing fingers?

> 
> Just wanted to raise it since some arch code may still have assumptions
> like this, and the always-enable-mTHP work is getting closer ...

Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.

> 
> Probably very very very hard to hit, though :)

Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
nobody can interfere in the meantime ... and the PMD lock will not be sufficient.

Maybe we could reinstall the page table with the cleared (none) entries while
still holding the PTL?

Thinking out loud:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5ba298d420b7..e39b750b1e6f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
                map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
        } else {
                /*
-                * set_ptes is called in map_anon_folio_pte_nopf with the
-                * pmd_ptl lock still held; this is safe as the PMD is expected
-                * to be none. The pmd entry is then repopulated below.
+                * Re-insert the page table with the cleared entries, but
+                * hold the PTL, such that no one can mess with the re-installed
+                * page table until we updated the temporarily-cleared entries
+                * through map_anon_folio_pte_nopf().
                 */
-               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
-               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+               if (pte_ptl != pmd_ptl)
+                       spin_lock(pte_ptl);
                pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
+               if (pte_ptl != pmd_ptl)
+                       spin_unlock(pte_ptl);
        }
        spin_unlock(pmd_ptl);
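
The "pte_ptl != pmd_ptl" checks above are presumably there because,
depending on configuration, both pointers can refer to the same spinlock,
and taking it twice would self-deadlock. A minimal user-space sketch of
that pattern, with a hypothetical toy_lock type standing in for
spinlock_t (none of these names exist in the kernel):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock that only records whether it is held; not a real spinlock. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a real spinlock would self-deadlock here */
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Take the inner lock only if it is a different lock from the outer one. */
static void toy_lock_inner(struct toy_lock *outer, struct toy_lock *inner)
{
	if (inner != outer)
		toy_lock_acquire(inner);
}

/* Release the inner lock only if toy_lock_inner() actually took it. */
static void toy_unlock_inner(struct toy_lock *outer, struct toy_lock *inner)
{
	if (inner != outer)
		toy_lock_release(inner);
}
```

With outer == inner both helpers are no-ops, so the outer lock is never
re-acquired or dropped early.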
 


-- 
Cheers,

David

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  6:54         ` David Hildenbrand (Arm)
@ 2026-06-01  7:49           ` Lance Yang
  2026-06-01  8:15             ` David Hildenbrand (Arm)
  2026-06-01  9:08           ` Lance Yang
  2026-06-04 12:33           ` Lorenzo Stoakes
  2 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-06-01  7:49 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif



On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
> On 6/1/26 05:28, Lance Yang wrote:
>>
>> On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>>> On 5/31/26 11:39, Lance Yang wrote:
>>>>
>>>>
>>>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>>>>
>>>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>>>> tables. The PTE table we are updating is only reachable through the saved
>>>> old PMD value, _pmd, until pmd_populate() below.
>>>>
>>>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>>>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>>>> that hook as:
>>>>
>>>> "
>>>> 	At the end of every page fault, this routine is invoked to tell
>>>> 	the architecture specific code that translations now exists
>>>> 	in the software page tables for address space "vma->vm_mm"
>>>> 	at virtual address "address" for "nr" consecutive pages.
>>>> "
>>>>
>>>> But that does not seem true here yet, since the PTE table is not
>>>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>>>>
>>>> Should we avoid calling update_mmu_cache_range() until after the PTE
>>>> table is reinstalled with pmd_populate()?
>>>
>>> I recall that update_mmu_cache* users mostly care about updating folio flags,
>>> for the folio derived from the PTE ... or flushing caches for the user address.
>>>
>>> So intuitively I would say "the architecture code doesn't care that the PMD
>>> table will only be visible to HW shortly after". The important thing should be
>>> that it will definitely happen, and that nothing else is currently there or can be
>>> there?
>>
>> Ah, fair point.
>>
>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>> than only using the pte pointer passed in. For example, mips does:
> 
> Right, a re-walk would be the real problem.
> 
>>
>>    update_mmu_cache_range()
>>      -> __update_tlb()
>>        -> pgd_offset(vma->vm_mm, address)
>>        -> pte_offset_map(...)
>>
>> and __update_tlb() has this assumption:
>>
>> 		/*
>> 		 * update_mmu_cache() is called between pte_offset_map_lock()
>> 		 * and pte_unmap_unlock(), so we can assume that ptep is not
>> 		 * NULL here: and what should be done below if it were NULL?
>> 		 */
>>
>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>> here, could __update_tlb() hit the none PMD, get NULL from
>> pte_offset_map(), and then dereference it?
> 
> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> even spells that out. :(
> 
> Do you know about other code like that, or is MIPS the only one doing a
> re-walk and crossing fingers?

I had Codex do the boring grep-work through the arch update_mmu_cache*
code :D

MIPS doesn't seem to be the only code doing a re-walk, but it is the
only one I found that appears to assume the PMD/PTE walk cannot fail,
without checking whether the PMD is none ...

Cheers, Lance

>>
>> Just wanted to raise it since some arch code may still have assumptions
>> like this, and the always-enable-mTHP work is getting closer ...
> 
> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
> 
>>
>> Probably very very very hard to hit, though :)
> 
> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
> 
> Maybe we could reinstall the page table with the cleared (none) entries while
> still holding the PTL?
> 
> Thinking out loud:
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ba298d420b7..e39b750b1e6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>                  map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>          } else {
>                  /*
> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> -                * to be none. The pmd entry is then repopulated below.
> +                * Re-insert the page table with the cleared entries, but
> +                * hold the PTL, such that no one can mess with the re-installed
> +                * page table until we updated the temporarily-cleared entries
> +                * through map_anon_folio_pte_nopf().
>                   */
> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +               if (pte_ptl != pmd_ptl)
> +                       spin_lock(pte_ptl);
>                  pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +               if (pte_ptl != pmd_ptl)
> +                       spin_unlock(pte_ptl);
>          }
>          spin_unlock(pmd_ptl);
>   
> 
> 


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  7:49           ` Lance Yang
@ 2026-06-01  8:15             ` David Hildenbrand (Arm)
  2026-06-01  8:44               ` Lance Yang
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01  8:15 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/1/26 09:49, Lance Yang wrote:
> 
> 
> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>> On 6/1/26 05:28, Lance Yang wrote:
>>>
>>>
>>> Ah, fair point.
>>>
>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>>> than only using the pte pointer passed in. For example, mips does:
>>
>> Right, a re-walk would be the real problem.
>>
>>>
>>>    update_mmu_cache_range()
>>>      -> __update_tlb()
>>>        -> pgd_offset(vma->vm_mm, address)
>>>        -> pte_offset_map(...)
>>>
>>> and __update_tlb() has this assumption:
>>>
>>>         /*
>>>          * update_mmu_cache() is called between pte_offset_map_lock()
>>>          * and pte_unmap_unlock(), so we can assume that ptep is not
>>>          * NULL here: and what should be done below if it were NULL?
>>>          */
>>>
>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>>> here, could __update_tlb() hit the none PMD, get NULL from
>>> pte_offset_map(), and then dereference it?
>>
>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>> even spells that out. :(
>>
>> Do you know about other code like that, or is MIPS the only one doing a
>> re-walk and crossing fingers?
> 
> I had Codex do the boring grep-work through the arch update_mmu_cache*
> code :D
> 
> MIPS doesn't seem to be the only code doing a re-walk, but it is the
> only one I found that appears to assume the PMD/PTE walk cannot fail,
> without checking whether the PMD is none ...

Okay, but likely the other code that tries to handle it is also problematic.

Best to make sure the page table is already installed when updating the entries.

-- 
Cheers,

David

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  8:15             ` David Hildenbrand (Arm)
@ 2026-06-01  8:44               ` Lance Yang
  2026-06-01 10:09                 ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-06-01  8:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif



On 2026/6/1 16:15, David Hildenbrand (Arm) wrote:
> On 6/1/26 09:49, Lance Yang wrote:
>>
>>
>> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>>> On 6/1/26 05:28, Lance Yang wrote:
>>>>
>>>>
>>>> Ah, fair point.
>>>>
>>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>>>> than only using the pte pointer passed in. For example, mips does:
>>>
>>> Right, a re-walk would be the real problem.
>>>
>>>>
>>>>     update_mmu_cache_range()
>>>>       -> __update_tlb()
>>>>         -> pgd_offset(vma->vm_mm, address)
>>>>         -> pte_offset_map(...)
>>>>
>>>> and __update_tlb() has this assumption:
>>>>
>>>>          /*
>>>>           * update_mmu_cache() is called between pte_offset_map_lock()
>>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
>>>>           * NULL here: and what should be done below if it were NULL?
>>>>           */
>>>>
>>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>>>> here, could __update_tlb() hit the none PMD, get NULL from
>>>> pte_offset_map(), and then dereference it?
>>>
>>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>>> even spells that out. :(
>>>
>>> Do you know about other code like that, or is MIPS the only one doing a
>>> re-walk and crossing fingers?
>>
>> I had Codex do the boring grep-work through the arch update_mmu_cache*
>> code :D
>>
>> MIPS doesn't seem to be the only code doing a re-walk, but it is the
>> only one I found that appears to assume the PMD/PTE walk cannot fail,
>> without checking whether the PMD is none ...
> 
> Okay, but likely the other code that tries to handle it is also problematic.
> 
> Best to make sure the page table is already installed when updating the entries.

Neat, makes sense to me :D

That way the page table is back in place before any arch hook gets to
look at it :)


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  8:44               ` Lance Yang
@ 2026-06-01 10:09                 ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 10:09 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/1/26 10:44, Lance Yang wrote:
> 
> 
> On 2026/6/1 16:15, David Hildenbrand (Arm) wrote:
>> On 6/1/26 09:49, Lance Yang wrote:
>>>
>>>
>>>
>>> I had Codex do the boring grep-work through the arch update_mmu_cache*
>>> code :D
>>>
>>> MIPS doesn't seem to be the only code doing a re-walk, but it is the
>>> only one I found that appears to assume the PMD/PTE walk cannot fail,
>>> without checking whether the PMD is none ...
>>
>> Okay, but likely the other code that tries to handle it is also problematic.
>>
>> Best to make sure the page table is already installed when updating the entries.
> 
> Neat, makes sense to me :D
> 
> That way the page table is back in place before any arch hook gets to look at it :)

Right. I don't think we could run into a deadlock here (nobody should
concurrently take a look at the page tables in the first place).

Not sure about the memory barrier I dropped: the page tables are already
properly set up (just some entries cleared), so I'd assume that barrier might
not be required.

-- 
Cheers,

David

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  6:54         ` David Hildenbrand (Arm)
  2026-06-01  7:49           ` Lance Yang
@ 2026-06-01  9:08           ` Lance Yang
  2026-06-01 10:23             ` David Hildenbrand (Arm)
  2026-06-04 12:33           ` Lorenzo Stoakes
  2 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-06-01  9:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif



On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
> On 6/1/26 05:28, Lance Yang wrote:
>>
>> On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>>> On 5/31/26 11:39, Lance Yang wrote:
>>>>
>>>>
>>>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>>>>
>>>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>>>> tables. The PTE table we are updating is only reachable through the saved
>>>> old PMD value, _pmd, until pmd_populate() below.
>>>>
>>>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>>>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>>>> that hook as:
>>>>
>>>> "
>>>> 	At the end of every page fault, this routine is invoked to tell
>>>> 	the architecture specific code that translations now exists
>>>> 	in the software page tables for address space "vma->vm_mm"
>>>> 	at virtual address "address" for "nr" consecutive pages.
>>>> "
>>>>
>>>> But that does not seem true here yet, since the PTE table is not
>>>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>>>>
>>>> Should we avoid calling update_mmu_cache_range() until after the PTE
>>>> table is reinstalled with pmd_populate()?
>>>
>>> I recall that update_mmu_cache* users mostly care about updating folio flags,
>>> for the folio derived from the PTE ... or flushing caches for the user address.
>>>
>>> So intuitively I would say "the architecture code doesn't care that the PMD
>>> table will only be visible to HW shortly after". The important thing should be
>>> that it will definitely happen, and that nothing else is currently there or can be
>>> there?
>>
>> Ah, fair point.
>>
>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>> than only using the pte pointer passed in. For example, mips does:
> 
> Right, a re-walk would be the real problem.
> 
>>
>>    update_mmu_cache_range()
>>      -> __update_tlb()
>>        -> pgd_offset(vma->vm_mm, address)
>>        -> pte_offset_map(...)
>>
>> and __update_tlb() has this assumption:
>>
>> 		/*
>> 		 * update_mmu_cache() is called between pte_offset_map_lock()
>> 		 * and pte_unmap_unlock(), so we can assume that ptep is not
>> 		 * NULL here: and what should be done below if it were NULL?
>> 		 */
>>
>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>> here, could __update_tlb() hit the none PMD, get NULL from
>> pte_offset_map(), and then dereference it?
> 
> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> even spells that out. :(
> 
> Do you know about other code like that, or is MIPS the only one doing a
> re-walk and crossing fingers?
> 
>>
>> Just wanted to raise it since some arch code may still have assumptions
>> like this, and the always-enable-mTHP work is getting closer ...
> 
> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
> 
>>
>> Probably very very very hard to hit, though :)
> 
> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
> 
> Maybe we could reinstall the page table with the cleared (none) entries while
> still holding the PTL?
> 
> Thinking out loud:
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ba298d420b7..e39b750b1e6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>                  map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>          } else {
>                  /*
> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> -                * to be none. The pmd entry is then repopulated below.
> +                * Re-insert the page table with the cleared entries, but
> +                * hold the PTL, such that no one can mess with the re-installed
> +                * page table until we updated the temporarily-cleared entries
> +                * through map_anon_folio_pte_nopf().
>                   */
> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */

One small thing, I think we should probably keep the smp_wmb(), and just
move it before the earlier pmd_populate().

IIUC, the ordering we want is still:

   clear old PTEs
   smp_wmb()
   pmd_populate()

so another CPU cannot walk through the re-installed PMD and still observe
the old PTEs, right?

> +               if (pte_ptl != pmd_ptl)
> +                       spin_lock(pte_ptl);
>                  pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +               if (pte_ptl != pmd_ptl)
> +                       spin_unlock(pte_ptl);
>          }
>          spin_unlock(pmd_ptl);
>   

Cheers, Lance



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  9:08           ` Lance Yang
@ 2026-06-01 10:23             ` David Hildenbrand (Arm)
  2026-06-01 10:47               ` Lance Yang
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 10:23 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/1/26 11:08, Lance Yang wrote:
> 
> 
> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>> On 6/1/26 05:28, Lance Yang wrote:
>>>
>>>
>>> Ah, fair point.
>>>
>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>>> than only using the pte pointer passed in. For example, mips does:
>>
>> Right, a re-walk would be the real problem.
>>
>>>
>>>    update_mmu_cache_range()
>>>      -> __update_tlb()
>>>        -> pgd_offset(vma->vm_mm, address)
>>>        -> pte_offset_map(...)
>>>
>>> and __update_tlb() has this assumption:
>>>
>>>         /*
>>>          * update_mmu_cache() is called between pte_offset_map_lock()
>>>          * and pte_unmap_unlock(), so we can assume that ptep is not
>>>          * NULL here: and what should be done below if it were NULL?
>>>          */
>>>
>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>>> here, could __update_tlb() hit the none PMD, get NULL from
>>> pte_offset_map(), and then dereference it?
>>
>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>> even spells that out. :(
>>
>> Do you know about other code like that, or is MIPS the only one doing a
>> re-walk and crossing fingers?
>>
>>>
>>> Just wanted to raise it since some arch code may still have assumptions
>>> like this, and the always-enable-mTHP work is getting closer ...
>>
>> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
>> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
>>
>>>
>>> Probably very very very hard to hit, though :)
>>
>> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
>> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>>
>> Maybe we could reinstall the page table with the cleared (none) entries while
>> still holding the PTL?
>>
>> Thinking out loud:
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 5ba298d420b7..e39b750b1e6f 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>>                  map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>>          } else {
>>                  /*
>> -                * set_ptes is called in map_anon_folio_pte_nopf with the
>> -                * pmd_ptl lock still held; this is safe as the PMD is expected
>> -                * to be none. The pmd entry is then repopulated below.
>> +                * Re-insert the page table with the cleared entries, but
>> +                * hold the PTL, such that no one can mess with the re-installed
>> +                * page table until we updated the temporarily-cleared entries
>> +                * through map_anon_folio_pte_nopf().
>>                   */
>> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> 
> One small thing, I think we should probably keep the smp_wmb(), and just
> move it before the earlier pmd_populate().
> 
> IIUC, the ordering we want is still:
> 
>   clear old PTEs
>   smp_wmb()
>   pmd_populate()
> 
> so another CPU cannot walk through the re-installed PMD and still observe
> the old PTEs, right?

There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 10:23             ` David Hildenbrand (Arm)
@ 2026-06-01 10:47               ` Lance Yang
  2026-06-01 11:13                 ` David Hildenbrand (Arm)
  2026-06-02 15:30                 ` Nico Pache
  0 siblings, 2 replies; 114+ messages in thread
From: Lance Yang @ 2026-06-01 10:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif



On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> On 6/1/26 11:08, Lance Yang wrote:
>>
>>
>> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>>> On 6/1/26 05:28, Lance Yang wrote:
>>>>
>>>>
>>>> Ah, fair point.
>>>>
>>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>>>> than only using the pte pointer passed in. For example, mips does:
>>>
>>> Right, a re-walk would be the real problem.
>>>
>>>>
>>>>     update_mmu_cache_range()
>>>>       -> __update_tlb()
>>>>         -> pgd_offset(vma->vm_mm, address)
>>>>         -> pte_offset_map(...)
>>>>
>>>> and __update_tlb() has this assumption:
>>>>
>>>>          /*
>>>>           * update_mmu_cache() is called between pte_offset_map_lock()
>>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
>>>>           * NULL here: and what should be done below if it were NULL?
>>>>           */
>>>>
>>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>>>> here, could __update_tlb() hit the none PMD, get NULL from
>>>> pte_offset_map(), and then dereference it?
>>>
>>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>>> even spells that out. :(
>>>
>>> Do you know about other code like that, or is MIPS the only one doing a
>>> re-walk and crossing fingers?
>>>
>>>>
>>>> Just wanted to raise it since some arch code may still have assumptions
>>>> like this, and the always-enable-mTHP work is getting closer ...
>>>
>>> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
>>> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
>>>
>>>>
>>>> Probably very very very hard to hit, though :)
>>>
>>> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
>>> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>>>
>>> Maybe we could reinstall the page table with the cleared (none) entries while
>>> still holding the PTL?
>>>
>>> Thinking out loud:
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 5ba298d420b7..e39b750b1e6f 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>>>                   map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>>>           } else {
>>>                   /*
>>> -                * set_ptes is called in map_anon_folio_pte_nopf with the
>>> -                * pmd_ptl lock still held; this is safe as the PMD is expected
>>> -                * to be none. The pmd entry is then repopulated below.
>>> +                * Re-insert the page table with the cleared entries, but
>>> +                * hold the PTL, such that no one can mess with the re-installed
>>> +                * page table until we updated the temporarily-cleared entries
>>> +                * through map_anon_folio_pte_nopf().
>>>                    */
>>> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>>> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>>
>> One small thing, I think we should probably keep the smp_wmb(), and just
>> move it before the earlier pmd_populate().
>>
>> IIUC, the ordering we want is still:
>>
>>    clear old PTEs
>>    smp_wmb()
>>    pmd_populate()
>>
>> so another CPU cannot walk through the re-installed PMD and still observe
>> the old PTEs, right?
> 
> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?

Ah, cool! __folio_mark_uptodate() already does the job :P

So yeah, no extra smp_wmb() needed here!
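
The ordering discussed above (clear the old PTEs, barrier, then make the
table reachable through the PMD) can be sketched in userspace C11 roughly
as follows. This is a toy model, not kernel code: toy_pmd, toy_publish()
and toy_walk() are hypothetical names, and a release store only
approximates smp_wmb() followed by pmd_populate() (it is somewhat stronger
than a pure write barrier). The acquire load on the reader side is also an
assumption of the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_PTRS_PER_PTE 8

/* Toy stand-in for a PTE table; not a kernel type. */
struct toy_pte_table {
	unsigned long pte[TOY_PTRS_PER_PTE];
};

/* The "PMD" slot: NULL models pmd_none(), otherwise it points at a table. */
static _Atomic(struct toy_pte_table *) toy_pmd;

/* Writer: clear the old entries first, then publish the table. */
static void toy_publish(struct toy_pte_table *table)
{
	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++)
		table->pte[i] = 0;

	/* Release store: models smp_wmb() followed by pmd_populate(). */
	atomic_store_explicit(&toy_pmd, table, memory_order_release);
}

/*
 * Reader: returns -1 while the PMD is none, otherwise the number of
 * stale (non-zero) entries it could still observe through the PMD.
 */
static int toy_walk(void)
{
	struct toy_pte_table *table =
		atomic_load_explicit(&toy_pmd, memory_order_acquire);
	int stale = 0;

	if (!table)
		return -1;
	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++)
		stale += table->pte[i] != 0;
	return stale;
}
```

With this pairing, a walker that sees the published pointer also sees the
cleared entries. Whether an existing barrier already provides that ordering
in the real code path is exactly the question answered above.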

Cheers, Lance



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 10:47               ` Lance Yang
@ 2026-06-01 11:13                 ` David Hildenbrand (Arm)
  2026-06-01 15:00                   ` Nico Pache
  2026-06-02 15:30                 ` Nico Pache
  1 sibling, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 11:13 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/1/26 12:47, Lance Yang wrote:
> 
> 
> On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
>> On 6/1/26 11:08, Lance Yang wrote:
>>>
>>>
>>>
>>> One small thing, I think we should probably keep the smp_wmb(), and just
>>> move it before the earlier pmd_populate().
>>>
>>> IIUC, the ordering we want is still:
>>>
>>>    clear old PTEs
>>>    smp_wmb()
>>>    pmd_populate()
>>>
>>> so another CPU cannot walk through the re-installed PMD and still observe
>>> the old PTEs, right?
>>
>> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> 
> Ah, cool! __folio_mark_uptodate() already does the job :P
> 
> So yeah, no extra smp_wmb() needed here!

Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
draft.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 11:13                 ` David Hildenbrand (Arm)
@ 2026-06-01 15:00                   ` Nico Pache
  2026-06-01 15:05                     ` David Hildenbrand (Arm)
  2026-06-04 17:04                     ` Nico Pache
  0 siblings, 2 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-01 15:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 6/1/26 12:47, Lance Yang wrote:
> >
> >
> > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> >> On 6/1/26 11:08, Lance Yang wrote:
> >>>
> >>>
> >>>
> >>> One small thing, I think we should probably keep the smp_wmb(), and just
> >>> move it before the earlier pmd_populate().
> >>>
> >>> IIUC, the ordering we want is still:
> >>>
> >>>    clear old PTEs
> >>>    smp_wmb()
> >>>    pmd_populate()
> >>>
> >>> so another CPU cannot walk through the re-installed PMD and still observe
> >>> the old PTEs, right?
> >>
> >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> >
> > Ah, cool! __folio_mark_uptodate() already does the job :P
> >
> > So yeah, no extra smp_wmb() needed here!
>
> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> draft.

Okay, I read the above and did some investigating.

I will try to implement and verify the changes you suggested :)

Or an even crazier idea... what if we ensure MIPS checks for pmd_none()
before walking a PTE table?

-- Nico

>
> --
> Cheers,
>
> David
>



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 15:00                   ` Nico Pache
@ 2026-06-01 15:05                     ` David Hildenbrand (Arm)
  2026-06-01 16:07                       ` Lance Yang
  2026-06-04 17:04                     ` Nico Pache
  1 sibling, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 15:05 UTC (permalink / raw)
  To: Nico Pache
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/1/26 17:00, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> On 6/1/26 12:47, Lance Yang wrote:
>>>
>>>
>>>
>>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>>
>>> So yeah, no extra smp_wmb() needed here!
>>
>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>> draft.
> 
> Okay, I read the above and did some investigating.
> 
> I will try to implement and verify the changes you suggested :)
> 
> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> before walking a PTE table?

But how would they update the cache then correctly?

I'm too non-MIPS to know the answer :)

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 15:05                     ` David Hildenbrand (Arm)
@ 2026-06-01 16:07                       ` Lance Yang
  0 siblings, 0 replies; 114+ messages in thread
From: Lance Yang @ 2026-06-01 16:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif



On 2026/6/1 23:05, David Hildenbrand (Arm) wrote:
> On 6/1/26 17:00, Nico Pache wrote:
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>> On 6/1/26 12:47, Lance Yang wrote:
>>>>
>>>>
>>>>
>>>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>>>
>>>> So yeah, no extra smp_wmb() needed here!
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
>>
>> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
>> before walking a PTE table?
> 
> But how would they update the cache then correctly?
> 
> I'm too non-MIPS to know the answer :)

Right, that's my concern as well ...

If MIPS sees pmd_none(), it has no PTE table to walk, so it also has
no way to do the cache update it wanted to do, I guess :)

But, the PTE table is not really gone there. khugepaged only cleared
the PMD temporarily while still using the old PTE table through _pmd.
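
The difference between the two orderings can be sketched with a small
userspace model. Everything here is a hypothetical toy stand-in (the toy_*
names are not kernel APIs), and the "arch hook" only mimics a re-walk from
the mm. It returns -1 at the point where a hook without a NULL check would
dereference NULL.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTRS_PER_PTE 8

/* Toy stand-ins for a PTE table and an mm; not kernel types. */
struct toy_pte_table {
	unsigned long pte[TOY_PTRS_PER_PTE];
};

struct toy_mm {
	struct toy_pte_table *pmd;	/* NULL models pmd_none() */
};

/* Models a re-walk from the mm: yields NULL while the PMD is none. */
static unsigned long *toy_pte_offset_map(struct toy_mm *mm, size_t idx)
{
	return mm->pmd ? &mm->pmd->pte[idx] : NULL;
}

/*
 * Models an arch hook that re-walks instead of using the passed ptep.
 * Returns -1 where a hook without a NULL check would dereference NULL.
 */
static int toy_update_mmu_cache(struct toy_mm *mm, size_t idx)
{
	unsigned long *ptep = toy_pte_offset_map(mm, idx);

	return ptep ? 0 : -1;
}

/* Order 1: fill the entry and call the hook while the PMD is still none. */
static int toy_map_then_install(struct toy_mm *mm, struct toy_pte_table *old)
{
	int ret;

	mm->pmd = NULL;		/* PMD temporarily cleared */
	old->pte[0] = 0x1000;	/* table still reachable via the saved pointer */
	ret = toy_update_mmu_cache(mm, 0);
	mm->pmd = old;		/* re-install afterwards */
	return ret;
}

/* Order 2: re-install the table first, then fill the entry and call the hook. */
static int toy_install_then_map(struct toy_mm *mm, struct toy_pte_table *old)
{
	mm->pmd = NULL;		/* PMD temporarily cleared */
	mm->pmd = old;		/* re-install before touching the entries */
	old->pte[0] = 0x1000;
	return toy_update_mmu_cache(mm, 0);
}
```

In this model only the second order gives the re-walking hook a table to
find. Locking is left out entirely, so it says nothing about which locks
are needed.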

So I'd go with David's suggestion:

"
Best to make sure the page table is already installed when updating
the entries.
"

Cheers, Lance


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 15:00                   ` Nico Pache
  2026-06-01 15:05                     ` David Hildenbrand (Arm)
@ 2026-06-04 17:04                     ` Nico Pache
  2026-06-04 18:12                       ` Lorenzo Stoakes
  2026-06-05  7:18                       ` David Hildenbrand (Arm)
  1 sibling, 2 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-04 17:04 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>
> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >
> > On 6/1/26 12:47, Lance Yang wrote:
> > >
> > >
> > > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > >> On 6/1/26 11:08, Lance Yang wrote:
> > >>>
> > >>>
> > >>>
> > >>> One small thing, I think we should probably keep the smp_wmb(), and just
> > >>> move it before the earlier pmd_populate().
> > >>>
> > >>> IIUC, the ordering we want is still:
> > >>>
> > >>>    clear old PTEs
> > >>>    smp_wmb()
> > >>>    pmd_populate()
> > >>>
> > >>> so another CPU cannot walk through the re-installed PMD and still observe
> > >>> the old PTEs, right?
> > >>
> > >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> > >
> > > Ah, cool! __folio_mark_uptodate() already does the job :P
> > >
> > > So yeah, no extra smp_wmb() needed here!
> >
> > Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> > draft.
>
> Okay, I read the above and did some investigating.
>
> I will try to implement and verify the changes you suggested :)

I've implemented something slightly different actually and I *think* it's better!

        } else {
                /* this is map_anon_folio_pte_nopf with no mmu update */
                __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
                                          /*uffd_wp=*/ false);
                smp_wmb();
                pmd_populate(mm, pmd, pmd_pgtable(_pmd));
                /*
                 * Some architectures (e.g. MIPS) walk the live page table in
                 * their implementation. update_mmu_cache_range() must be called
                 * with a valid page table hierarchy and the PTE lock held.
                 * Acquire it nested inside pmd_ptl when they are distinct locks.
                 */
                if (pte_ptl != pmd_ptl)
                        spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
                update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
                if (pte_ptl != pmd_ptl)
                        spin_unlock(pte_ptl);
        }
        spin_unlock(pmd_ptl);

The logic here is that by the time the PMD becomes visible, the PTEs are
already populated (no possibility of spurious faults on the local CPU).

The smp_wmb() makes sure of the above.

And the pmd is installed with the pte and pmd lock both held through
the mmu_cache update.

This follows the conventions used in pmd_install() and clears the
potential for local CPU faults hitting cleared PTE entries.

I think both approaches are correct, but this one rules out the spurious
faults from my first point, although mmap_write_lock prevents those too.

Let me know what you think. I can revert to your implementation but
this is what I tested.

Cheers,
-- Nico

>
> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> before walking a PTE table?
>
> -- Nico
>
> >
> > --
> > Cheers,
> >
> > David
> >



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 17:04                     ` Nico Pache
@ 2026-06-04 18:12                       ` Lorenzo Stoakes
  2026-06-05  7:18                       ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 18:12 UTC (permalink / raw)
  To: Nico Pache
  Cc: David Hildenbrand (Arm), Lance Yang, linux-doc, linux-kernel,
	linux-mm, linux-trace-kernel, aarcange, akpm, anshuman.khandual,
	apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
	corbet, dave.hansen, dev.jain, gourry, hannes, hughd, jack,
	jackmanb, jannh, jglisse, joshua.hahnjy, kas, liam,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	usama.arif

On Thu, Jun 04, 2026 at 11:04:35AM -0600, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> > >
> > > On 6/1/26 12:47, Lance Yang wrote:
> > > >
> > > >
> > > > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > > >> On 6/1/26 11:08, Lance Yang wrote:
> > > >>>
> > > >>>
> > > >>>
> > > >>> One small thing, I think we should probably keep the smp_wmb(), and just
> > > >>> move it before the earlier pmd_populate().
> > > >>>
> > > >>> IIUC, the ordering we want is still:
> > > >>>
> > > >>>    clear old PTEs
> > > >>>    smp_wmb()
> > > >>>    pmd_populate()
> > > >>>
> > > >>> so another CPU cannot walk through the re-installed PMD and still observe
> > > >>> the old PTEs, right?
> > > >>
> > > >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> > > >
> > > > Ah, cool! __folio_mark_uptodate() already does the job :P
> > > >
> > > > So yeah, no extra smp_wmb() needed here!
> > >
> > > Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> > > draft.
> >
> > Okay, I read the above and did some investigating.
> >
> > I will try to implement and verify the changes you suggested :)
>
> I've implemented something slightly different actually and I *think* its better!
>
> } else {
>        /* this is map_anon_folio_pte_nopf with no mmu update */
>         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
>                       /*uffd_wp=*/ false);
>        smp_wmb();
>         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>         /*
>          * Some architectures (e.g. MIPS) walk the live page table in
>          * their implementation. update_mmu_cache_range() must be called
>          * with a valid page table hierarchy and the PTE lock held.
>          * Acquire it nested inside pmd_ptl when they are distinct locks.
>          */
>         if (pte_ptl != pmd_ptl)
>             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
>         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
>         if (pte_ptl != pmd_ptl)
>             spin_unlock(pte_ptl);
>     }
> spin_unlock(pmd_ptl);
>
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
>
> the SMP_WMB makes sure of the above
>
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
>
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.
>
> I think both approaches are correct but this prevents any possibility
> of my first point. although mmap_write_lock prevents this too.
>
> Let me know what you think. I can revert to your implementation but
> this is what I tested.

Yeah let's go with the original implementation please :)

Thanks!

>
> Cheers,
> -- Nico
>
> >
> > Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> > before walking a PTE table?
> >
> > -- Nico
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David
> > >
>

Cheers, Lorenzo


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 17:04                     ` Nico Pache
  2026-06-04 18:12                       ` Lorenzo Stoakes
@ 2026-06-05  7:18                       ` David Hildenbrand (Arm)
  2026-06-05  8:07                         ` Lorenzo Stoakes
  1 sibling, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-05  7:18 UTC (permalink / raw)
  To: Nico Pache
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 6/4/26 19:04, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>>
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
> 
> I've implemented something slightly different actually and I *think* its better!
> 
> } else {
>        /* this is map_anon_folio_pte_nopf with no mmu update */
>         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
>                       /*uffd_wp=*/ false);
>        smp_wmb();
>         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>         /*
>          * Some architectures (e.g. MIPS) walk the live page table in
>          * their implementation. update_mmu_cache_range() must be called
>          * with a valid page table hierarchy and the PTE lock held.
>          * Acquire it nested inside pmd_ptl when they are distinct locks.
>          */
>         if (pte_ptl != pmd_ptl)
>             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
>         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
>         if (pte_ptl != pmd_ptl)
>             spin_unlock(pte_ptl);
>     }
> spin_unlock(pmd_ptl);
> 
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
> 
> the SMP_WMB makes sure of the above
> 
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
> 
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.

After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
PMD already? So the case here is rather different.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-05  7:18                       ` David Hildenbrand (Arm)
@ 2026-06-05  8:07                         ` Lorenzo Stoakes
  2026-06-05  8:59                           ` Lance Yang
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-05  8:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, Lance Yang, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif

On Fri, Jun 05, 2026 at 09:18:27AM +0200, David Hildenbrand (Arm) wrote:
> On 6/4/26 19:04, Nico Pache wrote:
> > On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
> >>
> >> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >>>
> >>>
> >>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> >>> draft.
> >>
> >> Okay, I read the above and did some investigating.
> >>
> >> I will try to implement and verify the changes you suggested :)
> >
> > I've implemented something slightly different actually and I *think* its better!
> >
> > } else {
> >        /* this is map_anon_folio_pte_nopf with no mmu update */
> >         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> >                       /*uffd_wp=*/ false);
> >        smp_wmb();
> >         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >         /*
> >          * Some architectures (e.g. MIPS) walk the live page table in
> >          * their implementation. update_mmu_cache_range() must be called
> >          * with a valid page table hierarchy and the PTE lock held.
> >          * Acquire it nested inside pmd_ptl when they are distinct locks.
> >          */
> >         if (pte_ptl != pmd_ptl)
> >             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> >         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> >         if (pte_ptl != pmd_ptl)
> >             spin_unlock(pte_ptl);
> >     }
> > spin_unlock(pmd_ptl);
> >
> > The logic here is that when the PMD becomes visible, PTEs are already
> > populated (no possibility of spurious faults on local CPU)
> >
> > the SMP_WMB makes sure of the above

The locks prevent those 'spurious' (really: incorrect) faults anyway so I don't
think this is necessary.

> >
> > And the pmd is installed with the pte and pmd lock both held through
> > the mmu_cache update.
> >
> > This follows the conventions used in pmd_install() and clears the
> > potential for local CPU faults hitting cleared PTE entries.
>
> After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
> PMD already? So the case here is rather different.

Yeah conceptually the code above is problematic because you immediately make the
PTE available right at the point you populate, so taking a PTE lock after that
is rather shutting the stable door after the horse has bolted.

Doing it this way is not a good idea in any case because we're adding
complexity, an extra function and an open-coded cache maintenance call for
really no benefit.

I asked Nico to abstract the anon folio mapping stuff explicitly so we could
avoid this sort of duplication so let's not roll that back :)

So again, I think going with the original suggestion (with an updated comment)
is the right thing to do.


Anyway, an aside: in practice we can't have page faults here, right? The VMA is:

- Ensured to span at least the PMD range (this isn't immediately obvious in the
  code)
- VMA write locked (mmap write lock held)

And we hold the anon_vma lock so no rmap walkers can walk the page tables here
either.

So I actually wonder, given that, whether we need the PTE PTL at all.

But.

At this stage it'll almost certainly be an owned exclusive cache line so it's
very low cost to do it, and it means we honour the update_mmu_cache_range()
contract.

And it also makes it clear that we're gating changes on the PTE being
untouchable so any future stuff that maybe changes some of these rules doesn't
get caught out.

So probably worth keeping.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-05  8:07                         ` Lorenzo Stoakes
@ 2026-06-05  8:59                           ` Lance Yang
  0 siblings, 0 replies; 114+ messages in thread
From: Lance Yang @ 2026-06-05  8:59 UTC (permalink / raw)
  To: ljs, david, npache
  Cc: lance.yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif


On Fri, Jun 05, 2026 at 09:07:23AM +0100, Lorenzo Stoakes wrote:
>On Fri, Jun 05, 2026 at 09:18:27AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/4/26 19:04, Nico Pache wrote:
>> > On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>> >>
>> >> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>> >>>
>> >>>
>> >>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>> >>> draft.
>> >>
>> >> Okay, I read the above and did some investigating.
>> >>
>> >> I will try to implement and verify the changes you suggested :)
>> >
>> > I've implemented something slightly different actually and I *think* it's better!
>> >
>> > } else {
>> >        /* this is map_anon_folio_pte_nopf with no mmu update */
>> >         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
>> >                       /*uffd_wp=*/ false);
>> >        smp_wmb();
>> >         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> >         /*
>> >          * Some architectures (e.g. MIPS) walk the live page table in
>> >          * their implementation. update_mmu_cache_range() must be called
>> >          * with a valid page table hierarchy and the PTE lock held.
>> >          * Acquire it nested inside pmd_ptl when they are distinct locks.
>> >          */
>> >         if (pte_ptl != pmd_ptl)
>> >             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
>> >         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
>> >         if (pte_ptl != pmd_ptl)
>> >             spin_unlock(pte_ptl);
>> >     }
>> > spin_unlock(pmd_ptl);
>> >
>> > The logic here is that when the PMD becomes visible, PTEs are already
>> > populated (no possibility of spurious faults on local CPU)
>> >
>> > the SMP_WMB makes sure of the above
>
>The locks prevent those 'spurious' (really: incorrect) faults anyway so I don't
>think this is necessary.
>
>> >
>> > And the pmd is installed with the pte and pmd lock both held through
>> > the mmu_cache update.
>> >
>> > This follows the conventions used in pmd_install() and clears the
>> > potential for local CPU faults hitting cleared PTE entries.
>>
>> After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
>> PMD already? So the case here is rather different.

The issue I was worried about: update_mmu_cache_range() can re-walk
vma->vm_mm while the PTE page table is still not reachable through the
PMD. And, yeah, that assumption is ugly, but it is what it is, and there
may be similar code elsewhere ...

So the ordering we need is "the PMD points to the PTE page table from
_pmd before update_mmu_cache_range()", not "new PTEs before PMD".

Those PTEs are cleared, but we hold the PTL, so nobody else can install
anything there :)

So David's original suggestion looks enough to me:

if (pte_ptl != pmd_ptl)
        spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);

pmd_populate();
map_anon_folio_pte_nopf();

if (pte_ptl != pmd_ptl)
        spin_unlock(pte_ptl);
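
As a rough userspace illustration of why the `pte_ptl != pmd_ptl` check matters (this is not kernel code): the two pointers can refer to the same lock, for example when split page table locks are not in use, and taking it a second time would self-deadlock. The `toy_lock` type and `toy_reinstall_and_map()` are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: records whether it is held so a double acquisition is detectable. */
struct toy_lock {
	bool held;
	int acquisitions;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a real spinlock would self-deadlock here */
	l->held = true;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * Hypothetical model of the tail of the collapse path: pmd_ptl is taken
 * first and pte_ptl is taken inside it only when it is a different lock.
 * Returns true if the "populate and map" step ran with both locks held.
 */
static bool toy_reinstall_and_map(struct toy_lock *pmd_ptl,
				  struct toy_lock *pte_ptl)
{
	bool both_held;

	toy_lock_acquire(pmd_ptl);
	if (pte_ptl != pmd_ptl)
		toy_lock_acquire(pte_ptl);	/* kernel: spin_lock_nested() */

	/* kernel: pmd_populate(), then map_anon_folio_pte_nopf() */
	both_held = pmd_ptl->held && pte_ptl->held;

	if (pte_ptl != pmd_ptl)
		toy_lock_release(pte_ptl);
	toy_lock_release(pmd_ptl);
	return both_held;
}
```

Calling `toy_reinstall_and_map(&l, &l)` with one shared lock acquires it exactly once, while passing two distinct locks acquires each once, in PMD-then-PTE order.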

>Yeah conceptually the code above is problematic because you immediately make the
>PTE available right at the point you populate, so taking a PTE lock after that
>is rather shutting the stable door after the horse has bolted.
>
>Doing it this way is not a good idea in any case because we're adding
>complexity, an extra function and an open-coded cache maintenance call for
>really no benefit.
>
>I asked Nico to abstract the anon folio mapping stuff explicitly so we could
>avoid this sort of duplication so let's not roll that back :)
>
>So again, I think going with the original suggestion (with an updated comment)
>is the right thing to do.
>
>
>Anyway, an aside: in practice we can't have page faults here, right? The VMA is:
>
>- Ensured to span at least the PMD range (this isn't immediately obvious in the
>  code)
>- VMA write locked (mmap write lock held)
>
>And we hold the anon_vma lock so no rmap walkers can walk the page tables here
>either.
>
>So I actually wonder, given that, whether we need the PTE PTL at all.

I'd keep it. Cheap, and lets us sleep better at night :P

>But.
>
>At this stage it'll almost certainly be an owned exclusive cache line so it's
>very low cost to do it, and it means we honour the update_mmu_cache_range()
>contract.
>
>And it also makes it clear that we're gating changes on the PTE being
>untouchable so any future stuff that maybe changes some of these rules doesn't
>get caught out.
>
>So probably worth keeping.

Yes!

Cheers, Lance

>>
>> --
>> Cheers,
>>
>> David
>
>Thanks, Lorenzo
>


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01 10:47               ` Lance Yang
  2026-06-01 11:13                 ` David Hildenbrand (Arm)
@ 2026-06-02 15:30                 ` Nico Pache
  2026-06-02 16:34                   ` Lance Yang
  1 sibling, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-02 15:30 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif

On Mon, Jun 1, 2026 at 4:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > On 6/1/26 11:08, Lance Yang wrote:
> >>
> >>
> >> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
> >>> On 6/1/26 05:28, Lance Yang wrote:
> >>>>
> >>>>
> >>>> Ah, fair point.
> >>>>
> >>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> >>>> than only using the pte pointer passed in. For example, mips does:
> >>>
> >>> Right, a re-walk would be the real problem.
> >>>
> >>>>
> >>>>     update_mmu_cache_range()
> >>>>       -> __update_tlb()
> >>>>         -> pgd_offset(vma->vm_mm, address)
> >>>>         -> pte_offset_map(...)
> >>>>
> >>>> and __update_tlb() has this assumption:
> >>>>
> >>>>          /*
> >>>>           * update_mmu_cache() is called between pte_offset_map_lock()
> >>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
> >>>>           * NULL here: and what should be done below if it were NULL?
> >>>>           */
> >>>>
> >>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
> >>>> here, could __update_tlb() hit the none PMD, get NULL from
> >>>> pte_offset_map(), and then dereference it?
> >>>
> >>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> >>> even spells that out. :(
> >>>
> >>> Do you know about other code like that, or is MIPS the only one doing a
> >>> re-walk and crossing fingers?
> >>>
> >>>>
> >>>> Just wanted to raise it since some arch code may still have assumptions
> >>>> like this, and the always-enable-mTHP work is getting closer ...
> >>>
> >>> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> >>> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
> >>>
> >>>>
> >>>> Probably very very very hard to hit, though :)
> >>>
> >>> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> >>> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
> >>>
> >>> Maybe we could reinstall the page table with the cleared (none) entries while
> >>> still holding the PTL?
> >>>
> >>> Thinking out loud:
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 5ba298d420b7..e39b750b1e6f 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >>>                   map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> >>>           } else {
> >>>                   /*
> >>> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> >>> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> >>> -                * to be none. The pmd entry is then repopulated below.
> >>> +                * Re-insert the page table with the cleared entries, but
> >>> +                * hold the PTL, such that no one can mess with the re-installed
> >>> +                * page table until we updated the temporarily-cleared entries
> >>> +                * through map_anon_folio_pte_nopf().
> >>>                    */
> >>> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> >>> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> >>
> >> One small thing, I think we should probably keep the smp_wmb(), and just
> >> move it before the earlier pmd_populate().
> >>
> >> IIUC, the ordering we want is still:
> >>
> >>    clear old PTEs
> >>    smp_wmb()
> >>    pmd_populate()
> >>
> >> so another CPU cannot walk through the re-installed PMD and still observe
> >> the old PTEs, right?
> >
> > There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
>
> Ah, cool! __folio_mark_uptodate() already does the job :P
>
> So yeah, no extra smp_wmb() needed here!

Are we sure? The __folio_mark_uptodate() call happens before the PTEs are
reinstalled. Then we reinstall the PMD right after. The two steps are
currently separated by the smp_wmb().

I was copying this from other THP code that performs similar PTE/PMD juggling.

I can remove it, but I'd rather err on the side of caution with this.

>
> Cheers, Lance
>



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-02 15:30                 ` Nico Pache
@ 2026-06-02 16:34                   ` Lance Yang
  0 siblings, 0 replies; 114+ messages in thread
From: Lance Yang @ 2026-06-02 16:34 UTC (permalink / raw)
  To: npache
  Cc: lance.yang, david, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif


On Tue, Jun 02, 2026 at 09:30:06AM -0600, Nico Pache wrote:
>On Mon, Jun 1, 2026 at 4:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
>> > On 6/1/26 11:08, Lance Yang wrote:
>> >>
>> >>
>> >> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>> >>> On 6/1/26 05:28, Lance Yang wrote:
>> >>>>
>> >>>>
>> >>>> Ah, fair point.
>> >>>>
>> >>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>> >>>> than only using the pte pointer passed in. For example, mips does:
>> >>>
>> >>> Right, a re-walk would be the real problem.
>> >>>
>> >>>>
>> >>>>     update_mmu_cache_range()
>> >>>>       -> __update_tlb()
>> >>>>         -> pgd_offset(vma->vm_mm, address)
>> >>>>         -> pte_offset_map(...)
>> >>>>
>> >>>> and __update_tlb() has this assumption:
>> >>>>
>> >>>>          /*
>> >>>>           * update_mmu_cache() is called between pte_offset_map_lock()
>> >>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
>> >>>>           * NULL here: and what should be done below if it were NULL?
>> >>>>           */
>> >>>>
>> >>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>> >>>> here, could __update_tlb() hit the none PMD, get NULL from
>> >>>> pte_offset_map(), and then dereference it?
>> >>>
>> >>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>> >>> even spells that out. :(
>> >>>
>> >>> Do you know about other code like that, or is MIPS the only one doing a
>> >>> re-walk and crossing fingers?
>> >>>
>> >>>>
>> >>>> Just wanted to raise it since some arch code may still have assumptions
>> >>>> like this, and the always-enable-mTHP work is getting closer ...
>> >>>
>> >>> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
>> >>> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
>> >>>
>> >>>>
>> >>>> Probably very very very hard to hit, though :)
>> >>>
>> >>> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
>> >>> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>> >>>
>> >>> Maybe we could reinstall the page table with the cleared (none) entries while
>> >>> still holding the PTL?
>> >>>
>> >>> Thinking out loud:
>> >>>
>> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> >>> index 5ba298d420b7..e39b750b1e6f 100644
>> >>> --- a/mm/khugepaged.c
>> >>> +++ b/mm/khugepaged.c
>> >>> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>> >>>                   map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> >>>           } else {
>> >>>                   /*
>> >>> -                * set_ptes is called in map_anon_folio_pte_nopf with the
>> >>> -                * pmd_ptl lock still held; this is safe as the PMD is expected
>> >>> -                * to be none. The pmd entry is then repopulated below.
>> >>> +                * Re-insert the page table with the cleared entries, but
>> >>> +                * hold the PTL, such that no one can mess with the re-installed
>> >>> +                * page table until we updated the temporarily-cleared entries
>> >>> +                * through map_anon_folio_pte_nopf().
>> >>>                    */
>> >>> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> >>> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>> >>
>> >> One small thing, I think we should probably keep the smp_wmb(), and just
>> >> move it before the earlier pmd_populate().
>> >>
>> >> IIUC, the ordering we want is still:
>> >>
>> >>    clear old PTEs
>> >>    smp_wmb()
>> >>    pmd_populate()
>> >>
>> >> so another CPU cannot walk through the re-installed PMD and still observe
>> >> the old PTEs, right?
>> >
>> > There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
>>
>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>
>> So yeah, no extra smp_wmb() needed here!
>
>Are we sure? The __folio_mark_uptodate() call happens before the PTEs are
>reinstalled. Then we reinstall the PMD right after. The two steps are
>currently separated by the smp_wmb().

Reinstalling the PMD first makes the PTE table reachable again, right?

So before pmd_populate(), we only need to order the old PTE clears before
the PTE table is reachable again; __folio_mark_uptodate() already has the
smp_wmb() for that :)

The new PTEs are filled later under the PTL.

Hopefully I didn't miss something :)
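
To make the ordering argument concrete, here is a minimal userspace sketch of the "clear the entries, barrier, then publish the table" pattern. It uses C11 atomics, with `atomic_thread_fence(memory_order_release)` as an assumed stand-in for smp_wmb(); it is not how the kernel implements either primitive, and all `toy_` names are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_NR_ENTRIES 8

struct toy_table {
	unsigned long entries[TOY_NR_ENTRIES];
};

/* Stands in for the PMD entry: NULL means "none", otherwise a reachable table. */
static _Atomic(struct toy_table *) toy_pmd;

/* Writer: clear the old entries, then make the table reachable again. */
static void toy_clear_and_publish(struct toy_table *t)
{
	for (size_t i = 0; i < TOY_NR_ENTRIES; i++)
		t->entries[i] = 0;

	/* Assumed userspace analogue of smp_wmb(): clears before the publish. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&toy_pmd, t, memory_order_relaxed);
}

/*
 * Reader: walk through the published pointer and count entries that are
 * still non-zero. Returns -1 while the table is not reachable.
 */
static int toy_count_stale(void)
{
	struct toy_table *t = atomic_load_explicit(&toy_pmd,
						   memory_order_acquire);
	int stale = 0;

	if (!t)
		return -1;
	for (size_t i = 0; i < TOY_NR_ENTRIES; i++)
		if (t->entries[i])
			stale++;
	return stale;
}
```

A single-threaded run only shows the shape of the pattern, not the memory-ordering guarantee itself: before the publish the reader sees no table, and after it the reader sees only cleared entries.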

>I was copying this from other THP code that performs similar PTE/PMD juggling.
>
>I can remove it, but I'd rather err on the side of caution with this.

Cheers, Lance


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-01  6:54         ` David Hildenbrand (Arm)
  2026-06-01  7:49           ` Lance Yang
  2026-06-01  9:08           ` Lance Yang
@ 2026-06-04 12:33           ` Lorenzo Stoakes
  2 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 12:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif

On Mon, Jun 01, 2026 at 08:54:24AM +0200, David Hildenbrand (Arm) wrote:
> On 6/1/26 05:28, Lance Yang wrote:
> >
> > On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
> >> On 5/31/26 11:39, Lance Yang wrote:
> >>>
> >>>
> >>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> >>>
> >>> At this point pmdp_collapse_flush() has cleared the PMD from the page
> >>> tables. The PTE table we are updating is only reachable through the saved
> >>> old PMD value, _pmd, until pmd_populate() below.
> >>>
> >>> map_anon_folio_pte_nopf() does set_ptes() and then calls
> >>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> >>> that hook as:
> >>>
> >>> "
> >>> 	At the end of every page fault, this routine is invoked to tell
> >>> 	the architecture specific code that translations now exists
> >>> 	in the software page tables for address space "vma->vm_mm"
> >>> 	at virtual address "address" for "nr" consecutive pages.
> >>> "
> >>>
> >>> But that does not seem true here yet, since the PTE table is not
> >>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> >>>
> >>> Should we avoid calling update_mmu_cache_range() until after the PTE
> >>> table is reinstalled with pmd_populate()?
> >>
> >> I recall that update_mmu_cache* users mostly care about updating folios flags,
> >> for the folio derived from the PTE ... or flushing caches for the user address.
> >>
> >> So intuitively I would say "the architecture code doesn't care that the PMD
> >> table will only be visible to HW shortly after". The important thing should be
> >> that it will definitely happen, and that nothing else is currently there or can be
> >> there?
> >
> > Ah, fair point.
> >
> > I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> > than only using the pte pointer passed in. For example, mips does:
>
> Right, a re-walk would be the real problem.
>
> >
> >   update_mmu_cache_range()
> >     -> __update_tlb()
> >       -> pgd_offset(vma->vm_mm, address)
> >       -> pte_offset_map(...)
> >
> > and __update_tlb() has this assumption:
> >
> > 		/*
> > 		 * update_mmu_cache() is called between pte_offset_map_lock()
> > 		 * and pte_unmap_unlock(), so we can assume that ptep is not
> > 		 * NULL here: and what should be done below if it were NULL?
> > 		 */
> >
> > So if khugepaged happens to run with current->active_mm == vma->vm_mm
> > here, could __update_tlb() hit the none PMD, get NULL from

I really wish people would say Pxx _entry_ :) so confusing.

> > pte_offset_map(), and then dereference it?
>
> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> even spells that out. :(
>
> Do you know about other code like that, or is MIPS the only one doing a
> re-walk and crossing fingers?
>
> >
> > Just wanted to raise it since some arch code may still have assumptions
> > like this, and the always-enable-mTHP work is getting closer ...
>
> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
>
> >
> > Probably very very very hard to hit, though :)
>
> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>
> Maybe we could reinstall the page table with the cleared (none) entries while
> still holding the PTL?

You mean the cleared PTE entries that are to be updated with the collapsed
larger folio?

>
> Thinking out loud:

After staring at this long enough, this does seem like a viable solution yes.

I hate how subtle this is.

>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ba298d420b7..e39b750b1e6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>                 map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>         } else {
>                 /*
> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> -                * to be none. The pmd entry is then repopulated below.
> +                * Re-insert the page table with the cleared entries, but
> +                * hold the PTL, such that no one can mess with the re-installed
> +                * page table until we updated the temporarily-cleared entries
> +                * through map_anon_folio_pte_nopf().
>                  */

You may say nit, but, I think we should be clearly stating the problem here. Yes
we want to hold the PTL to stop anybody else messing with it yet, but we're
really doing this because of:

map_anon_folio_pte_nopf
-> update_mmu_cache_range
-> rewalk
-> try to look up an entry that's not yet actually installed
-> bang

Right?

So maybe something like:

	Re-insert the PMD entry pointing to the PTE page table with cleared
	entries first, because map_anon_folio_pte_nopf() invokes
	update_mmu_cache_range() which may cause a rewalk of the page tables and
	blow up if the supplied PTE entry belongs to a PTE table that is not yet
	present there.

	We hold the PTE PTL to avoid anything else messing with this until we're
	ready.
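
The failure chain above can be modelled in userspace with a one-slot "PMD". This is only a sketch under stated assumptions: every `toy_` name is hypothetical, and where the real MIPS hook would dereference a NULL pointer, the model simply reports failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTRS_PER_PTE 4

struct toy_pte_table {
	unsigned long pte[TOY_PTRS_PER_PTE];
};

/* One-slot "PMD": a NULL pmd_entry models a none PMD entry. */
struct toy_mm {
	struct toy_pte_table *pmd_entry;
};

/* Re-walk from the mm; returns NULL if the PTE table is not reachable. */
static unsigned long *toy_rewalk(struct toy_mm *mm, size_t idx)
{
	if (!mm->pmd_entry)
		return NULL;
	return &mm->pmd_entry->pte[idx];
}

/* Hypothetical hook: succeeds only if the re-walk finds the entry just set. */
static bool toy_update_mmu_cache(struct toy_mm *mm, size_t idx,
				 unsigned long *ptep)
{
	unsigned long *walked = toy_rewalk(mm, idx);

	return walked != NULL && walked == ptep;
}

/* Old order: set the entry while the PMD entry is none, then repopulate. */
static bool toy_map_then_populate(struct toy_mm *mm, struct toy_pte_table *t)
{
	bool ok;

	t->pte[0] = 1;
	ok = toy_update_mmu_cache(mm, 0, &t->pte[0]);
	mm->pmd_entry = t;
	return ok;
}

/* Proposed order: repopulate first, then set the entry and call the hook. */
static bool toy_populate_then_map(struct toy_mm *mm, struct toy_pte_table *t)
{
	mm->pmd_entry = t;
	t->pte[0] = 1;
	return toy_update_mmu_cache(mm, 0, &t->pte[0]);
}
```

In the model, `toy_map_then_populate()` fails because the re-walk cannot reach the table yet, while `toy_populate_then_map()` succeeds. The locking that keeps other walkers out in the real code is deliberately left out here.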


> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */

(I guess better to comment on the smp_wmb() stuff in the other message about
this.)

> +               if (pte_ptl != pmd_ptl)
> +                       spin_lock(pte_ptl);

(Obviously should be spin_lock_nested() as David says later)

It seems a bit weird to me that we acquire the PTE lock:

	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);

Clear out the mTHP entries we're going to remove:

	if (pte) {
		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
						      order, &compound_pagelist);

Then unlock the PTE:

		spin_unlock(pte_ptl);

Before again reacquiring here, especially given this is an unreachable PTE
table.

But then again not doing that would require us to add some error handling logic
to unlock again so it's probably not vital.

>                 pmd_populate(mm, pmd, pmd_pgtable(_pmd));

So we're protecting against concurrent rmap and fault handlers with the PTL such
that installing this is safe right?

Are we good against GUP fast? I guess a race will be fine with that, or will it?
I suppose before it would have skipped the range entirely because of the missing
PMD entry anyway.

(in any case we also hold anon_vma write lock too.)

> +               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +               if (pte_ptl != pmd_ptl)
> +                       spin_unlock(pte_ptl);
>         }
>         spin_unlock(pmd_ptl);
>
>
>
> --
> Cheers,
>
> David

Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
                     ` (2 preceding siblings ...)
  2026-05-31  9:39   ` Lance Yang
@ 2026-06-04 10:21   ` Lorenzo Stoakes
  2026-06-04 10:32     ` Nico Pache
  2026-06-04 11:38   ` Lorenzo Stoakes
  4 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 10:21 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
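
For concreteness, the address arithmetic this patch introduces (`pmd_addr` and `end_addr` derived from `start_addr` and `order`) can be sketched in userspace. The 4 KiB page size and order-9 PMD are assumptions typical of x86-64, and every `toy_`/`TOY_` name is invented for the sketch rather than taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4 KiB base pages and 2 MiB PMD ranges (PMD order 9). */
#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(UINT64_C(1) << TOY_PAGE_SHIFT)
#define TOY_PMD_ORDER	9
#define TOY_PMD_SIZE	(TOY_PAGE_SIZE << TOY_PMD_ORDER)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

/* Start of the PMD range containing start_addr (start_addr & HPAGE_PMD_MASK). */
static uint64_t toy_pmd_addr(uint64_t start_addr)
{
	return start_addr & TOY_PMD_MASK;
}

/* End of the collapse range (start_addr + (PAGE_SIZE << order)). */
static uint64_t toy_end_addr(uint64_t start_addr, unsigned int order)
{
	return start_addr + (TOY_PAGE_SIZE << order);
}

/* Offset of start_addr within its PMD range, in base pages. */
static uint64_t toy_page_offset_in_pmd(uint64_t start_addr)
{
	return (start_addr - toy_pmd_addr(start_addr)) >> TOY_PAGE_SHIFT;
}
```

Under these assumptions, an order-4 (64 KiB) collapse starting at 0x7f0000250000 lies in the PMD range that begins at 0x7f0000200000, starts 80 base pages into that range, and ends at 0x7f0000260000.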
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>   * while allocating a THP, as that could trigger direct reclaim/compaction.
>   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>   */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -		int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned int order)
>  {
> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL;

Hmm, this part of the patch wasn't taken, and now we have uninitialised state
being dereferenced (see [0])

[0]:https://lore.kernel.org/all/aiFO1RlpZ7Ki44y1@lucifer/

Did a review comment here somehow cause this to be changed in the patch?

Andrew - was there an error in applying the patch somehow?

Thanks, Lorenzo

>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	enum scan_result result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
>
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
>  	}
>
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> +						     referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;
>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);
>  		spin_lock(pmd_ptl);
> -		BUG_ON(!pmd_none(*pmd));
> +		WARN_ON_ONCE(!pmd_none(*pmd));
>  		/*
>  		 * We can only use set_pmd_at when establishing
>  		 * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> +	 * removed and not all pages are isolated and locked, so we must hold
> +	 * the lock to prevent neighboring folios from attempting to access
> +	 * this PMD until its reinstalled.
>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (is_pmd_order(order)) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, start_addr, pte_ptl,
> +					   order, &compound_pagelist);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
>  	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +	WARN_ON_ONCE(!pmd_none(*pmd));
> +	if (is_pmd_order(order)) {
> +		pgtable = pmd_pgtable(_pmd);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> +	} else {
> +		/*
> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
> +		 * to be none. The pmd entry is then repopulated below.
> +		 */
> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +	}
>  	spin_unlock(pmd_ptl);
>
>  	folio = NULL;
>
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
>  	if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, HPAGE_PMD_ORDER);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
>  	}
> --
> 2.54.0
>

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 10:21   ` Lorenzo Stoakes
@ 2026-06-04 10:32     ` Nico Pache
  0 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-04 10:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Thu, Jun 4, 2026 at 4:22 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates where in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > access/changes to the page tables. This can happen if the rmap walkers hit
> > a pmd_none while the PMD entry is currently unavailable due to being
> > temporarily removed during the collapse phase.
> >
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> >  1 file changed, 55 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fab35d318641..d64f42f66236 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> >   */
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -             int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > +             int referenced, int unmapped, struct collapse_control *cc,
> > +             unsigned int order)
> >  {
> > +     const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > +     const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte = NULL;
>
> Hmm, this part of the patch wasn't taken, and now we have uninitialised state
> being dereferenced (see [0])

Good catch, I was just looking at your report and wondering what
happened there. Hopefully v19 gets applied correctly :)

-- Nico

>
> [0]:https://lore.kernel.org/all/aiFO1RlpZ7Ki44y1@lucifer/
>
> Did a review comment here somehow cause this to be changed in the patch?
>
> Andrew - was there an error in applying the patch somehow?
>
> Thanks, Lorenzo
>
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       enum scan_result result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     bool anon_vma_locked = false;
> >
> > -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +                                      &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> >       }
> >
> > -     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +     result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                                                  referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > +                                                  referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +                                      &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> >       vma_start_write(vma);
> > -     result = check_pmd_still_valid(mm, address, pmd);
> > +     result = check_pmd_still_valid(mm, pmd_addr, pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >
> >       anon_vma_lock_write(vma->anon_vma);
> > +     anon_vma_locked = true;
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > +                             end_addr);
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * Parallel GUP-fast is fine since GUP-fast will back off when
> >        * it detects PMD is changed.
> >        */
> > -     _pmd = pmdp_collapse_flush(vma, address, pmd);
> > +     _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> >       spin_unlock(pmd_ptl);
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                   HPAGE_PMD_ORDER,
> > -                                                   &compound_pagelist);
> > +             result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > +                                                   order, &compound_pagelist);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_NO_PTE_TABLE;
> >       }
> >
> >       if (unlikely(result != SCAN_SUCCEED)) {
> > -             if (pte)
> > -                     pte_unmap(pte);
> >               spin_lock(pmd_ptl);
> > -             BUG_ON(!pmd_none(*pmd));
> > +             WARN_ON_ONCE(!pmd_none(*pmd));
> >               /*
> >                * We can only use set_pmd_at when establishing
> >                * hugepmds and never for establishing regular pmds that
> > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >                */
> >               pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >               spin_unlock(pmd_ptl);
> > -             anon_vma_unlock_write(vma->anon_vma);
> >               goto out_up_write;
> >       }
> >
> >       /*
> > -      * All pages are isolated and locked so anon_vma rmap
> > -      * can't run anymore.
> > +      * For PMD collapse all pages are isolated and locked so anon_vma
> > +      * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > +      * removed and not all pages are isolated and locked, so we must hold
> > +      * the lock to prevent neighboring folios from attempting to access
> > +      * this PMD until its reinstalled.
> >        */
> > -     anon_vma_unlock_write(vma->anon_vma);
> > +     if (is_pmd_order(order)) {
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +             anon_vma_locked = false;
> > +     }
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        HPAGE_PMD_ORDER,
> > -                                        &compound_pagelist);
> > -     pte_unmap(pte);
> > +                                        vma, start_addr, pte_ptl,
> > +                                        order, &compound_pagelist);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> >
> > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> >       spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > +     WARN_ON_ONCE(!pmd_none(*pmd));
> > +     if (is_pmd_order(order)) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > +     } else {
> > +             /*
> > +              * set_ptes is called in map_anon_folio_pte_nopf with the
> > +              * pmd_ptl lock still held; this is safe as the PMD is expected
> > +              * to be none. The pmd entry is then repopulated below.
> > +              */
> > +             map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +     }
> >       spin_unlock(pmd_ptl);
> >
> >       folio = NULL;
> >
> >       result = SCAN_SUCCEED;
> >  out_up_write:
> > +     if (anon_vma_locked)
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +     if (pte)
> > +             pte_unmap(pte);
> >       mmap_write_unlock(mm);
> >  out_nolock:
> >       if (folio)
> > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >               result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc);
> > +                                         unmapped, cc, HPAGE_PMD_ORDER);
> >               /* collapse_huge_page will return with the mmap_lock released */
> >               *lock_dropped = true;
> >       }
> > --
> > 2.54.0
> >
>


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
                     ` (3 preceding siblings ...)
  2026-06-04 10:21   ` Lorenzo Stoakes
@ 2026-06-04 11:38   ` Lorenzo Stoakes
  2026-06-04 12:39     ` Lorenzo Stoakes
  4 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 11:38 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

I will go review the thread about the cache maintenance separately and
respond about that.

On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>

The logic LGTM generally, some questions for understanding below, and of
course as per above I want to review the Lance/David subthread.

Thanks!
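
For readers following along, the order/start address relationship described in the commit message can be made concrete with a small userspace sketch. It mirrors the two expressions the patch adds (`start_addr & HPAGE_PMD_MASK` and `start_addr + (PAGE_SIZE << order)`), but the `SKETCH_*` constants and helper names are hypothetical and assume 4 KiB pages with a 2 MiB PMD; they are not kernel definitions.

```c
/* Assumed geometry for illustration only: 4 KiB pages, 2 MiB PMD. */
#define SKETCH_PAGE_SIZE  (1UL << 12)
#define SKETCH_PMD_ORDER  9
#define SKETCH_PMD_SIZE   (SKETCH_PAGE_SIZE << SKETCH_PMD_ORDER)
#define SKETCH_PMD_MASK   (~(SKETCH_PMD_SIZE - 1))

/* Base of the PMD-sized region that contains start_addr. */
static unsigned long sketch_pmd_addr(unsigned long start_addr)
{
	return start_addr & SKETCH_PMD_MASK;
}

/* First address past the range covered by a collapse of this order. */
static unsigned long sketch_end_addr(unsigned long start_addr,
				     unsigned int order)
{
	return start_addr + (SKETCH_PAGE_SIZE << order);
}
```

With these assumptions, an order-4 attempt starting 16 pages into a PMD covers 64 KiB, while the PMD-level helpers keep operating on the aligned base address.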

> ---
>  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>   * while allocating a THP, as that could trigger direct reclaim/compaction.
>   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>   */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -		int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned int order)
>  {
> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL;

As mentioned elsewhere for some reason this was dropped in
mm-unstable. Maybe a bad conflict resolution?
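
To spell out why the initialiser matters: the patch moves the unmap to the shared exit label, so any early `goto` taken before `pte` is assigned still reaches the `if (pte)` check. A minimal userspace sketch of that shape follows; `fake_collapse`, `fake_unmap` and `unmap_calls` are hypothetical stand-ins, not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for pte_unmap(); counts how often it runs. */
static int unmap_calls;

static void fake_unmap(int *mapping)
{
	(void)mapping;
	unmap_calls++;
}

/*
 * Mirrors the shape of the patched exit path: the unmap lives at the
 * shared label, so the pointer must be NULL on every path that jumps
 * there before the mapping step has run.
 */
static int fake_collapse(int fail_early)
{
	static int table;
	int *pte = NULL;	/* without this, the early goto tests garbage */
	int result = -1;

	if (fail_early)
		goto out;	/* taken before pte is ever assigned */

	pte = &table;		/* stands in for pte_offset_map_lock() */
	result = 0;
out:
	if (pte)
		fake_unmap(pte);
	return result;
}
```

Dropping the `= NULL` in this sketch leaves the early-exit path reading an indeterminate pointer, which appears to be the class of bug referenced in the report.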

>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	enum scan_result result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
>
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
>  	}
>
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> +						     referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;

I worry that we hold this lock a lot longer now? Maybe the algorithmic
change alters that, but Claude did suggest on the s390 bug that a longer
lock hold time might be an issue.

I wonder if we'll observe lock contention as a result?

Correct me if I'm wrong and we're not holding longer than previously,
however. Just appears that we do.
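
Reading only the hunks quoted in this mail, the hold time looks unchanged for PMD order (the unlock still precedes the copy) and longer for smaller orders (the unlock moves to the shared exit label, after the copy and the PTE mapping). The userspace sketch below just records that event order; `sketch_collapse`, `SKETCH_PMD_ORDER` and the event letters are made up for illustration and say nothing about actual contention.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PMD_ORDER 9	/* assumption: 4 KiB pages, 2 MiB PMD */

/*
 * Writes the order of events into 'events' (caller supplies >= 8 bytes):
 * 'L' = lock anon_vma, 'C' = copy pages, 'M' = map folio, 'U' = unlock.
 */
static void sketch_collapse(unsigned int order, char *events)
{
	bool anon_vma_locked;

	events[0] = '\0';
	strcat(events, "L");
	anon_vma_locked = true;

	/* PMD order: all pages are isolated, so unlock before copying. */
	if (order == SKETCH_PMD_ORDER) {
		strcat(events, "U");
		anon_vma_locked = false;
	}

	strcat(events, "C");
	strcat(events, "M");

	/* Shared exit label in the patch: unlock only if still held. */
	if (anon_vma_locked)
		strcat(events, "U");
}
```

Under this reading, any extra contention would only show up for mTHP-order collapses, where the lock now also covers the copy.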

>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);

OK I seem to remember this is because we're holding the anon_vma lock
longer. That does imply that on e.g. x86-64 the RCU lock is being held a
bit longer as well as the anon_vma lock.

I guess it's also that we need to hold the anon_vma and pte lock, since
we're fiddling around at PTE level for mTHP, not just at PMD level as
'classic' THP did.

(Rememberings going on here :)

>  		spin_lock(pmd_ptl);
> -		BUG_ON(!pmd_none(*pmd));
> +		WARN_ON_ONCE(!pmd_none(*pmd));
>  		/*
>  		 * We can only use set_pmd_at when establishing
>  		 * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> +	 * removed and not all pages are isolated and locked, so we must hold

Right, because some PTE entries will be unaffected by the change.

> +	 * the lock to prevent neighboring folios from attempting to access
> +	 * this PMD until its reinstalled.

OK. This is slightly annoying for my CoW context work as it means there's
another case where we need to explicitly hold an anon_vma lock for
correctness :)

Anyway I will think about that separately, it is what it is. And in fact
it motivates me to want this merged earlier so I can work against it :)


>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (is_pmd_order(order)) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, start_addr, pte_ptl,
> +					   order, &compound_pagelist);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
>  	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +	WARN_ON_ONCE(!pmd_none(*pmd));
> +	if (is_pmd_order(order)) {
> +		pgtable = pmd_pgtable(_pmd);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> +	} else {
> +		/*
> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> +		 * pmd_ptl lock still held; this is safe as the PMD is expected

PMD entry you mean?

> +		 * to be none. The pmd entry is then repopulated below.
> +		 */
> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);

So here we populate entries in the existing PTE _table_ to point at the new
order>0 folio? With arm64 of course doing transparent contpte stuff?

> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));

And then we reinstall the pre-existing PMD _entry_ from none -> what it was
before?
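
The ordering being asked about (fill the PTE table, barrier, then make the table reachable again via the PMD entry) is a publish pattern. A hedged userspace analogue with C11 atomics is sketched below; the release store stands in for `smp_wmb()` plus `pmd_populate()`, and the acquire load is a userspace simplification rather than a claim about how kernel page-table walkers order their reads. All `sketch_*` names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NR_ENTRIES 8	/* hypothetical; a real PTE table is larger */

struct sketch_table {
	unsigned long entry[SKETCH_NR_ENTRIES];
};

/* Stands in for the PMD entry: NULL means "none", else points at the table. */
static _Atomic(struct sketch_table *) sketch_pmd;

/* Writer: fill the entries first, then publish the table pointer. */
static void sketch_publish(struct sketch_table *t, unsigned long base)
{
	for (size_t i = 0; i < SKETCH_NR_ENTRIES; i++)
		t->entry[i] = base + i;		/* analogous to set_ptes() */

	/* Release store plays the role of smp_wmb() + pmd_populate(). */
	atomic_store_explicit(&sketch_pmd, t, memory_order_release);
}

/* Reader: returns 0 while the "PMD" is none, else the entry value. */
static unsigned long sketch_lookup(size_t idx)
{
	struct sketch_table *t =
		atomic_load_explicit(&sketch_pmd, memory_order_acquire);

	return t ? t->entry[idx] : 0;
}
```

A reader that observes the published pointer is then guaranteed to observe the entries written before it, which is the property the comment next to `smp_wmb()` describes.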

> +	}
>  	spin_unlock(pmd_ptl);
>
>  	folio = NULL;
>
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
>  	if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, HPAGE_PMD_ORDER);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
>  	}
> --
> 2.54.0
>

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 11:38   ` Lorenzo Stoakes
@ 2026-06-04 12:39     ` Lorenzo Stoakes
  2026-06-04 12:45       ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 12:39 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> I will go review the thread about the cache maintenance separately and
> respond about that.
>
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates where in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > access/changes to the page tables. This can happen if the rmap walkers hit
> > a pmd_none while the PMD entry is currently unavailable due to being
> > temporarily removed during the collapse phase.
> >
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> The logic LGTM generally, some questions for understanding below, and of
> course as per above I want to review the Lance/David subthread.
>
> Thanks!
>
> > ---
> >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> >  1 file changed, 55 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fab35d318641..d64f42f66236 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> >   */
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -		int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > +		int referenced, int unmapped, struct collapse_control *cc,
> > +		unsigned int order)
> >  {
> > +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> >  	LIST_HEAD(compound_pagelist);
> >  	pmd_t *pmd, _pmd;
> > -	pte_t *pte;
> > +	pte_t *pte = NULL;
>
> As mentioned elsewhere for some reason this was dropped in
> mm-unstable. Maybe a bad conflict resolution?
>
> >  	pgtable_t pgtable;
> >  	struct folio *folio;
> >  	spinlock_t *pmd_ptl, *pte_ptl;
> >  	enum scan_result result = SCAN_FAIL;
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> > +	bool anon_vma_locked = false;
> >
> > -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -
> > -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +	result = alloc_charge_folio(&folio, mm, cc, order);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_nolock;
> >
> >  	mmap_read_lock(mm);
> > -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -					 HPAGE_PMD_ORDER);
> > +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +					 &vma, cc, order);
> >  	if (result != SCAN_SUCCEED) {
> >  		mmap_read_unlock(mm);
> >  		goto out_nolock;
> >  	}
> >
> > -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> >  	if (result != SCAN_SUCCEED) {
> >  		mmap_read_unlock(mm);
> >  		goto out_nolock;
> > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  		 * released when it fails. So we jump out_nolock directly in
> >  		 * that case.  Continuing to collapse causes inconsistency.
> >  		 */
> > -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -						     referenced, HPAGE_PMD_ORDER);
> > +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > +						     referenced, order);
> >  		if (result != SCAN_SUCCEED)
> >  			goto out_nolock;
> >  	}
> > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  	 * mmap_lock.
> >  	 */
> >  	mmap_write_lock(mm);
> > -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -					 HPAGE_PMD_ORDER);
> > +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +					 &vma, cc, order);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_up_write;
> >  	/* check if the pmd is still valid */
> >  	vma_start_write(vma);

Hmm actually I think we have another problem here.

For PMD THP this is fine. Only a single VMA can span the range we need, and it
will span the entire PMD.

But for mTHP we have an issue...

See below...

> > -	result = check_pmd_still_valid(mm, address, pmd);
> > +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_up_write;
> >
> >  	anon_vma_lock_write(vma->anon_vma);
> > +	anon_vma_locked = true;
>
> I worry that we hold this lock a lot longer now? Maybe the algorithmic
> change alters that, but Claude did suggest on the s390 bug that longer lock
> hold might be an issue.
>
> I wonder if we'll observe lock contention as a result?
>
> Correct me if I'm wrong and we're not holding longer than previously,
> however. Just appears that we do.
>
> >
> > -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -				address + HPAGE_PMD_SIZE);
> > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > +				end_addr);
> >  	mmu_notifier_invalidate_range_start(&range);
> >
> >  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  	 * Parallel GUP-fast is fine since GUP-fast will back off when
> >  	 * it detects PMD is changed.
> >  	 */
> > -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> > +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);

...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
but for mTHP we might have _another_ VMA that spans another part of the range
mapped by the same PMD entry.

So we clear this, but we do not have a write lock on any other VMA, and so
racing VMA read locks can install a new PMD entry.

> >  	spin_unlock(pmd_ptl);

Especially since you unlock this :)

And...

> >  	mmu_notifier_invalidate_range_end(&range);
> >  	tlb_remove_table_sync_one();
> >
> > -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> >  	if (pte) {
> > -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -						      HPAGE_PMD_ORDER,
> > -						      &compound_pagelist);
> > +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > +						      order, &compound_pagelist);
> >  		spin_unlock(pte_ptl);
> >  	} else {
> >  		result = SCAN_NO_PTE_TABLE;
> >  	}
> >
> >  	if (unlikely(result != SCAN_SUCCEED)) {
> > -		if (pte)
> > -			pte_unmap(pte);
>
> OK I seem to remember this is because we're holding the anon_vma lock
> longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > bit longer also, as well as the anon_vma lock.
>
> I guess it's also because we need to hold anon_vma and pte lock because
> we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> THP did.
>
> (Rememberings going on here :)
>
> >  		spin_lock(pmd_ptl);
> > -		BUG_ON(!pmd_none(*pmd));
> > +		WARN_ON_ONCE(!pmd_none(*pmd));

...this will get triggered.

I don't know whether we can safely hold the PMD lock across everything here for
mTHP?

Maybe the solution would have to be to scan through VMAs in the range of the PMD
and VMA write lock each of them?

That could cause some 'interesting' lock contention issues though? Then again,
we will be releasing the mmap write lock soon enough which will drop the VMA
write locks.
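
To make that idea concrete, here is a rough userspace model of "write lock every VMA overlapping the PMD range". It is only a sketch and not kernel code: struct sketch_vma, sketch_lock_vmas_in_pmd() and the 2 MiB PMD size are assumptions made up for illustration, and the flag merely stands in for a real VMA write lock.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: 2 MiB PMD coverage (4 KiB base pages). */
#define SKETCH_PMD_SIZE (1UL << 21)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical stand-in for a VMA: a half-open range plus a lock flag. */
struct sketch_vma {
	unsigned long start;
	unsigned long end;
	int write_locked;
};

/*
 * Mark every range that overlaps the PMD-aligned window containing addr
 * as write-locked, and return how many ranges were marked.
 */
static size_t sketch_lock_vmas_in_pmd(struct sketch_vma *vmas, size_t n,
				      unsigned long addr)
{
	unsigned long pmd_start = addr & SKETCH_PMD_MASK;
	unsigned long pmd_end = pmd_start + SKETCH_PMD_SIZE;
	size_t locked = 0;

	for (size_t i = 0; i < n; i++) {
		assert(vmas[i].start < vmas[i].end);
		/* Half-open interval overlap test. */
		if (vmas[i].start < pmd_end && vmas[i].end > pmd_start) {
			vmas[i].write_locked = 1;
			locked++;
		}
	}
	return locked;
}
```

The cost is proportional to the number of VMAs sharing the PMD range, which is where the contention worry comes from.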

> >  		/*
> >  		 * We can only use set_pmd_at when establishing
> >  		 * hugepmds and never for establishing regular pmds that
> > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  		 */
> >  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >  		spin_unlock(pmd_ptl);
> > -		anon_vma_unlock_write(vma->anon_vma);
> >  		goto out_up_write;
> >  	}
> >
> >  	/*
> > -	 * All pages are isolated and locked so anon_vma rmap
> > -	 * can't run anymore.
> > +	 * For PMD collapse all pages are isolated and locked so anon_vma
> > +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > +	 * removed and not all pages are isolated and locked, so we must hold
>
> > Right because some PTE entries will be unaffected by the change.
>
> > +	 * the lock to prevent neighboring folios from attempting to access
> > +	 * this PMD until its reinstalled.
>
> OK. This is slightly annoying for my CoW context work as it means there's
> another case where we need to explicitly hold an anon_vma lock for
> correctness :)
>
> Anyway I will think about that separately, is what it is. And in fact
> motivates to want this merged earlier so I can work against it :)
>
>
> >  	 */
> > -	anon_vma_unlock_write(vma->anon_vma);
> > +	if (is_pmd_order(order)) {
> > +		anon_vma_unlock_write(vma->anon_vma);
> > +		anon_vma_locked = false;
> > +	}
> >
> >  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -					   vma, address, pte_ptl,
> > -					   HPAGE_PMD_ORDER,
> > -					   &compound_pagelist);
> > -	pte_unmap(pte);
> > +					   vma, start_addr, pte_ptl,
> > +					   order, &compound_pagelist);
> >  	if (unlikely(result != SCAN_SUCCEED))
> >  		goto out_up_write;
> >
> > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  	 * write.
> >  	 */
> >  	__folio_mark_uptodate(folio);
> > -	pgtable = pmd_pgtable(_pmd);
> > -
> >  	spin_lock(pmd_ptl);
> > -	BUG_ON(!pmd_none(*pmd));
> > -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > +	WARN_ON_ONCE(!pmd_none(*pmd));
> > +	if (is_pmd_order(order)) {
> > +		pgtable = pmd_pgtable(_pmd);
> > +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > +	} else {
> > +		/*
> > +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> > +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>
> PMD entry you mean?
>
> > +		 * to be none. The pmd entry is then repopulated below.
> > +		 */
> > +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>
> So here we populate entries in the existing PTE _table_ to point at the new
> order>0 folio? With arm64 of course doing transparent contpte stuff?
>
> > +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> before?
>
> > +	}
> >  	spin_unlock(pmd_ptl);
> >
> >  	folio = NULL;
> >
> >  	result = SCAN_SUCCEED;
> >  out_up_write:
> > +	if (anon_vma_locked)
> > +		anon_vma_unlock_write(vma->anon_vma);
> > +	if (pte)
> > +		pte_unmap(pte);
> >  	mmap_write_unlock(mm);
> >  out_nolock:
> >  	if (folio)
> > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >  		/* collapse_huge_page expects the lock to be dropped before calling */
> >  		mmap_read_unlock(mm);
> >  		result = collapse_huge_page(mm, start_addr, referenced,
> > -					    unmapped, cc);
> > +					    unmapped, cc, HPAGE_PMD_ORDER);
> >  		/* collapse_huge_page will return with the mmap_lock released */
> >  		*lock_dropped = true;
> >  	}
> > --
> > 2.54.0
> >

Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 12:39     ` Lorenzo Stoakes
@ 2026-06-04 12:45       ` Nico Pache
  2026-06-04 12:55         ` Lorenzo Stoakes
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-04 12:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > I will go review the thread about the cache maintenance separately and
> > respond about that.
> >
> > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates where in the PMD to
> > > start the collapse attempt.
> > >
> > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > the mTHP case this is not true, and we must keep the lock to prevent
> > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > a pmd_none while the PMD entry is currently unavailable due to being
> > > temporarily removed during the collapse phase.
> > >
> > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> >
> > The logic LGTM generally, some questions for understanding below, and of
> > course as per above I want to review the Lance/David subthread.
> >
> > Thanks!
> >
> > > ---
> > >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > >  1 file changed, 55 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index fab35d318641..d64f42f66236 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> > >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > >   */
> > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > -           int referenced, int unmapped, struct collapse_control *cc)
> > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > +           int referenced, int unmapped, struct collapse_control *cc,
> > > +           unsigned int order)
> > >  {
> > > +   const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > +   const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > >     LIST_HEAD(compound_pagelist);
> > >     pmd_t *pmd, _pmd;
> > > -   pte_t *pte;
> > > +   pte_t *pte = NULL;
> >
> > As mentioned elsewhere for some reason this was dropped in
> > mm-unstable. Maybe a bad conflict resolution?
> >
> > >     pgtable_t pgtable;
> > >     struct folio *folio;
> > >     spinlock_t *pmd_ptl, *pte_ptl;
> > >     enum scan_result result = SCAN_FAIL;
> > >     struct vm_area_struct *vma;
> > >     struct mmu_notifier_range range;
> > > +   bool anon_vma_locked = false;
> > >
> > > -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > -
> > > -   result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > +   result = alloc_charge_folio(&folio, mm, cc, order);
> > >     if (result != SCAN_SUCCEED)
> > >             goto out_nolock;
> > >
> > >     mmap_read_lock(mm);
> > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                    HPAGE_PMD_ORDER);
> > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > +                                    &vma, cc, order);
> > >     if (result != SCAN_SUCCEED) {
> > >             mmap_read_unlock(mm);
> > >             goto out_nolock;
> > >     }
> > >
> > > -   result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > +   result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > >     if (result != SCAN_SUCCEED) {
> > >             mmap_read_unlock(mm);
> > >             goto out_nolock;
> > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >              * released when it fails. So we jump out_nolock directly in
> > >              * that case.  Continuing to collapse causes inconsistency.
> > >              */
> > > -           result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > -                                                referenced, HPAGE_PMD_ORDER);
> > > +           result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > +                                                referenced, order);
> > >             if (result != SCAN_SUCCEED)
> > >                     goto out_nolock;
> > >     }
> > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >      * mmap_lock.
> > >      */
> > >     mmap_write_lock(mm);
> > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                    HPAGE_PMD_ORDER);
> > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > +                                    &vma, cc, order);
> > >     if (result != SCAN_SUCCEED)
> > >             goto out_up_write;
> > >     /* check if the pmd is still valid */
> > >     vma_start_write(vma);
>
> Hmm actually I think we have another problem here.
>
> For PMD THP this is fine. Only a single VMA can span the range we need, and it
> will span the entire PMD.
>
> But for mTHP we have an issue...
>
> See below...
>
> > > -   result = check_pmd_still_valid(mm, address, pmd);
> > > +   result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > >     if (result != SCAN_SUCCEED)
> > >             goto out_up_write;
> > >
> > >     anon_vma_lock_write(vma->anon_vma);
> > > +   anon_vma_locked = true;
> >
> > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > change alters that, but Claude did suggest on the s390 bug that longer lock
> > hold might be an issue.
> >
> > I wonder if we'll observe lock contention as a result?
> >
> > Correct me if I'm wrong and we're not holding longer than previously,
> > however. Just appears that we do.
> >
> > >
> > > -   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > -                           address + HPAGE_PMD_SIZE);
> > > +   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > +                           end_addr);
> > >     mmu_notifier_invalidate_range_start(&range);
> > >
> > >     pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >      * Parallel GUP-fast is fine since GUP-fast will back off when
> > >      * it detects PMD is changed.
> > >      */
> > > -   _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > +   _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>
> ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> but for mTHP we might have _another_ VMA that spans another part of the range
> mapped by the same PMD entry.
>
> So we clear this, but we do not have a write lock on any other VMA, and so
> racing VMA read locks can install a new PMD entry.
>
> > >     spin_unlock(pmd_ptl);
>
> Especially since you unlock this :)
>
> And...
>
> > >     mmu_notifier_invalidate_range_end(&range);
> > >     tlb_remove_table_sync_one();
> > >
> > > -   pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > +   pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > >     if (pte) {
> > > -           result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > -                                                 HPAGE_PMD_ORDER,
> > > -                                                 &compound_pagelist);
> > > +           result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > +                                                 order, &compound_pagelist);
> > >             spin_unlock(pte_ptl);
> > >     } else {
> > >             result = SCAN_NO_PTE_TABLE;
> > >     }
> > >
> > >     if (unlikely(result != SCAN_SUCCEED)) {
> > > -           if (pte)
> > > -                   pte_unmap(pte);
> >
> > OK I seem to remember this is because we're holding the anon_vma lock
> > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > bit longer also, as well as the anon_vma lock.
> >
> > I guess it's also because we need to hold anon_vma and pte lock because
> > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > THP did.
> >
> > (Rememberings going on here :)
> >
> > >             spin_lock(pmd_ptl);
> > > -           BUG_ON(!pmd_none(*pmd));
> > > +           WARN_ON_ONCE(!pmd_none(*pmd));
>
> ...this will get triggered.
>
> I don't know whether we can safely hold the PMD lock across everything here for
> mTHP?
>
> Maybe the solution would have to be to scan through VMAs in the range of the PMD
> and VMA write lock each of them?

I believe we've spoken about this before, but because we always make
sure the VMA spans the full PMD we won't ever hit this issue. If we
wanted to support mTHP collapse on regions smaller than a PMD, the
locking gets tricky (hence the design choice to not do that for now).

This is handled by the PMD_ORDER check in hugepage_vma_revalidate():

/* Always check the PMD order to ensure its not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
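
As a rough userspace illustration of what that check guarantees (a sketch only, not the kernel's implementation: vma_covers_pmd() and the SKETCH_* constants are hypothetical names, and a 4 KiB page with a 2 MiB PMD is assumed), the VMA has to cover the whole PMD-aligned range around the address:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: 4 KiB base pages, PMD order 9 (2 MiB). */
#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PMD_ORDER  9UL
#define SKETCH_PMD_SIZE   (1UL << (SKETCH_PAGE_SHIFT + SKETCH_PMD_ORDER))
#define SKETCH_PMD_MASK   (~(SKETCH_PMD_SIZE - 1))

/*
 * Hypothetical helper: true if the half-open range [vm_start, vm_end)
 * covers the entire PMD-aligned range that contains addr.
 */
static bool vma_covers_pmd(unsigned long vm_start, unsigned long vm_end,
			   unsigned long addr)
{
	unsigned long pmd_start = addr & SKETCH_PMD_MASK;
	unsigned long pmd_end = pmd_start + SKETCH_PMD_SIZE;

	assert(vm_start < vm_end);
	return vm_start <= pmd_start && pmd_end <= vm_end;
}
```

If that holds, no second VMA can map any part of the same PMD range, so write locking the one VMA is enough.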

-- Nico

>
> That could cause some 'interesting' lock contention issues though? Then again,
> we will be releasing the mmap write lock soon enough which will drop the VMA
> write locks.
>
> > >             /*
> > >              * We can only use set_pmd_at when establishing
> > >              * hugepmds and never for establishing regular pmds that
> > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >              */
> > >             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >             spin_unlock(pmd_ptl);
> > > -           anon_vma_unlock_write(vma->anon_vma);
> > >             goto out_up_write;
> > >     }
> > >
> > >     /*
> > > -    * All pages are isolated and locked so anon_vma rmap
> > > -    * can't run anymore.
> > > +    * For PMD collapse all pages are isolated and locked so anon_vma
> > > +    * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > +    * removed and not all pages are isolated and locked, so we must hold
> >
> > > Right because some PTE entries will be unaffected by the change.
> >
> > > +    * the lock to prevent neighboring folios from attempting to access
> > > +    * this PMD until its reinstalled.
> >
> > OK. This is slightly annoying for my CoW context work as it means there's
> > another case where we need to explicitly hold an anon_vma lock for
> > correctness :)
> >
> > Anyway I will think about that separately, is what it is. And in fact
> > motivates to want this merged earlier so I can work against it :)
> >
> >
> > >      */
> > > -   anon_vma_unlock_write(vma->anon_vma);
> > > +   if (is_pmd_order(order)) {
> > > +           anon_vma_unlock_write(vma->anon_vma);
> > > +           anon_vma_locked = false;
> > > +   }
> > >
> > >     result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > -                                      vma, address, pte_ptl,
> > > -                                      HPAGE_PMD_ORDER,
> > > -                                      &compound_pagelist);
> > > -   pte_unmap(pte);
> > > +                                      vma, start_addr, pte_ptl,
> > > +                                      order, &compound_pagelist);
> > >     if (unlikely(result != SCAN_SUCCEED))
> > >             goto out_up_write;
> > >
> > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >      * write.
> > >      */
> > >     __folio_mark_uptodate(folio);
> > > -   pgtable = pmd_pgtable(_pmd);
> > > -
> > >     spin_lock(pmd_ptl);
> > > -   BUG_ON(!pmd_none(*pmd));
> > > -   pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > -   map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > +   WARN_ON_ONCE(!pmd_none(*pmd));
> > > +   if (is_pmd_order(order)) {
> > > +           pgtable = pmd_pgtable(_pmd);
> > > +           pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > +           map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > +   } else {
> > > +           /*
> > > +            * set_ptes is called in map_anon_folio_pte_nopf with the
> > > +            * pmd_ptl lock still held; this is safe as the PMD is expected
> >
> > PMD entry you mean?
> >
> > > +            * to be none. The pmd entry is then repopulated below.
> > > +            */
> > > +           map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> >
> > So here we populate entries in the existing PTE _table_ to point at the new
> > order>0 folio? With arm64 of course doing transparent contpte stuff?
> >
> > > +           smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > +           pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >
> > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > before?
> >
> > > +   }
> > >     spin_unlock(pmd_ptl);
> > >
> > >     folio = NULL;
> > >
> > >     result = SCAN_SUCCEED;
> > >  out_up_write:
> > > +   if (anon_vma_locked)
> > > +           anon_vma_unlock_write(vma->anon_vma);
> > > +   if (pte)
> > > +           pte_unmap(pte);
> > >     mmap_write_unlock(mm);
> > >  out_nolock:
> > >     if (folio)
> > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >             /* collapse_huge_page expects the lock to be dropped before calling */
> > >             mmap_read_unlock(mm);
> > >             result = collapse_huge_page(mm, start_addr, referenced,
> > > -                                       unmapped, cc);
> > > +                                       unmapped, cc, HPAGE_PMD_ORDER);
> > >             /* collapse_huge_page will return with the mmap_lock released */
> > >             *lock_dropped = true;
> > >     }
> > > --
> > > 2.54.0
> > >
>
> Thanks, Lorenzo
>



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 12:45       ` Nico Pache
@ 2026-06-04 12:55         ` Lorenzo Stoakes
  2026-06-04 16:28           ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 12:55 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Thu, Jun 04, 2026 at 06:45:58AM -0600, Nico Pache wrote:
> On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > > I will go review the thread about the cache maintenance separately and
> > > respond about that.
> > >
> > > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > > > are attempting to collapse to, and offset indicates where in the PMD to
> > > > start the collapse attempt.
> > > >
> > > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > > the mTHP case this is not true, and we must keep the lock to prevent
> > > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > > a pmd_none while the PMD entry is currently unavailable due to being
> > > > temporarily removed during the collapse phase.
> > > >
> > > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > >
> > > The logic LGTM generally, some questions for understanding below, and of
> > > course as per above I want to review the Lance/David subthread.
> > >
> > > Thanks!
> > >
> > > > ---
> > > >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > >  1 file changed, 55 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index fab35d318641..d64f42f66236 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > >   */
> > > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > -           int referenced, int unmapped, struct collapse_control *cc)
> > > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > > +           int referenced, int unmapped, struct collapse_control *cc,
> > > > +           unsigned int order)
> > > >  {
> > > > +   const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > > +   const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > >     LIST_HEAD(compound_pagelist);
> > > >     pmd_t *pmd, _pmd;
> > > > -   pte_t *pte;
> > > > +   pte_t *pte = NULL;
> > >
> > > As mentioned elsewhere for some reason this was dropped in
> > > mm-unstable. Maybe a bad conflict resolution?
> > >
> > > >     pgtable_t pgtable;
> > > >     struct folio *folio;
> > > >     spinlock_t *pmd_ptl, *pte_ptl;
> > > >     enum scan_result result = SCAN_FAIL;
> > > >     struct vm_area_struct *vma;
> > > >     struct mmu_notifier_range range;
> > > > +   bool anon_vma_locked = false;
> > > >
> > > > -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > -
> > > > -   result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > > +   result = alloc_charge_folio(&folio, mm, cc, order);
> > > >     if (result != SCAN_SUCCEED)
> > > >             goto out_nolock;
> > > >
> > > >     mmap_read_lock(mm);
> > > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > -                                    HPAGE_PMD_ORDER);
> > > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > +                                    &vma, cc, order);
> > > >     if (result != SCAN_SUCCEED) {
> > > >             mmap_read_unlock(mm);
> > > >             goto out_nolock;
> > > >     }
> > > >
> > > > -   result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > +   result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > >     if (result != SCAN_SUCCEED) {
> > > >             mmap_read_unlock(mm);
> > > >             goto out_nolock;
> > > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >              * released when it fails. So we jump out_nolock directly in
> > > >              * that case.  Continuing to collapse causes inconsistency.
> > > >              */
> > > > -           result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > > -                                                referenced, HPAGE_PMD_ORDER);
> > > > +           result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > > +                                                referenced, order);
> > > >             if (result != SCAN_SUCCEED)
> > > >                     goto out_nolock;
> > > >     }
> > > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >      * mmap_lock.
> > > >      */
> > > >     mmap_write_lock(mm);
> > > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > -                                    HPAGE_PMD_ORDER);
> > > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > +                                    &vma, cc, order);
> > > >     if (result != SCAN_SUCCEED)
> > > >             goto out_up_write;
> > > >     /* check if the pmd is still valid */
> > > >     vma_start_write(vma);
> >
> > Hmm actually I think we have another problem here.
> >
> > For PMD THP this is fine. Only a single VMA can span the range we need, and it
> > will span the entire PMD.
> >
> > But for mTHP we have an issue...
> >
> > See below...
> >
> > > > -   result = check_pmd_still_valid(mm, address, pmd);
> > > > +   result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > >     if (result != SCAN_SUCCEED)
> > > >             goto out_up_write;
> > > >
> > > >     anon_vma_lock_write(vma->anon_vma);
> > > > +   anon_vma_locked = true;
> > >
> > > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > > change alters that, but Claude did suggest on the s390 bug that longer lock
> > > hold might be an issue.
> > >
> > > I wonder if we'll observe lock contention as a result?
> > >
> > > Correct me if I'm wrong and we're not holding longer than previously,
> > > however. Just appears that we do.
> > >
> > > >
> > > > -   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > -                           address + HPAGE_PMD_SIZE);
> > > > +   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > > +                           end_addr);
> > > >     mmu_notifier_invalidate_range_start(&range);
> > > >
> > > >     pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >      * Parallel GUP-fast is fine since GUP-fast will back off when
> > > >      * it detects PMD is changed.
> > > >      */
> > > > -   _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > > +   _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> >
> > ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> > but for mTHP we might have _another_ VMA that spans another part of the range
> > mapped by the same PMD entry.
> >
> > So we clear this, but we do not have a write lock on any other VMA, and so
> > racing VMA read locks can install a new PMD entry.
> >
> > > >     spin_unlock(pmd_ptl);
> >
> > Especially since you unlock this :)
> >
> > And...
> >
> > > >     mmu_notifier_invalidate_range_end(&range);
> > > >     tlb_remove_table_sync_one();
> > > >
> > > > -   pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > > +   pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > >     if (pte) {
> > > > -           result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > > -                                                 HPAGE_PMD_ORDER,
> > > > -                                                 &compound_pagelist);
> > > > +           result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > > +                                                 order, &compound_pagelist);
> > > >             spin_unlock(pte_ptl);
> > > >     } else {
> > > >             result = SCAN_NO_PTE_TABLE;
> > > >     }
> > > >
> > > >     if (unlikely(result != SCAN_SUCCEED)) {
> > > > -           if (pte)
> > > > -                   pte_unmap(pte);
> > >
> > > OK I seem to remember this is because we're holding the anon_vma lock
> > > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > bit longer also as well as the anon_vma loc.
> > >
> > > I guess it's also because we need to hold anon_vma and pte lock because
> > > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > > THP did.
> > >
> > > (Rememberings going on here :)
> > >
> > > >             spin_lock(pmd_ptl);
> > > > -           BUG_ON(!pmd_none(*pmd));
> > > > +           WARN_ON_ONCE(!pmd_none(*pmd));
> >
> > ...this will get triggered.
> >
> > I don't know whether we can safely hold the PMD lock across everything here for
> > mTHP?
> >
> > Maybe the solution would have to be to scan through VMAs in the range of the PMD
> > and VMA write lock each of them?
>
> I believe we've spoken about this before, but because we always make

Maybe worth a comment then...? Ah how rewarding review is :)

This is something that somebody else might very well wonder about and
forget that it happens to be covered there.

Also:

/* Always check the PMD order to ensure its not shared by another VMA */

Is pretty lightweight there. Something about avoiding racing page faults
would be helpful.

> sure the VMA spans the full PMD we won't ever hit this issue. If we
> wanted to support mTHP collapse on regions smaller than a PMD, the
> locking gets tricky (hence the design choice to not do that for now).
>
> This is handled by the HPAGE_ORDER in hugepage_vma_revalidate().

The existing code is atrocious, and sticking this on top has added to the
pile of assumptions and conventions and having to go check a bunch of
functions to 'just know' you're safe for X, Y, Z.

We really need to see some cleanup series coming after this and I'm going
to get pretty grumpy(ier) if we don't.

>
> /* Always check the PMD order to ensure its not shared by another VMA */
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>
> -- Nico
>
> >
> > That could cause some 'interesting' lock contention issues though? Then again,
> > we will be releasing the mmap write lock soon enough which will drop the VMA
> > write locks.
> >
> > > >             /*
> > > >              * We can only use set_pmd_at when establishing
> > > >              * hugepmds and never for establishing regular pmds that
> > > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >              */
> > > >             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > >             spin_unlock(pmd_ptl);
> > > > -           anon_vma_unlock_write(vma->anon_vma);
> > > >             goto out_up_write;
> > > >     }
> > > >
> > > >     /*
> > > > -    * All pages are isolated and locked so anon_vma rmap
> > > > -    * can't run anymore.
> > > > +    * For PMD collapse all pages are isolated and locked so anon_vma
> > > > +    * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > > +    * removed and not all pages are isolated and locked, so we must hold
> > >
> > > Right because some PTE entries be unaffected by the change.
> > >
> > > > +    * the lock to prevent neighboring folios from attempting to access
> > > > +    * this PMD until its reinstalled.
> > >
> > > OK. This is slightly annoying for my CoW context work as it means there's
> > > another case where we need to explicitly hold an anon_vma lock for
> > > correctness :)
> > >
> > > Anyway I will think about that separately, is what it is. And in fact
> > > motivates to want this merged earlier so I can work against it :)
> > >
> > >
> > > >      */
> > > > -   anon_vma_unlock_write(vma->anon_vma);
> > > > +   if (is_pmd_order(order)) {
> > > > +           anon_vma_unlock_write(vma->anon_vma);
> > > > +           anon_vma_locked = false;
> > > > +   }
> > > >
> > > >     result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > > -                                      vma, address, pte_ptl,
> > > > -                                      HPAGE_PMD_ORDER,
> > > > -                                      &compound_pagelist);
> > > > -   pte_unmap(pte);
> > > > +                                      vma, start_addr, pte_ptl,
> > > > +                                      order, &compound_pagelist);
> > > >     if (unlikely(result != SCAN_SUCCEED))
> > > >             goto out_up_write;
> > > >
> > > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >      * write.
> > > >      */
> > > >     __folio_mark_uptodate(folio);
> > > > -   pgtable = pmd_pgtable(_pmd);
> > > > -
> > > >     spin_lock(pmd_ptl);
> > > > -   BUG_ON(!pmd_none(*pmd));
> > > > -   pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > -   map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > > +   WARN_ON_ONCE(!pmd_none(*pmd));
> > > > +   if (is_pmd_order(order)) {
> > > > +           pgtable = pmd_pgtable(_pmd);
> > > > +           pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > +           map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > > +   } else {
> > > > +           /*
> > > > +            * set_ptes is called in map_anon_folio_pte_nopf with the
> > > > +            * pmd_ptl lock still held; this is safe as the PMD is expected
> > >
> > > PMD entry you mean?
> > >
> > > > +            * to be none. The pmd entry is then repopulated below.
> > > > +            */
> > > > +           map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > >
> > > So here we populate entries in the existing PTE _table_ to point at the new
> > > order>0 folio? With arm64 of course doing transparent contpte stuff?
> > >
> > > > +           smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > > +           pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >
> > > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > > before?
> > >
> > > > +   }
> > > >     spin_unlock(pmd_ptl);
> > > >
> > > >     folio = NULL;
> > > >
> > > >     result = SCAN_SUCCEED;
> > > >  out_up_write:
> > > > +   if (anon_vma_locked)
> > > > +           anon_vma_unlock_write(vma->anon_vma);
> > > > +   if (pte)
> > > > +           pte_unmap(pte);
> > > >     mmap_write_unlock(mm);
> > > >  out_nolock:
> > > >     if (folio)
> > > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > >             /* collapse_huge_page expects the lock to be dropped before calling */
> > > >             mmap_read_unlock(mm);
> > > >             result = collapse_huge_page(mm, start_addr, referenced,
> > > > -                                       unmapped, cc);
> > > > +                                       unmapped, cc, HPAGE_PMD_ORDER);
> > > >             /* collapse_huge_page will return with the mmap_lock released */
> > > >             *lock_dropped = true;
> > > >     }
> > > > --
> > > > 2.54.0
> > > >
> >
> > Thanks, Lorenzo
> >
>


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
  2026-06-04 12:55         ` Lorenzo Stoakes
@ 2026-06-04 16:28           ` Nico Pache
  0 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-04 16:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif

On Thu, Jun 4, 2026 at 6:56 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Jun 04, 2026 at 06:45:58AM -0600, Nico Pache wrote:
> > On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > > > I will go review the thread about the cache maintenance separately and
> > > > respond about that.
> > > >
> > > > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > > > are attempting to collapse to, and offset indicates were in the PMD to
> > > > > start the collapse attempt.
> > > > >
> > > > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > > > the mTHP case this is not true, and we must keep the lock to prevent
> > > > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > > > a pmd_none while the PMD entry is currently unavailable due to being
> > > > > temporarily removed during the collapse phase.
> > > > >
> > > > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > >
> > > > The logic LGTM generally, some questions for understanding below, and of
> > > > course as per above I want to review the Lance/David subthread.
> > > >
> > > > Thanks!
> > > >
> > > > > ---
> > > > >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > > >  1 file changed, 55 insertions(+), 38 deletions(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index fab35d318641..d64f42f66236 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > > >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > > >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > > >   */
> > > > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > > -           int referenced, int unmapped, struct collapse_control *cc)
> > > > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > > > +           int referenced, int unmapped, struct collapse_control *cc,
> > > > > +           unsigned int order)
> > > > >  {
> > > > > +   const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > > > +   const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > > >     LIST_HEAD(compound_pagelist);
> > > > >     pmd_t *pmd, _pmd;
> > > > > -   pte_t *pte;
> > > > > +   pte_t *pte = NULL;
> > > >
> > > > As mentioned elsewhere for some reason this was dropped in
> > > > mm-unstable. Maybe a bad conflict resolution?
> > > >
> > > > >     pgtable_t pgtable;
> > > > >     struct folio *folio;
> > > > >     spinlock_t *pmd_ptl, *pte_ptl;
> > > > >     enum scan_result result = SCAN_FAIL;
> > > > >     struct vm_area_struct *vma;
> > > > >     struct mmu_notifier_range range;
> > > > > +   bool anon_vma_locked = false;
> > > > >
> > > > > -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > > -
> > > > > -   result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > > > +   result = alloc_charge_folio(&folio, mm, cc, order);
> > > > >     if (result != SCAN_SUCCEED)
> > > > >             goto out_nolock;
> > > > >
> > > > >     mmap_read_lock(mm);
> > > > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > > -                                    HPAGE_PMD_ORDER);
> > > > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > > +                                    &vma, cc, order);
> > > > >     if (result != SCAN_SUCCEED) {
> > > > >             mmap_read_unlock(mm);
> > > > >             goto out_nolock;
> > > > >     }
> > > > >
> > > > > -   result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > > +   result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > > >     if (result != SCAN_SUCCEED) {
> > > > >             mmap_read_unlock(mm);
> > > > >             goto out_nolock;
> > > > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > >              * released when it fails. So we jump out_nolock directly in
> > > > >              * that case.  Continuing to collapse causes inconsistency.
> > > > >              */
> > > > > -           result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > > > -                                                referenced, HPAGE_PMD_ORDER);
> > > > > +           result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > > > +                                                referenced, order);
> > > > >             if (result != SCAN_SUCCEED)
> > > > >                     goto out_nolock;
> > > > >     }
> > > > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > >      * mmap_lock.
> > > > >      */
> > > > >     mmap_write_lock(mm);
> > > > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > > -                                    HPAGE_PMD_ORDER);
> > > > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > > +                                    &vma, cc, order);
> > > > >     if (result != SCAN_SUCCEED)
> > > > >             goto out_up_write;
> > > > >     /* check if the pmd is still valid */
> > > > >     vma_start_write(vma);
> > >
> > > Hmm actually I think we have another problem here.
> > >
> > > For PMD THP this is fine. Only a single VMA can span the range we need, and it
> > > will span the entire PMD.
> > >
> > > But for mTHP we have an issue...
> > >
> > > See below...
> > >
> > > > > -   result = check_pmd_still_valid(mm, address, pmd);
> > > > > +   result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > > >     if (result != SCAN_SUCCEED)
> > > > >             goto out_up_write;
> > > > >
> > > > >     anon_vma_lock_write(vma->anon_vma);
> > > > > +   anon_vma_locked = true;
> > > >
> > > > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > > > change alters that, but Claude did suggest on the s390 bug that longer lock
> > > > hold might be an issue.
> > > >
> > > > I wonder if we'll observe lock contention as a result?
> > > >
> > > > Correct me if I'm wrong and we're not holding longer than previously,
> > > > however. Just appears that we do.
> > > >
> > > > >
> > > > > -   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > > -                           address + HPAGE_PMD_SIZE);
> > > > > +   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > > > +                           end_addr);
> > > > >     mmu_notifier_invalidate_range_start(&range);
> > > > >
> > > > >     pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > >      * Parallel GUP-fast is fine since GUP-fast will back off when
> > > > >      * it detects PMD is changed.
> > > > >      */
> > > > > -   _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > > > +   _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> > >
> > > ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> > > but for mTHP we might have _another_ VMA that spans another part of the range
> > > mapped by the same PMD entry.
> > >
> > > So we clear this, but we do not have a write lock on any other VMA, and so
> > > racing VMA read locks can install a new PMD entry.
> > >
> > > > >     spin_unlock(pmd_ptl);
> > >
> > > Especially since you unlock this :)
> > >
> > > And...
> > >
> > > > >     mmu_notifier_invalidate_range_end(&range);
> > > > >     tlb_remove_table_sync_one();
> > > > >
> > > > > -   pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > > > +   pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > > >     if (pte) {
> > > > > -           result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > > > -                                                 HPAGE_PMD_ORDER,
> > > > > -                                                 &compound_pagelist);
> > > > > +           result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > > > +                                                 order, &compound_pagelist);
> > > > >             spin_unlock(pte_ptl);
> > > > >     } else {
> > > > >             result = SCAN_NO_PTE_TABLE;
> > > > >     }
> > > > >
> > > > >     if (unlikely(result != SCAN_SUCCEED)) {
> > > > > -           if (pte)
> > > > > -                   pte_unmap(pte);
> > > >
> > > > OK I seem to remember this is because we're holding the anon_vma lock
> > > > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > > bit longer also as well as the anon_vma loc.
> > > >
> > > > I guess it's also because we need to hold anon_vma and pte lock because
> > > > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > > > THP did.
> > > >
> > > > (Rememberings going on here :)
> > > >
> > > > >             spin_lock(pmd_ptl);
> > > > > -           BUG_ON(!pmd_none(*pmd));
> > > > > +           WARN_ON_ONCE(!pmd_none(*pmd));
> > >
> > > ...this will get triggered.
> > >
> > > I don't know whether we can safely hold the PMD lock across everything here for
> > > mTHP?
> > >
> > > Maybe the solution would have to be to scan through VMAs in the range of the PMD
> > > and VMA write lock each of them?
> >
> > I believe we've spoken about this before, but because we always make
>
> Maybe worth a comment then...? Ah how rewarding review is :)

I'll expand the commit message and the comment in patch 1 of the series! Thanks.

>
> This is something that somebody else might very well wonder about and
> forget that it happens to be covered there.
>
> Also:
>
> /* Always check the PMD order to ensure its not shared by another VMA */
>
> Is pretty lightweight there. Something about avoiding racing page faults
> would be helpful.

Yeah, fair enough. The commit message of patch 1 also doesn't really do
it justice on the *why*.
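
For anyone reading along, here is a minimal user-space sketch of the
invariant being discussed: collapse is only attempted when the PMD-aligned,
PMD-sized range around the address lies entirely inside one VMA, so write
locking that single VMA covers every VMA-locked fault that could touch the
PMD entry. The struct, macros and function name are hypothetical, the
geometry assumes 4 KiB pages and a 2 MiB PMD, and this only models the range
check; it is not the kernel's thp_vma_suitable_order().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 4 KiB base pages, 2 MiB PMD (typical x86-64). */
#define MODEL_PAGE_SHIFT 12u
#define MODEL_PMD_ORDER  9u
#define MODEL_PMD_SIZE   ((uint64_t)1 << (MODEL_PAGE_SHIFT + MODEL_PMD_ORDER))
#define MODEL_PMD_MASK   (~(MODEL_PMD_SIZE - 1))

/* Hypothetical stand-in for the start/end bounds of a VMA. */
struct model_vma {
	uint64_t start; /* inclusive */
	uint64_t end;   /* exclusive */
};

/*
 * Return true when the PMD-sized, PMD-aligned range containing addr lies
 * entirely inside the VMA. If it does, no other VMA can share the PMD
 * entry, which is the property the thread relies on for mTHP collapse.
 */
static bool model_vma_spans_pmd(const struct model_vma *vma, uint64_t addr)
{
	uint64_t pmd_start = addr & MODEL_PMD_MASK;

	assert(vma->start <= vma->end);
	return pmd_start >= vma->start &&
	       pmd_start + MODEL_PMD_SIZE <= vma->end;
}
```

A VMA covering [0x200000, 0x600000) passes for any address in its first
2 MiB, while one starting at 0x210000 fails for the same PMD because the
first 64 KiB of that PMD range could belong to a different VMA.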

>
> > sure the VMA spans the full PMD we won't ever hit this issue. If we
> > wanted to support mTHP collapse on regions smaller than a PMD, the
> > locking gets tricky (hence the design choice to not do that for now).
> >
> > This is handled by the HPAGE_ORDER in hugepage_vma_revalidate().
>
> The existing code is atrocious, and sticking this on top has added to the
> pile of assumptions and conventions and having to go check a bunch of
> functions to 'just know' you're safe for X, Y, Z.
>
> We really need to see some cleanup series coming after this and I'm going
> to get pretty grumpy(ier) if we don't.

Many more to come :) Improvements too but cleanups first!

Cheers,
-- Nico

>
> >
> > /* Always check the PMD order to ensure its not shared by another VMA */
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> >
> > -- Nico
> >
> > >
> > > That could cause some 'interesting' lock contention issues though? Then again,
> > > we will be releasing the mmap write lock soon enough which will drop the VMA
> > > write locks.
> > >
> > > > >             /*
> > > > >              * We can only use set_pmd_at when establishing
> > > > >              * hugepmds and never for establishing regular pmds that
> > > > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > >              */
> > > > >             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > > >             spin_unlock(pmd_ptl);
> > > > > -           anon_vma_unlock_write(vma->anon_vma);
> > > > >             goto out_up_write;
> > > > >     }
> > > > >
> > > > >     /*
> > > > > -    * All pages are isolated and locked so anon_vma rmap
> > > > > -    * can't run anymore.
> > > > > +    * For PMD collapse all pages are isolated and locked so anon_vma
> > > > > +    * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > > > +    * removed and not all pages are isolated and locked, so we must hold
> > > >
> > > > Right because some PTE entries be unaffected by the change.
> > > >
> > > > > +    * the lock to prevent neighboring folios from attempting to access
> > > > > +    * this PMD until its reinstalled.
> > > >
> > > > OK. This is slightly annoying for my CoW context work as it means there's
> > > > another case where we need to explicitly hold an anon_vma lock for
> > > > correctness :)
> > > >
> > > > Anyway I will think about that separately, is what it is. And in fact
> > > > motivates to want this merged earlier so I can work against it :)
> > > >
> > > >
> > > > >      */
> > > > > -   anon_vma_unlock_write(vma->anon_vma);
> > > > > +   if (is_pmd_order(order)) {
> > > > > +           anon_vma_unlock_write(vma->anon_vma);
> > > > > +           anon_vma_locked = false;
> > > > > +   }
> > > > >
> > > > >     result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > > > -                                      vma, address, pte_ptl,
> > > > > -                                      HPAGE_PMD_ORDER,
> > > > > -                                      &compound_pagelist);
> > > > > -   pte_unmap(pte);
> > > > > +                                      vma, start_addr, pte_ptl,
> > > > > +                                      order, &compound_pagelist);
> > > > >     if (unlikely(result != SCAN_SUCCEED))
> > > > >             goto out_up_write;
> > > > >
> > > > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > >      * write.
> > > > >      */
> > > > >     __folio_mark_uptodate(folio);
> > > > > -   pgtable = pmd_pgtable(_pmd);
> > > > > -
> > > > >     spin_lock(pmd_ptl);
> > > > > -   BUG_ON(!pmd_none(*pmd));
> > > > > -   pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > > -   map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > > > +   WARN_ON_ONCE(!pmd_none(*pmd));
> > > > > +   if (is_pmd_order(order)) {
> > > > > +           pgtable = pmd_pgtable(_pmd);
> > > > > +           pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > > +           map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > > > +   } else {
> > > > > +           /*
> > > > > +            * set_ptes is called in map_anon_folio_pte_nopf with the
> > > > > +            * pmd_ptl lock still held; this is safe as the PMD is expected
> > > >
> > > > PMD entry you mean?
> > > >
> > > > > +            * to be none. The pmd entry is then repopulated below.
> > > > > +            */
> > > > > +           map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > > >
> > > > So here we populate entries in the existing PTE _table_ to point at the new
> > > > order>0 folio? With arm64 of course doing transparent contpte stuff?
> > > >
> > > > > +           smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > > > +           pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > >
> > > > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > > > before?
> > > >
> > > > > +   }
> > > > >     spin_unlock(pmd_ptl);
> > > > >
> > > > >     folio = NULL;
> > > > >
> > > > >     result = SCAN_SUCCEED;
> > > > >  out_up_write:
> > > > > +   if (anon_vma_locked)
> > > > > +           anon_vma_unlock_write(vma->anon_vma);
> > > > > +   if (pte)
> > > > > +           pte_unmap(pte);
> > > > >     mmap_write_unlock(mm);
> > > > >  out_nolock:
> > > > >     if (folio)
> > > > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > > >             /* collapse_huge_page expects the lock to be dropped before calling */
> > > > >             mmap_read_unlock(mm);
> > > > >             result = collapse_huge_page(mm, start_addr, referenced,
> > > > > -                                       unmapped, cc);
> > > > > +                                       unmapped, cc, HPAGE_PMD_ORDER);
> > > > >             /* collapse_huge_page will return with the mmap_lock released */
> > > > >             *lock_dropped = true;
> > > > >     }
> > > > > --
> > > > > 2.54.0
> > > > >
> > >
> > > Thanks, Lorenzo
> > >
> >
>



* [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (5 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-22 21:51   ` David Hildenbrand (Arm)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
if it's OK to collapse to a smaller mTHP size (like in the case of a
partially mapped folio). This check is also not done during the scan phase
as the current collapse order is unknown at that time.

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d64f42f66236..928e32a0d4d7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -689,6 +689,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				goto out;
 			}
 		}
+		/*
+		 * TODO: In some cases of partially-mapped folios, we'd actually
+		 * want to collapse.
+		 */
+		if (!is_pmd_order(order) && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
 
 		if (folio_test_large(folio)) {
 			struct folio *f;
-- 
2.54.0



* Re: [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders
  2026-05-22 15:00 ` [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2026-05-22 21:51   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 21:51 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Usama Arif

On 5/22/26 17:00, Nico Pache wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. 

The "some pages being unmapped" part is unclear.

I assume what you mean is "possibly resulting in a partially mapped source
folio, which is undesired."

But there is also the problem that we could try collapsing a folio to a
same-sized folio, which doesn't make sense (assuming the folio is fully mapped).

Clarify all that, please.
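
To make the two cases concrete, a small user-space sketch follows. It
mirrors the condition added by the patch
(!is_pmd_order(order) && folio_order(folio) >= order) and splits the
"skip" outcome into the same-size and larger-source cases described above.
The enum, macro and function names are hypothetical, and a PMD order of 9 is
an assumption (4 KiB pages, 2 MiB PMD).

```c
#include <assert.h>

/* Assumed PMD order for 4 KiB pages and a 2 MiB PMD. */
#define MODEL_PMD_ORDER 9u

enum model_collapse_decision {
	MODEL_COLLAPSE_OK,       /* source folio is smaller than the target */
	MODEL_SKIP_SAME_ORDER,   /* same-sized source: nothing to gain */
	MODEL_SKIP_LARGER_ORDER, /* larger source: would end up partially mapped */
};

/*
 * Classify one source folio against the order being collapsed to.
 * PMD-order collapse is exempt, matching the !is_pmd_order(order) part of
 * the check in the patch.
 */
static enum model_collapse_decision
model_classify_source(unsigned int src_order, unsigned int target_order)
{
	assert(target_order <= MODEL_PMD_ORDER);

	if (target_order == MODEL_PMD_ORDER)
		return MODEL_COLLAPSE_OK;
	if (src_order == target_order)
		return MODEL_SKIP_SAME_ORDER;
	if (src_order > target_order)
		return MODEL_SKIP_LARGER_ORDER;
	return MODEL_COLLAPSE_OK;
}
```

In the patch both skip outcomes are reported as SCAN_PTE_MAPPED_HUGEPAGE;
the split here is only to name the two situations separately.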

Acked-by: David Hildenbrand (arm) <david@kernel.org>

-- 
Cheers,

David


* [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (6 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-31 20:09   ` David Hildenbrand (Arm)
  2026-06-01 14:13   ` Lorenzo Stoakes
  2026-05-22 15:00 ` [PATCH mm-unstable v18 09/14] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

Add three new mTHP statistics to track collapse failures for different
orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:

- collapse_exceed_swap_pte: Counts when mTHP collapse fails due to
	encountering a swap PTE.

- collapse_exceed_none_pte: Counts when mTHP collapse fails due to
	exceeding the none PTE threshold for the given order.

- collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
	encountering a shared PTE.

These statistics complement the existing THP_SCAN_EXCEED_* events by
providing per-order granularity for mTHP collapse attempts. The stats are
exposed via sysfs under
`/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
supported hugepage size.

As we currently do not support collapsing mTHPs that contain a swap or
shared entry, those statistics keep track of how often we are
encountering failed mTHP collapses due to these restrictions.

We will add support for mTHP collapse for anonymous pages next; let's also
track when this happens at the PMD level within the per-mTHP stats.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
 include/linux/huge_mm.h                    |  3 +++
 mm/huge_memory.c                           |  7 +++++++
 mm/khugepaged.c                            | 21 +++++++++++++++++++--
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c51932e6275d..80a4d0bed70b 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -714,6 +714,20 @@ nr_anon_partially_mapped
        an anonymous THP as "partially mapped" and count it here, even though it
        is not actually partially mapped anymore.
 
+collapse_exceed_none_pte
+       The number of collapse attempts that failed due to exceeding the
+       max_ptes_none threshold.
+
+collapse_exceed_swap_pte
+       The number of collapse attempts that failed due to exceeding the
+       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
+       contains at least one swap PTE.
+
+collapse_exceed_shared_pte
+       The number of collapse attempts that failed due to exceeding the
+       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
+       contains at least one shared PTE.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba7ae6808544..48496f09909b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
 	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+	MTHP_STAT_COLLAPSE_EXCEED_NONE,
+	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 345c54133c83..5c128cdec810 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -719,6 +723,9 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
 	&nr_anon_partially_mapped_attr.attr,
+	&collapse_exceed_swap_pte_attr.attr,
+	&collapse_exceed_none_pte_attr.attr,
+	&collapse_exceed_shared_pte_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 928e32a0d4d7..fff6a8fbf1d4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -649,7 +649,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none_or_zero(pteval)) {
 			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				if (is_pmd_order(order))
+					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out;
 			}
 			continue;
@@ -683,9 +685,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
+			/*
+			 * TODO: Support shared pages without leading to further
+			 * mTHP collapses. Currently bringing in new pages via
+			 * shared may cause a future higher order collapse on a
+			 * rescan of the same range.
+			 */
 			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
-				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+				if (is_pmd_order(order))
+					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
 				goto out;
 			}
 		}
@@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 		 * range.
 		 */
 		if (!is_pmd_order(order)) {
+			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 			pte_unmap(pte);
 			mmap_read_unlock(mm);
 			result = SCAN_EXCEED_SWAP_PTE;
@@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				count_mthp_stat(HPAGE_PMD_ORDER,
+						MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out_unmap;
 			}
 			continue;
@@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 			if (++unmapped > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+				count_mthp_stat(HPAGE_PMD_ORDER,
+						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 				goto out_unmap;
 			}
 			/*
@@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+				count_mthp_stat(HPAGE_PMD_ORDER,
+						MTHP_STAT_COLLAPSE_EXCEED_SHARED);
 				goto out_unmap;
 			}
 		}
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
  2026-05-22 15:00 ` [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2026-05-31 20:09   ` David Hildenbrand (Arm)
  2026-06-01 14:13   ` Lorenzo Stoakes
  1 sibling, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-31 20:09 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 5/22/26 17:00, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> 
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to
> 	encountering a swap PTE.
> 
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
> 
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> 	encountering a shared PTE.
> 
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
> 
> As we currently do not support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
> 
> We will add support for mTHP collapse for anonymous pages next; lets also
> track when this happens at the PMD level within the per-mTHP stats.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 21 +++++++++++++++++++--
>  4 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..80a4d0bed70b 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,20 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>  
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold.
> +
> +collapse_exceed_swap_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index ba7ae6808544..48496f09909b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 345c54133c83..5c128cdec810 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);

[...]

>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */

This comment actually belongs in an earlier patch, no?

>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> @@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out_unmap;
>  			}
>  			continue;
> @@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  				goto out_unmap;
>  			}
>  			/*
> @@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SHARED);

Can be done as a later cleanup, but having a single function that obtains an
order and knows which stats to update would be cleaner (and a good preparation
for shmem mTHP collapse support).
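
A rough userspace sketch of what such a helper could look like. Everything
here is an assumption for illustration: count_collapse_exceed(), the
scan_exceed enum and the mock counter arrays are hypothetical stand-ins for
count_vm_event()/count_mthp_stat(), and PMD_ORDER is hardcoded to 9
(4K pages, 2M PMD).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4K base pages with a 2M PMD, so the PMD order is 9. */
#define PMD_ORDER 9u

/* Hypothetical reason codes, one per "exceed" failure type. */
enum scan_exceed { EXCEED_NONE, EXCEED_SWAP, EXCEED_SHARED, EXCEED_COUNT };

/* Mock of the global THP_SCAN_EXCEED_* vm events. */
static unsigned long vm_events[EXCEED_COUNT];
/* Mock of the per-order MTHP_STAT_COLLAPSE_EXCEED_* counters. */
static unsigned long mthp_stats[PMD_ORDER + 1][EXCEED_COUNT];

static bool is_pmd_order(unsigned int order)
{
	return order == PMD_ORDER;
}

/*
 * Hypothetical single entry point: callers pass the order and the failure
 * reason; the helper decides which counters to bump. The legacy vm event
 * is only updated for PMD order, the per-order stat always.
 */
static void count_collapse_exceed(unsigned int order, enum scan_exceed what)
{
	assert(order <= PMD_ORDER && what < EXCEED_COUNT);

	if (is_pmd_order(order))
		vm_events[what]++;
	mthp_stats[order][what]++;
}
```

With something along these lines, the call sites in
__collapse_huge_page_isolate() and collapse_scan_pmd() would each shrink to
one call, and a shmem variant could later hook into the same place.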

Nothing jumped out at me, so

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
  2026-05-22 15:00 ` [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
  2026-05-31 20:09   ` David Hildenbrand (Arm)
@ 2026-06-01 14:13   ` Lorenzo Stoakes
  1 sibling, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-01 14:13 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Fri, May 22, 2026 at 09:00:03AM -0600, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to
> 	encountering a swap PTE.
>
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
>
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> 	encountering a shared PTE.
>
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
>
> As we currently do not support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
>
> We will add support for mTHP collapse for anonymous pages next; lets also
> track when this happens at the PMD level within the per-mTHP stats.
>
> Signed-off-by: Nico Pache <npache@redhat.com>

Logic LGTM, small comment below, but otherwise:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 21 +++++++++++++++++++--
>  4 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..80a4d0bed70b 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,20 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold.
> +
> +collapse_exceed_swap_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index ba7ae6808544..48496f09909b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 345c54133c83..5c128cdec810 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +
>
>  static struct attribute *anon_stats_attrs[] = {
>  	&anon_fault_alloc_attr.attr,
> @@ -719,6 +723,9 @@ static struct attribute *anon_stats_attrs[] = {
>  	&split_deferred_attr.attr,
>  	&nr_anon_attr.attr,
>  	&nr_anon_partially_mapped_attr.attr,
> +	&collapse_exceed_swap_pte_attr.attr,
> +	&collapse_exceed_none_pte_attr.attr,
> +	&collapse_exceed_shared_pte_attr.attr,
>  	NULL,
>  };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 928e32a0d4d7..fff6a8fbf1d4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -649,7 +649,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out;
>  			}
>  			continue;
> @@ -683,9 +685,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */

Seems weird to add this comment in a stats update patch?

Not a massively big deal but maybe worth moving to the relevant commit.

>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> @@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out_unmap;
>  			}
>  			continue;
> @@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  				goto out_unmap;
>  			}
>  			/*
> @@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out_unmap;
>  			}
>  		}
> --
> 2.54.0
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH mm-unstable v18 09/14] mm/khugepaged: improve tracepoints for mTHP orders
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (7 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-22 15:00 ` [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
give better insight into which order is being operated on.

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    |  9 ++++----
 2 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index bcdc57eea270..291fae364c62 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -89,40 +89,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s, order=%u",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct folio *folio, int none_or_zero,
-		 int referenced, int status),
+		 int referenced, int status, unsigned int order),
 
-	TP_ARGS(folio, none_or_zero, referenced, status),
+	TP_ARGS(folio, none_or_zero, referenced, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
 		__field(int, none_or_zero)
 		__field(int, referenced)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -130,26 +134,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->none_or_zero = none_or_zero;
 		__entry->referenced = referenced;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s, order=%u",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+		 unsigned int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -157,13 +165,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fff6a8fbf1d4..4534025bc81d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -783,13 +783,13 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-						    referenced, result);
+						    referenced, result, order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-					    referenced, result);
+					    referenced, result, order);
 	return result;
 }
 
@@ -1189,7 +1189,8 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+					   order);
 	return result;
 }
 
@@ -1397,7 +1398,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 out_nolock:
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (8 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 09/14] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-31 20:18   ` David Hildenbrand (Arm)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

Add collapse_allowable_orders() to generalize THP order eligibility. The
function determines which THP orders are permitted based on collapse
context (khugepaged vs madv_collapse).

This consolidates collapse configuration logic and provides a clean
interface for future mTHP collapse support where the orders may be
different.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4534025bc81d..64ceebc9d8a7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
 		wake_up_interruptible(&khugepaged_wait);
 }
 
+/* Check what orders are allowed based on the vma and collapse type */
+static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
+		vm_flags_t vm_flags, enum tva_type tva_flags)
+{
+	unsigned long orders = BIT(HPAGE_PMD_ORDER);
+
+	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+}
+
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
 			cc->progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
 			cc->progress++;
 			continue;
 		}
@@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
 		return -EINVAL;
 
 	cc = kmalloc_obj(*cc);
-- 
2.54.0



* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-05-22 15:00 ` [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2026-05-31 20:18   ` David Hildenbrand (Arm)
  2026-06-01 14:35     ` Lorenzo Stoakes
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-31 20:18 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 5/22/26 17:00, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
> 
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.

It would have been good to describe here that, for now, it only ever returns
PMDs, and that it will be extended next.

Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit

... and there, it is clear that we don't even want to know the orders?

So can we just call this function

"collapse_possible" and make it return a boolean?

> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4534025bc81d..64ceebc9d8a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
>  		wake_up_interruptible(&khugepaged_wait);
>  }
>  
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +		vm_flags_t vm_flags, enum tva_type tva_flags)
> +{
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>  			cc->progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
>  			cc->progress++;
>  			continue;
>  		}
> @@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	BUG_ON(vma->vm_start > start);
>  	BUG_ON(vma->vm_end < end);
>  
> -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> +	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
>  		return -EINVAL;
>  
>  	cc = kmalloc_obj(*cc);

Having a simple

static bool collapse_possible(...)
{
	return collapse_allowable_orders(...)
}

Would make the above slightly more readable.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-05-31 20:18   ` David Hildenbrand (Arm)
@ 2026-06-01 14:35     ` Lorenzo Stoakes
  2026-06-01 14:40       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-01 14:35 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Sun, May 31, 2026 at 10:18:16PM +0200, David Hildenbrand (Arm) wrote:
> On 5/22/26 17:00, Nico Pache wrote:
> > Add collapse_allowable_orders() to generalize THP order eligibility. The
> > function determines which THP orders are permitted based on collapse
> > context (khugepaged vs madv_collapse).
> >
> > This consolidates collapse configuration logic and provides a clean
> > interface for future mTHP collapse support where the orders may be
> > different.
>
> It would have been good to describe here that, for now, it only ever returns
> PMDs, and that it will be extended next.
>
> Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit
>
> ... and there, it is clear that we don't even want to know the orders?
>
> So can we just call this function
>
> "collapse_possible" and make it return a boolean?

Yeah agreed.

But I also don't love the naming, we now have thp_vma_allowable_orders(),
__thp_vma_allowable_orders(), and then collapse_allowable_orders() and we also
have 3 different ways of collapsing, one of which we call MADV_... COLLAPSE,
and the other khugepaged + fault-in too for laughs.

It's like a big circle of confusion.

And of course we call THP collapse 'collapse' in general, so :)

Anyway I'm fine with collapse_possible() so we can move on and then maybe
cleanup later.

Also - if I look at khugepaged.c I see thp_vma_allowable_orders() still used in
hugepage_vma_revalidate().

Wouldn't it be better then to do the abstraction once mTHP order checking is
properly introduced and change this also and have _every_ order check be
consistent in khugepaged.c?

>
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 15 ++++++++++++---
> >  1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 4534025bc81d..64ceebc9d8a7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
> >  		wake_up_interruptible(&khugepaged_wait);
> >  }
> >
> > +/* Check what orders are allowed based on the vma and collapse type */

I'd expand this comment to explain that it's explicitly for accounting for
whether mTHP is used, but that also argues for this to be moved to a later
commit as David says.

Otherwise the comment is useless.

> > +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> > +		vm_flags_t vm_flags, enum tva_type tva_flags)
> > +{
> > +	unsigned long orders = BIT(HPAGE_PMD_ORDER);

Could be a const also.

> > +
> > +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> > +}
> > +
> >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> >  			  vm_flags_t vm_flags)
> >  {
> >  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> >  	    hugepage_pmd_enabled()) {
> > -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> > +		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))

I hate that we separate out the VMA flags like this just for this case, but
that's something for a follow up probably from me as part of a VMA flags
conversion series...

> >  			__khugepaged_enter(vma->vm_mm);
> >  	}
> >  }
> > @@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
> >  			cc->progress++;
> >  			break;
> >  		}
> > -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> > +		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
> >  			cc->progress++;
> >  			continue;
> >  		}
> > @@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  	BUG_ON(vma->vm_start > start);
> >  	BUG_ON(vma->vm_end < end);
> >
> > -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> > +	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
> >  		return -EINVAL;
> >
> >  	cc = kmalloc_obj(*cc);
>
> Having a simple
>
> static bool collapse_possible(...)
> {
> 	return collapse_allowable_orders(...)
> }
>
> Would make the above slightly more readable.

Yup.

>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
> --
> Cheers,
>
> David

Cheers, Lorenzo


* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
  2026-06-01 14:35     ` Lorenzo Stoakes
@ 2026-06-01 14:40       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 14:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 6/1/26 16:35, Lorenzo Stoakes wrote:
> On Sun, May 31, 2026 at 10:18:16PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/22/26 17:00, Nico Pache wrote:
>>> Add collapse_allowable_orders() to generalize THP order eligibility. The
>>> function determines which THP orders are permitted based on collapse
>>> context (khugepaged vs madv_collapse).
>>>
>>> This consolidates collapse configuration logic and provides a clean
>>> interface for future mTHP collapse support where the orders may be
>>> different.
>>
>> It would have been good to describe here that, for now, it only ever returns
>> PMDs, and that it will be extended next.
>>
>> Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit
>>
>> ... and there, it is clear that we don't even want to know the orders?
>>
>> So can we just call this function
>>
>> "collapse_possible" and make it return a boolean?

FWIW, I realized later that #11 has

enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);

and forgot to delete that comment.

So we must have a variant (for #11) that returns the enabled orders.

> 
> Yeah agreed.
> 
> But I also don't love the naming, we now have thp_vma_allowable_orders(),
> __thp_vma_allowable_orders(), and then collapse_allowable_orders() and we also
> have 3 different ways of collapsing, one of which we call MADV_... COLLAPSE,
> and the other khugepaged + fault-in too for laughs.
> 
> It's like a big circle of confusion.
> 
> And of course we call THP collapse 'collapse' in general, so :)
> 
> Anyway I'm fine with collapse_possible() so we can move on and then maybe
> cleanup later.

We could simply have

collapse_possible_orders()

and

collapse_possible()

the latter being a simple wrapper around collapse_possible_orders().

> 
> Also - if I look at khugepaged.c I see thp_vma_allowable_orders() still used in
> hugepage_vma_revalidate().
> 
> Wouldn't it be better then to do the abstraction once mTHP order checking is
> properly introduced and change this also and have _every_ order check be
> consistent in khugepaged.c?

That'd also be nice, if easily possible.

-- 
Cheers,

David


* [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (9 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-25 14:15   ` Nico Pache
                     ` (3 more replies)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
                   ` (6 subsequent siblings)
  17 siblings, 4 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

Enable khugepaged to collapse to mTHP orders. This patch implements the
main scanning logic using a bitmap to track occupied pages and a stack
structure that allows us to find optimal collapse sizes.

Prior to this patch, PMD collapse had 3 main phases: a lightweight
scanning phase (mmap_read_lock) that identifies a potential PMD
collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
phase (mmap_write_lock).

To enable mTHP collapse we make the following changes:

During the PMD scan phase, track occupied pages in a bitmap. When mTHP
orders are enabled, we remove the restriction of max_ptes_none during the
scan phase to avoid missing potential mTHP collapse candidates. Once we
have scanned the full PMD range and updated the bitmap to track occupied
pages, we use the bitmap to find the optimal mTHP size.

Implement mthp_collapse() to perform a binary subdivision of the bitmap
and determine the best eligible order for the collapse. An explicit stack
structure is used instead of recursion to manage the search, which keeps
kernel stack usage small and bounded. The algorithm repeatedly splits the
bitmap into smaller chunks to find the highest order mTHPs that satisfy
the collapse criteria. We start by attempting the PMD order, then move on
to consecutively lower orders (mTHP collapse). The stack maintains a pair
of variables (offset, order), indicating the number of PTEs from the
start of the PMD, and the order of the potential collapse candidate.

The algorithm for consuming the bitmap works as follows:
    1) push (0, HPAGE_PMD_ORDER) onto the stack
    2) pop the stack
    3) check if the number of set bits in that (offset,order) pair
       satisfies the max_ptes_none threshold for that order
    4) if yes, attempt collapse
    5) if no (or collapse fails), push two new stack items representing
       the left and right halves of the current bitmap range, at the
       next lower order
    6) repeat at step (2) until stack is empty.

Below is a diagram representing the algorithm and stack items:

                            offset   mid_offset
                            |        |
                            |        |
                            v        v
          ____________________________________
         |          PTE Page Table            |
         --------------------------------------
                            <-------><------->
                             order-1  order-1
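
The steps above can be sketched in user space as follows. This is only
an illustration under stated assumptions, not the kernel code: it assumes
a PMD order of 9, fixes max_ptes_none at 0 (so a range is eligible only
if every PTE in it is occupied), replaces the real collapse with a
counter, and uses hypothetical sketch_* names throughout.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PMD_ORDER	9	/* assumption: 512 PTEs per PMD */
#define SKETCH_PMD_NR		(1u << SKETCH_PMD_ORDER)
#define SKETCH_MIN_ORDER	2
#define SKETCH_STACK_SIZE	(SKETCH_PMD_ORDER - SKETCH_MIN_ORDER + 1)

struct sketch_range {
	unsigned short offset;	/* first PTE index of the candidate */
	unsigned char order;	/* candidate spans 1 << order PTEs */
};

/* Count occupied PTEs in [offset, offset + nr). */
static unsigned int sketch_count_present(const bool *occupied,
					 unsigned int offset, unsigned int nr)
{
	unsigned int i, present = 0;

	for (i = 0; i < nr; i++)
		present += occupied[offset + i];
	return present;
}

/*
 * Walk the occupancy map with an explicit stack. Returns the number of
 * PTEs covered by ranges that would be collapsed. enabled_orders must
 * not contain orders below SKETCH_MIN_ORDER. *max_depth reports the
 * deepest stack usage seen.
 */
static unsigned int sketch_mthp_walk(const bool *occupied,
				     unsigned long enabled_orders,
				     int *max_depth)
{
	struct sketch_range stack[SKETCH_STACK_SIZE];
	unsigned int collapsed = 0;
	int top = 0;

	/* Step 1: start with the whole PMD range. */
	stack[top++] = (struct sketch_range){ 0, SKETCH_PMD_ORDER };
	*max_depth = top;

	while (top) {
		/* Step 2: pop the next candidate. */
		const struct sketch_range r = stack[--top];
		const unsigned int nr = 1u << r.order;

		/* Steps 3 and 4: with max_ptes_none=0 all PTEs must be set. */
		if ((enabled_orders & (1ul << r.order)) &&
		    sketch_count_present(occupied, r.offset, nr) == nr) {
			collapsed += nr;	/* stand-in for the collapse */
			continue;
		}

		/* Step 5: split only while a lower order is still enabled. */
		if (((1ul << r.order) - 1) & enabled_orders) {
			const unsigned char next = r.order - 1;

			assert(top + 2 <= SKETCH_STACK_SIZE);
			stack[top++] = (struct sketch_range){ r.offset + nr / 2, next };
			stack[top++] = (struct sketch_range){ r.offset, next };
			if (top > *max_depth)
				*max_depth = top;
		}
	}
	return collapsed;
}
```

Pushing the right half before the left half means the left half is
popped first, so ranges are visited in address order. Since each pop is
followed by at most two pushes one order lower, the stack never needs
more than (PMD order - minimum order + 1) entries.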

mTHP collapses reject regions containing swapped out or shared pages.
This is because adding new entries can lead to new none pages, and these
may lead to constant promotion into a higher order mTHP. A similar
issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse then
at least doubles the number of occupied pages, so a future scan will
satisfy the promotion condition once again. This issue is prevented by
the collapse_max_ptes_none() function, which imposes the max_ptes_none
restrictions described below.

We currently only support mTHP collapse for max_ptes_none values of 0
and HPAGE_PMD_NR - 1, resulting in the following behavior:

    - max_ptes_none=0: Never introduce new empty pages during collapse
    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
      available mTHP order

Any other max_ptes_none value will emit a warning and default mTHP
collapse to max_ptes_none=0. There should be no behavior change for PMD
collapse.
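
As a rough illustration of the rule above, the per-order limit could be
derived as in the sketch below. This is not the collapse_max_ptes_none()
from this series (which takes cc, vma and order); the helper name, the
PMD order of 9, and the choice of "(1 << order) - 1" for the
HPAGE_PMD_NR - 1 case are assumptions made for the example.

```c
#include <stdio.h>

#define SKETCH_PMD_ORDER	9	/* assumption: 512 PTEs per PMD */
#define SKETCH_PMD_NR		(1u << SKETCH_PMD_ORDER)

/* Hypothetical helper: map the sysfs tunable to a limit for one order. */
static unsigned int sketch_max_ptes_none(unsigned int tunable,
					 unsigned int order)
{
	/* PMD collapse keeps its existing behavior. */
	if (order == SKETCH_PMD_ORDER)
		return tunable;

	/* Never introduce new empty pages. */
	if (tunable == 0)
		return 0;

	/* Assumed scaling: allow all but one PTE of the range to be none. */
	if (tunable == SKETCH_PMD_NR - 1)
		return (1u << order) - 1;

	/* Unsupported value: warn and fall back to 0 for mTHP orders. */
	fprintf(stderr, "unsupported max_ptes_none=%u for mTHP, using 0\n",
		tunable);
	return 0;
}
```

With this mapping, an order-4 candidate needs all 16 PTEs occupied when
the tunable is 0, and at least one occupied PTE when it is
HPAGE_PMD_NR - 1.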

Once we determine which mTHP sizes fit best in that PMD range, a collapse
is attempted. A minimum collapse order of 2 is used as this is the lowest
order supported by anon memory as defined by THP_ORDERS_ALL_ANON.

Currently MADV_COLLAPSE does not support mTHP and will only attempt PMD
collapse.

We can also remove the check for is_khugepaged inside the PMD scan as
the collapse_max_ptes_none() function handles this logic now.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 172 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 64ceebc9d8a7..d3d7db8be26c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+#define KHUGEPAGED_MIN_MTHP_ORDER	2
+/*
+ * mthp_collapse() does an iterative DFS over a binary tree, from
+ * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
+ * size needed for a DFS on a binary tree is height + 1, where
+ * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
+ *
+ * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
+ * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
+ */
+#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
+
+/*
+ * Defines a range of PTE entries in a PTE page table which are being
+ * considered for mTHP collapse.
+ *
+ * @offset: the offset of the first PTE entry in a PMD range.
+ * @order: the order of the PTE entries being considered for collapse.
+ */
+struct mthp_range {
+	u16 offset;
+	u8 order;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -110,6 +134,12 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/* Each bit represents a single occupied (!none/zero) page. */
+	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
+	/* A mask of the current range being considered for mTHP collapse. */
+	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
 };
 
 /**
@@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	return result;
 }
 
+static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
+				     u16 offset, u8 order)
+{
+	const int size = *stack_size;
+	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
+
+	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
+	stack->order = order;
+	stack->offset = offset;
+	(*stack_size)++;
+}
+
+static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
+						 int *stack_size)
+{
+	const int size = *stack_size;
+
+	VM_WARN_ON_ONCE(size <= 0);
+	(*stack_size)--;
+	return cc->mthp_bitmap_stack[size - 1];
+}
+
+static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
+						u16 offset, unsigned int nr_ptes)
+{
+	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
+	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+}
+
+/*
+ * mthp_collapse() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. The stack maintains a pair of
+ * variables (offset, order), indicating the number of PTEs from the start of
+ * the PMD, and the order of the potential collapse candidate respectively. We
+ * start at the PMD order and check if it is eligible for collapse; if not, we
+ * add two entries to the stack at a lower order to represent the left and right
+ * halves of the PTE page table we are examining.
+ *
+ *                         offset       mid_offset
+ *                         |         |
+ *                         |         |
+ *                         v         v
+ *      --------------------------------------
+ *      |          cc->mthp_bitmap            |
+ *      --------------------------------------
+ *                         <-------><------->
+ *                          order-1  order-1
+ *
+ * For each of these, we determine how many PTE entries are occupied in the
+ * range of PTE entries we propose to collapse, then we compare this to a
+ * threshold number of PTE entries which would need to be occupied for a
+ * collapse to be permitted at that order (accounting for max_ptes_none).
+ *
+ * If a collapse is permitted, we attempt to collapse the PTE range into a
+ * mTHP.
+ */
+static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, int referenced, int unmapped,
+		struct collapse_control *cc, unsigned long enabled_orders)
+{
+	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
+	int collapsed = 0, stack_size = 0;
+	unsigned long collapse_address;
+	struct mthp_range range;
+	u16 offset;
+	u8 order;
+
+	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
+
+	while (stack_size) {
+		range = collapse_mthp_stack_pop(cc, &stack_size);
+		order = range.order;
+		offset = range.offset;
+		nr_ptes = 1UL << order;
+
+		if (!test_bit(order, &enabled_orders))
+			goto next_order;
+
+		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+
+		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
+							       nr_ptes);
+
+		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
+			int ret;
+
+			collapse_address = address + offset * PAGE_SIZE;
+			ret = collapse_huge_page(mm, collapse_address, referenced,
+						 unmapped, cc, order);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += nr_ptes;
+				continue;
+			}
+		}
+
+next_order:
+		if ((BIT(order) - 1) & enabled_orders) {
+			const u8 next_order = order - 1;
+			const u16 mid_offset = offset + (nr_ptes / 2);
+
+			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
+						 next_order);
+			collapse_mthp_stack_push(cc, &stack_size, offset,
+						 next_order);
+		}
+	}
+	return collapsed;
+}
+
 static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		bool *lock_dropped, struct collapse_control *cc)
 {
-	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
 	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
 	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
+	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
+	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
 	pmd_t *pmd;
-	pte_t *pte, *_pte;
-	int none_or_zero = 0, shared = 0, referenced = 0;
+	pte_t *pte, *_pte, pteval;
+	int i;
+	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 
@@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
+	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
+
+	/*
+	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
+	 * scan all pages to populate the bitmap for mTHP collapse.
+	 */
+	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
+		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
+
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
 		cc->progress++;
@@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, addr += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		_pte = pte + i;
+		addr = start_addr + i * PAGE_SIZE;
+		pteval = ptep_get(_pte);
+
 		cc->progress++;
 
-		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 			}
 		}
 
+		/* Set bit for occupied pages */
+		__set_bit(i, cc->mthp_bitmap);
 		/*
 		 * Record which node the original page is from and save this
 		 * information to cc->node_load[].
@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 	if (result == SCAN_SUCCEED) {
 		/* collapse_huge_page expects the lock to be dropped before calling */
 		mmap_read_unlock(mm);
-		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc, HPAGE_PMD_ORDER);
-		/* collapse_huge_page will return with the mmap_lock released */
+		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
+					     unmapped, cc, enabled_orders);
+		/* mmap_lock was released above, set lock_dropped */
 		*lock_dropped = true;
+		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.54.0



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
@ 2026-05-25 14:15   ` Nico Pache
  2026-05-25 19:10     ` Andrew Morton
  2026-05-31  7:18   ` Lance Yang
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-25 14:15 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe


On 5/22/26 9:00 AM, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Previous to this patch, PMD collapse had 3 main phases, a light weight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> phase (mmap_write_lock).
>
> To enable mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. This also
> prevents a traditional recursive approach when the kernel stack struct is
> limited. The algorithm recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
>      1) push (0, HPAGE_PMD_ORDER) onto the stack
>      2) pop the stack
>      3) check if the number of set bits in that (offset,order) pair
>         satisfies the max_ptes_none threshold for that order
>      4) if yes, attempt collapse
>      5) if no (or collapse fails), push two new stack items representing
>         the left and right halves of the current bitmap range, at the
>         next lower order
>      6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
>                              offset   mid_offset
>                              |        |
>                              |        |
>                              v        v
>            ____________________________________
>           |          PTE Page Table            |
>           --------------------------------------
> 			    <-------><------->
>                               order-1  order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
>      - max_ptes_none=0: Never introduce new empty pages during collapse
>      - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>        available mTHP order
>
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
>
> Once we determine which mTHP sizes fit best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently mTHP collapse is not supported for madv_collapse, which will
> only attempt PMD collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 172 insertions(+), 9 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>   
>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>   
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>   struct collapse_control {
>   	bool is_khugepaged;
>   
> @@ -110,6 +134,12 @@ struct collapse_control {
>   
>   	/* nodemask for allocation fallback */
>   	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>   };
>   
>   /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>   	return result;
>   }
>   
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, int referenced, int unmapped,
> +		struct collapse_control *cc, unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if ((BIT(order) - 1) & enabled_orders) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}
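
For readers following the traversal, here is a rough userspace sketch of the stack-based walk in the quoted mthp_collapse(). It is not the kernel code: a plain bool array stands in for cc->mthp_bitmap, a range "collapses" only when every entry is occupied (the max_ptes_none=0 policy), and the names sketch_collapse, count_present and max_depth are made up for the illustration. It assumes 512 PTEs per PMD and masks off orders below 2 so the stack bound of height + 1 holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER	9	/* assumption: 512 PTEs per PMD */
#define NR_PTES		(1u << PMD_ORDER)
#define MIN_ORDER	2
#define STACK_SIZE	(PMD_ORDER - MIN_ORDER + 1)	/* height + 1 */

struct range {
	unsigned short offset;
	unsigned char order;
};

static bool occupied[NR_PTES];	/* stand-in for cc->mthp_bitmap */
static int max_depth;		/* deepest stack use observed so far */

/* Count occupied entries in [offset, offset + nr). */
static unsigned int count_present(unsigned int offset, unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = offset; i < offset + nr; i++)
		n += occupied[i];
	return n;
}

/*
 * Depth-first walk over the (offset, order) tree. Returns the number of
 * entries covered by successful "collapses".
 */
static unsigned int sketch_collapse(unsigned long enabled_orders)
{
	struct range stack[STACK_SIZE];
	unsigned int collapsed = 0;
	int size = 0;

	/* Sketch-only guard: never descend below MIN_ORDER. */
	enabled_orders &= ~((1ul << MIN_ORDER) - 1);

	stack[size++] = (struct range){ 0, PMD_ORDER };
	while (size) {
		struct range r = stack[--size];
		unsigned int nr = 1u << r.order;

		if ((enabled_orders & (1ul << r.order)) &&
		    count_present(r.offset, nr) == nr) {
			printf("collapse offset=%u order=%u\n",
			       (unsigned int)r.offset, (unsigned int)r.order);
			collapsed += nr;
			continue;
		}

		/* Split only if some lower order is still enabled. */
		if (((1ul << r.order) - 1) & enabled_orders) {
			assert(size + 2 <= STACK_SIZE);
			/* Push right half first so the left half pops first. */
			stack[size++] = (struct range){ r.offset + nr / 2, r.order - 1 };
			stack[size++] = (struct range){ r.offset, r.order - 1 };
			if (size > max_depth)
				max_depth = size;
		}
	}
	return collapsed;
}
```

With entries 0..259 marked occupied and orders 2 through 9 enabled (mask 0x3fc), the sketch collapses offset 0 at order 8 and offset 256 at order 2, covering 260 entries in total.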

Hi Andrew,

Can you please append the following fixup that reverts one of the
changes requested in V17. The issue with the change is described
below.


commit 1e099144dfcdd28e3b3b50b32535798db53866aa
Author: Nico Pache <npache@redhat.com>
Date:   Mon May 25 07:38:59 2026 -0600

     fixup: fix potential use-after-free of vma in mthp_collapse()

     Between V17 and v18, one reviewer (Wei) brought up that we are not doing
     the uffd-armed check until deep in the collapse operation. While not
     functionally incorrect, it can lead to unnecessary work.

     We optimized this by passing the vma variable to mthp_collapse() and using
     the collapse_max_ptes_none() function to check the state of uffd-armed
     preventing the wasted work later in the collapse.

     mthp_collapse() is called after mmap_read_unlock(), so the vma pointer
     can become stale. Remove the vma parameter and pass NULL to
     collapse_max_ptes_none() instead.

     Signed-off-by: Nico Pache <npache@redhat.com>

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3d7db8be26c..a901db5c9201 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
   * If a collapse is permitted, we attempt to collapse the PTE range into a
   * mTHP.
   */
-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
-        unsigned long address, int referenced, int unmapped,
-        struct collapse_control *cc, unsigned long enabled_orders)
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+        int referenced, int unmapped, struct collapse_control *cc,
+        unsigned long enabled_orders)
  {
      unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
      int collapsed = 0, stack_size = 0;
@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
          if (!test_bit(order, &enabled_orders))
              goto next_order;

-        max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+        max_ptes_none = collapse_max_ptes_none(cc, NULL, order);

          nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
                                     nr_ptes);
@@ -1749,7 +1749,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
      if (result == SCAN_SUCCEED) {
          /* collapse_huge_page expects the lock to be dropped before calling */
          mmap_read_unlock(mm);
-        nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
+        nr_collapsed = mthp_collapse(mm, start_addr, referenced,
                           unmapped, cc, enabled_orders);
          /* mmap_lock was released above, set lock_dropped */
          *lock_dropped = true;


> +
>   static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   		struct vm_area_struct *vma, unsigned long start_addr,
>   		bool *lock_dropped, struct collapse_control *cc)
>   {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>   	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>   	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>   	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>   	enum scan_result result = SCAN_FAIL;
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	unsigned long addr;
> +	unsigned long enabled_orders;
>   	spinlock_t *ptl;
>   	int node = NUMA_NO_NODE, unmapped = 0;
>   
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>   	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
>   	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>   	if (!pte) {
>   		cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>   		cc->progress++;
>   
> -		pte_t pteval = ptep_get(_pte);
>   		if (pte_none_or_zero(pteval)) {
>   			if (++none_or_zero > max_ptes_none) {
>   				result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   			}
>   		}
>   
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>   		/*
>   		 * Record which node the original page is from and save this
>   		 * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   	if (result == SCAN_SUCCEED) {
>   		/* collapse_huge_page expects the lock to be dropped before calling */
>   		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> -		/* collapse_huge_page will return with the mmap_lock released */
> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> +					     unmapped, cc, enabled_orders);
> +		/* mmap_lock was released above, set lock_dropped */
>   		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>   	}
>   out:
>   	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-25 14:15   ` Nico Pache
@ 2026-05-25 19:10     ` Andrew Morton
  2026-05-26  6:57       ` Wei Yang
  0 siblings, 1 reply; 114+ messages in thread
From: Andrew Morton @ 2026-05-25 19:10 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:

> Can you please append the following fixup that reverts one of the
> changes requested in V17. The issue with the change is described
> below.

OK.  fyi, what I received was badly mangled: wordwrapping, tabs messed
up, etc.

Here's my reconstruction:


Author: Nico Pache <npache@redhat.com>
Subject: fix potential use-after-free of vma in mthp_collapse()
Date: Mon May 25 07:38:59 2026 -0600

Between V17 and v18, one reviewer (Wei) brought up that we are not doing
the uffd-armed check until deep in the collapse operation.  While not
functionally incorrect, it can lead to unnecessary work.

We optimized this by passing the vma variable to mthp_collapse() and using
the collapse_max_ptes_none() function to check the state of uffd-armed
preventing the wasted work later in the collapse.

mthp_collapse() is called after mmap_read_unlock(), so the vma pointer can
become stale.  Remove the vma parameter and pass NULL to
collapse_max_ptes_none() instead.

Link: https://lore.kernel.org/2b2cda8c-358a-4a5c-989c-ae42593ef2ea@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
...

 mm/khugepaged.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/mm/khugepaged.c~mm-khugepaged-introduce-mthp-collapse-support-fix
+++ a/mm/khugepaged.c
@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_
  * If a collapse is permitted, we attempt to collapse the PTE range into a
  * mTHP.
  */
-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, int referenced, int unmapped,
-		struct collapse_control *cc, unsigned long enabled_orders)
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		unsigned long enabled_orders)
 {
 	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
 	int collapsed = 0, stack_size = 0;
@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struc
 		if (!test_bit(order, &enabled_orders))
 			goto next_order;
 
-		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
 
 		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
 							       nr_ptes);
@@ -1749,7 +1749,7 @@ out_unmap:
 	if (result == SCAN_SUCCEED) {
 		/* collapse_huge_page expects the lock to be dropped before calling */
 		mmap_read_unlock(mm);
-		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
+		nr_collapsed = mthp_collapse(mm, start_addr, referenced,
 					     unmapped, cc, enabled_orders);
 		/* mmap_lock was released above, set lock_dropped */
 		*lock_dropped = true;
_



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-25 19:10     ` Andrew Morton
@ 2026-05-26  6:57       ` Wei Yang
  2026-05-26 12:07         ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Wei Yang @ 2026-05-26  6:57 UTC (permalink / raw)
  To: Nico Pache, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
>On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> Can you please append the following fixup that reverts one of the
>> changes requested in V17. The issue with the change is described
>> below.
>
>OK.  fyi, what I received was badly mangled: wordwrapping, tabs messed
>up, etc.
>
>Here's my reconstruction:
>

Hi, Nico

I tried to reply to your mail, but found it has some encoding problems, so
I am replying here.

>
>Author: Nico Pache <npache@redhat.com>
>Subject: fix potential use-after-free of vma in mthp_collapse()
>Date: Mon May 25 07:38:59 2026 -0600
>
>Between V17 and v18, one reviewer (Wei) brought up that we are not doing
>the uffd-armed check until deep in the collapse operation.  While not
>functionally incorrect, it can lead to unnecessary work.

So we have decided to tolerate the behavioral change?

>
>We optimized this by passing the vma variable to mthp_collapse() and using
>the collapse_max_ptes_none() function to check the state of uffd-armed
>preventing the wasted work later in the collapse.
>
>mthp_collapse() is called after mmap_read_unlock(), so the vma pointer can
>become stale.  Remove the vma parameter and pass NULL to
>collapse_max_ptes_none() instead.
>
>Link: https://lore.kernel.org/2b2cda8c-358a-4a5c-989c-ae42593ef2ea@redhat.com
>Signed-off-by: Nico Pache <npache@redhat.com>
>...
>
> mm/khugepaged.c |   10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
>--- a/mm/khugepaged.c~mm-khugepaged-introduce-mthp-collapse-support-fix
>+++ a/mm/khugepaged.c
>@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_
>  * If a collapse is permitted, we attempt to collapse the PTE range into a
>  * mTHP.
>  */
>-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>-		unsigned long address, int referenced, int unmapped,
>-		struct collapse_control *cc, unsigned long enabled_orders)
>+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned long enabled_orders)
> {
> 	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> 	int collapsed = 0, stack_size = 0;
>@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struc
> 		if (!test_bit(order, &enabled_orders))
> 			goto next_order;
> 
>-		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>+		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> 
> 		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> 							       nr_ptes);
>@@ -1749,7 +1749,7 @@ out_unmap:
> 	if (result == SCAN_SUCCEED) {
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
>-		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>+		nr_collapsed = mthp_collapse(mm, start_addr, referenced,
> 					     unmapped, cc, enabled_orders);
> 		/* mmap_lock was released above, set lock_dropped */
> 		*lock_dropped = true;
>_

-- 
Wei Yang
Help you, Help me


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-26  6:57       ` Wei Yang
@ 2026-05-26 12:07         ` Nico Pache
  2026-05-28  8:42           ` Wei Yang
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-26 12:07 UTC (permalink / raw)
  To: Wei Yang
  Cc: Andrew Morton, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
	baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
	david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
	jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Tue, May 26, 2026 at 12:57 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
> >On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> >> Can you please append the following fixup that reverts one of the
> >> changes requested in V17. The issue with the change is described
> >> below.
> >
> >OK.  fyi, what I received was badly mangled: wordwrapping, tabs messed
> >up, etc.
> >
> >Here's my reconstruction:
> >
>
> Hi, Nico
>
> I tried to reply your mail, but found it has some encoding problem, so reply
> here.

Yeah, sorry, I didn't properly configure my email client after getting a
new laptop.

>
> >
> >Author: Nico Pache <npache@redhat.com>
> >Subject: fix potential use-after-free of vma in mthp_collapse()
> >Date: Mon May 25 07:38:59 2026 -0600
> >
> >Between V17 and v18, one reviewer (Wei) brought up that we are not doing
> >the uffd-armed check until deep in the collapse operation.  While not
> >functionally incorrect, it can lead to unnecessary work.
>
> So we decide to tolerate the behavioral change?

Yes, I believe it is ok for now. Either way we needed to remove the
potential UAF. It only affects the behavior if mTHP is enabled, so the
legacy behavior is kept. And the uffd case is limited.

My future work involves further optimizing and cleaning up khugepaged.
I'll make this part of the goal too. My first thought is to do the
revalidation at every order (between the locks dropping); but that
essentially pays the same penalty... I can't think of a clean solution
at the moment.

Does that sound ok?

Cheers,
-- Nico



>
> >
> >We optimized this by passing the vma variable to mthp_collapse() and using
> >the collapse_max_ptes_none() function to check the state of uffd-armed
> >preventing the wasted work later in the collapse.
> >
> >mthp_collapse() is called after mmap_read_unlock(), so the vma pointer can
> >become stale.  Remove the vma parameter and pass NULL to
> >collapse_max_ptes_none() instead.
> >
> >Link: https://lore.kernel.org/2b2cda8c-358a-4a5c-989c-ae42593ef2ea@redhat.com
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >...
> >
> > mm/khugepaged.c |   10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> >--- a/mm/khugepaged.c~mm-khugepaged-introduce-mthp-collapse-support-fix
> >+++ a/mm/khugepaged.c
> >@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_
> >  * If a collapse is permitted, we attempt to collapse the PTE range into a
> >  * mTHP.
> >  */
> >-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> >-              unsigned long address, int referenced, int unmapped,
> >-              struct collapse_control *cc, unsigned long enabled_orders)
> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >+              int referenced, int unmapped, struct collapse_control *cc,
> >+              unsigned long enabled_orders)
> > {
> >       unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> >       int collapsed = 0, stack_size = 0;
> >@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struc
> >               if (!test_bit(order, &enabled_orders))
> >                       goto next_order;
> >
> >-              max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> >+              max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >
> >               nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >                                                              nr_ptes);
> >@@ -1749,7 +1749,7 @@ out_unmap:
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >-              nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >+              nr_collapsed = mthp_collapse(mm, start_addr, referenced,
> >                                            unmapped, cc, enabled_orders);
> >               /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> >_
>
> --
> Wei Yang
> Help you, Help me
>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-26 12:07         ` Nico Pache
@ 2026-05-28  8:42           ` Wei Yang
  2026-05-28 17:11             ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Wei Yang @ 2026-05-28  8:42 UTC (permalink / raw)
  To: Nico Pache
  Cc: Wei Yang, Andrew Morton, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
	baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
	david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
	jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Tue, May 26, 2026 at 06:07:38AM -0600, Nico Pache wrote:
>On Tue, May 26, 2026 at 12:57 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
>> >On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
>> >
>> >> Can you please append the following fixup that reverts one of the
>> >> changes requested in V17. The issue with the change is described
>> >> below.
>> >
>> >OK.  fyi, what I received was badly mangled: wordwrapping, tabs messed
>> >up, etc.
>> >
>> >Here's my reconstruction:
>> >
>>
>> Hi, Nico
>>
>> I tried to reply your mail, but found it has some encoding problem, so reply
>> here.
>
>Yeah sorry I didnt properly configure my email client after getting a
>new laptop.
>
>>
>> >
>> >Author: Nico Pache <npache@redhat.com>
>> >Subject: fix potential use-after-free of vma in mthp_collapse()
>> >Date: Mon May 25 07:38:59 2026 -0600
>> >
>> >Between V17 and v18, one reviewer (Wei) brought up that we are not doing
>> >the uffd-armed check until deep in the collapse operation.  While not
>> >functionally incorrect, it can lead to unnecessary work.
>>
>> So we decide to tolerate the behavioral change?
>
>Yes, I believe it is ok for now. Either way we needed to remove the
>potential UAF. It only affects the behavior if mTHP is enabled, so the
>legacy behavior is kept. And the uffd case is limited.
>
>My future work involves further optimizing and cleaning up khugepaged.
>I'll make this part of the goal too. My first thought is to do the
>revalidation at every order (between the locks dropping); but that
>essentially pays the same penalty... I can't think of a clean solution
>at the moment.

One way that comes to mind is to add a @was_uffd_armed field to
collapse_control and update it in hugepage_vma_revalidate() when the latest
vma is retrieved.

Still not elegant enough.
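
A rough userspace illustration of that idea, purely as a sketch: the structs and helpers below (vma_sketch, cc_sketch, revalidate_sketch, max_ptes_none_sketch) are hypothetical stand-ins, not kernel definitions. It also assumes that an armed uffd means no none/zero PTEs may be filled, i.e. an effective max_ptes_none of 0. The point is only that the flag is copied while the VMA is known to be valid, so later code never dereferences a possibly stale vma pointer.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the VMA state we care about. */
struct vma_sketch {
	bool uffd_armed;
};

/* Hypothetical stand-in for collapse_control with the proposed field. */
struct cc_sketch {
	bool was_uffd_armed;
};

/*
 * Would run where the VMA has just been looked up under the lock:
 * refresh the cached copy of the flag.
 */
static void revalidate_sketch(struct cc_sketch *cc,
			      const struct vma_sketch *vma)
{
	cc->was_uffd_armed = vma->uffd_armed;
}

/*
 * Would run after the lock is dropped: consults only the cached copy.
 * Assumption: armed uffd forces the limit to 0.
 */
static unsigned int max_ptes_none_sketch(const struct cc_sketch *cc,
					 unsigned int configured)
{
	return cc->was_uffd_armed ? 0 : configured;
}
```

The cached value can still be out of date by the time it is used, so a later revalidation would remain the final check.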

>
>Does that sound ok?
>

Not sure. I can't imagine the impact it would have.

>Cheers,
>-- Nico


-- 
Wei Yang
Help you, Help me


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-28  8:42           ` Wei Yang
@ 2026-05-28 17:11             ` Nico Pache
  0 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-28 17:11 UTC (permalink / raw)
  To: Wei Yang
  Cc: Andrew Morton, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
	baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
	david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
	jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Thu, May 28, 2026 at 2:42 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, May 26, 2026 at 06:07:38AM -0600, Nico Pache wrote:
> >On Tue, May 26, 2026 at 12:57 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
> >> >On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
> >> >
> >> >> Can you please append the following fixup that reverts one of the
> >> >> changes requested in V17. The issue with the change is described
> >> >> below.
> >> >
> >> >OK.  fyi, what I received was badly mangled: wordwrapping, tabs messed
> >> >up, etc.
> >> >
> >> >Here's my reconstruction:
> >> >
> >>
> >> Hi, Nico
> >>
> >> I tried to reply your mail, but found it has some encoding problem, so reply
> >> here.
> >
> >Yeah sorry I didnt properly configure my email client after getting a
> >new laptop.
> >
> >>
> >> >
> >> >Author: Nico Pache <npache@redhat.com>
> >> >Subject: fix potential use-after-free of vma in mthp_collapse()
> >> >Date: Mon May 25 07:38:59 2026 -0600
> >> >
> >> >Between V17 and v18, one reviewer (Wei) brought up that we are not doing
> >> >the uffd-armed check until deep in the collapse operation.  While not
> >> >functionally incorrect, it can lead to unnecessary work.
> >>
> >> So we decide to tolerate the behavioral change?
> >
> >Yes, I believe it is ok for now. Either way we needed to remove the
> >potential UAF. It only affects the behavior if mTHP is enabled, so the
> >legacy behavior is kept. And the uffd case is limited.
> >
> >My future work involves further optimizing and cleaning up khugepaged.
> >I'll make this part of the goal too. My first thought is to do the
> >revalidation at every order (between the locks dropping); but that
> >essentially pays the same penalty... I can't think of a clean solution
> >at the moment.
>
> One way come into my mind is add a @was_uffd_armed field in collapse_control
> and updates it in hugepage_vma_revalidate() when latest vma is retrieved.
>
> Still not elegant enough.

So our issue is that userfaultfd_armed is at the VMA granularity.
Ideally we want PMD/PTE granularity, but we only have that for wp. I'm
still investigating all the nuances of uffd and its interactions
with khugepaged (something I've been meaning to understand more of
anyway). But from what I understand so far, we can actually use the
bitmap and was_uffd_armed to optimize this further. It solves the
issue and has a rather small race window, which can just be handled by
the revalidation later on, probably eliminating most of the potential
cases.

IIUC, filling a region with previously empty/zero pages is only an
issue for MODE_MISSING and MODE_WP with WP_UNPOPULATED set as well. I
have a work in progress commit to improve all this uffd handling.

I think what I have is a good middle ground. It improves the current
functionality and closes the gap we have with the new mthp_collapse(),
giving us the best of both worlds. If the race window is hit, we will pay the
penalty, but that should be greatly reduced. I will send out an RFC
for this targeting mm-new once I have everything verified and cleaned
up :)

Cheers,
-- Nico



>
> >
> >Does that sound ok?
> >
>
> Not sure. I can't imagine the impact it would have.
>
> >Cheers,
> >-- Nico
>
>
> --
> Wei Yang
> Help you, Help me
>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
  2026-05-25 14:15   ` Nico Pache
@ 2026-05-31  7:18   ` Lance Yang
  2026-05-31  8:48     ` Lance Yang
  2026-06-02 10:58     ` Nico Pache
  2026-06-01  8:11   ` David Hildenbrand (Arm)
  2026-06-04 14:45   ` Lorenzo Stoakes
  3 siblings, 2 replies; 114+ messages in thread
From: Lance Yang @ 2026-05-31  7:18 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe


On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
[...]
>@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 	if (result == SCAN_SUCCEED) {
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
>-		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc, HPAGE_PMD_ORDER);
>-		/* collapse_huge_page will return with the mmap_lock released */
>+		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>+					     unmapped, cc, enabled_orders);
>+		/* mmap_lock was released above, set lock_dropped */
> 		*lock_dropped = true;
>+		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;

Hmm ... don't we lose the allocation-failure result here?

Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
in khugepaged_do_scan().

Now if allocation fails and nr_collapsed stays 0, we just return
SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Cheers, Lance


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-31  7:18   ` Lance Yang
@ 2026-05-31  8:48     ` Lance Yang
  2026-06-01 12:01       ` Nico Pache
  2026-06-02 10:58     ` Nico Pache
  1 sibling, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-05-31  8:48 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 2026/5/31 15:18, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
>> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> 	if (result == SCAN_SUCCEED) {
>> 		/* collapse_huge_page expects the lock to be dropped before calling */
>> 		mmap_read_unlock(mm);
>> -		result = collapse_huge_page(mm, start_addr, referenced,
>> -					    unmapped, cc, HPAGE_PMD_ORDER);
>> -		/* collapse_huge_page will return with the mmap_lock released */
>> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>> +					     unmapped, cc, enabled_orders);
>> +		/* mmap_lock was released above, set lock_dropped */
>> 		*lock_dropped = true;
>> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> 
> Hmm ... don't we lose the allocation-failure result here?
> 
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
> 
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Looks like this is a more general issue with mthp_collapse() only
returning nr_collapsed.

For example, SCAN_PMD_MAPPED used to be propagated too, and
madvise_collapse() treats that as success. With the new code, if
nothing was collapsed by this call, that can also become SCAN_FAIL ...

So I think we should keep both.

Cheers, Lance



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-31  8:48     ` Lance Yang
@ 2026-06-01 12:01       ` Nico Pache
  2026-06-01 12:06         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-01 12:01 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Sun, May 31, 2026 at 2:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/5/31 15:18, Lance Yang wrote:
> >
> > On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> > [...]
> >> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >>      if (result == SCAN_SUCCEED) {
> >>              /* collapse_huge_page expects the lock to be dropped before calling */
> >>              mmap_read_unlock(mm);
> >> -            result = collapse_huge_page(mm, start_addr, referenced,
> >> -                                        unmapped, cc, HPAGE_PMD_ORDER);
> >> -            /* collapse_huge_page will return with the mmap_lock released */
> >> +            nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >> +                                         unmapped, cc, enabled_orders);
> >> +            /* mmap_lock was released above, set lock_dropped */
> >>              *lock_dropped = true;
> >> +            result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> >
> > Hmm ... don't we lose the allocation-failure result here?
> >
> > Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> > collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> > in khugepaged_do_scan().
> >
> > Now if allocation fails and nr_collapsed stays 0, we just return
> > SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
>
> Looks like this is a more general issue with mthp_collapse() only
> returning nr_collapsed.
>
> For example, SCAN_PMD_MAPPED used to be propagated too, and
> madvise_collapse() treats that as success. With the new code, if
> nothing was collapsed by this call, that can also become SCAN_FAIL ...
>
> So I think we should keep both.

Yeah I thought about this before, but more regarding the "incorrect"
propagation of errors; I didn't consider that those results were
actually being considered.

I actually had a patch to track the last_failure (with some
prioritization on certain results). I think that would solve this
issue.

Thanks for reminding me to improve this.

Depending on how the rest of the reviews go, I can either send up a
follow up series to do some more cleanups and improvements of the
current approach or we can send out a v19.

Cheers,
-- Nico

>
> Cheers, Lance
>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-01 12:01       ` Nico Pache
@ 2026-06-01 12:06         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 12:06 UTC (permalink / raw)
  To: Nico Pache, Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On 6/1/26 14:01, Nico Pache wrote:
> On Sun, May 31, 2026 at 2:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2026/5/31 15:18, Lance Yang wrote:
>>>
>>> [...]
>>>
>>> Hmm ... don't we lose the allocation-failure result here?
>>>
>>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
>>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
>>> in khugepaged_do_scan().
>>>
>>> Now if allocation fails and nr_collapsed stays 0, we just return
>>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
>>
>> Looks like this is a more general issue with mthp_collapse() only
>> returning nr_collapsed.
>>
>> For example, SCAN_PMD_MAPPED used to be propagated too, and
>> madvise_collapse() treats that as success. With the new code, if
>> nothing was collapsed by this call, that can also become SCAN_FAIL ...
>>
>> So I think we should keep both.
> 
> Yeah I thought about this before, but more regarding the "incorrect"
> propagation of errors; I didn't consider that those results were
> actually being considered.
> 
> I actually had a patch to track the last_failure (with some
> prioritization on certain results). I think that would solve this
> issue.
> 
> Thanks for reminding me to improve this.
> 
> Depending on how the rest of the reviews go, I can either send up a
> follow up series to do some more cleanups and improvements of the
> current approach or we can send out a v19.

Let's do a v19.

Patch #11 might need a bit of work, we can discuss offline if you want.

If we could get a v19 by the end of the week, that would be nice.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-31  7:18   ` Lance Yang
  2026-05-31  8:48     ` Lance Yang
@ 2026-06-02 10:58     ` Nico Pache
  2026-06-02 15:44       ` Lance Yang
  1 sibling, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-02 10:58 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
> >@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >-              result = collapse_huge_page(mm, start_addr, referenced,
> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
> >-              /* collapse_huge_page will return with the mmap_lock released */
> >+              nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >+                                           unmapped, cc, enabled_orders);
> >+              /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> Hmm ... don't we lose the allocation-failure result here?
>
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
>
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Ok I did the error propagation! I think I handled both of these cases
you brought up pretty easily.

However I don't know what to do in the following case: We successfully
collapsed some portion of the PMD, but during that process, we also
hit an allocation failure. Is it best to back off entirely? Or can we
treat some forward progress as a sign we can continue trying collapses
without sleeping?

Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
successful collapses as the returned value?

This is what I currently have:
done:
    if (collapsed)
        return SCAN_SUCCEED;
    if (alloc_failed)
        return SCAN_ALLOC_HUGE_PAGE_FAIL;
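
Spelled out as a self-contained userspace sketch of the ordering above. The
helper name mthp_collapse_result(), its two parameters, and the reduced
three-value enum are stand-ins invented for illustration, not the code in the
series; the final SCAN_FAIL fallthrough (nothing collapsed, no allocation
failure) is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the scan_result values named in this thread. */
enum scan_result {
	SCAN_FAIL,
	SCAN_SUCCEED,
	SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/*
 * Hypothetical helper: fold the outcome of every collapse attempt made
 * within one PMD range into a single result for the caller.
 */
static enum scan_result mthp_collapse_result(unsigned int nr_collapsed,
					     bool alloc_failed)
{
	/* Any forward progress is reported as success. */
	if (nr_collapsed)
		return SCAN_SUCCEED;
	/* No progress and an allocation failed: let the caller back off. */
	if (alloc_failed)
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	/* Assumed default when nothing was collapsed for any other reason. */
	return SCAN_FAIL;
}
```

With this ordering, the khugepaged_alloc_sleep() backoff should only be
reached for a PMD range in which no collapse succeeded at all.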

Thanks,
-- Nico

>
> Cheers, Lance
>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-02 10:58     ` Nico Pache
@ 2026-06-02 15:44       ` Lance Yang
  2026-06-03  8:05         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-06-02 15:44 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe



On 2026/6/2 18:58, Nico Pache wrote:
> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
>> [...]
>>> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>>        if (result == SCAN_SUCCEED) {
>>>                /* collapse_huge_page expects the lock to be dropped before calling */
>>>                mmap_read_unlock(mm);
>>> -              result = collapse_huge_page(mm, start_addr, referenced,
>>> -                                          unmapped, cc, HPAGE_PMD_ORDER);
>>> -              /* collapse_huge_page will return with the mmap_lock released */
>>> +              nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>>> +                                           unmapped, cc, enabled_orders);
>>> +              /* mmap_lock was released above, set lock_dropped */
>>>                *lock_dropped = true;
>>> +              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>>
>> Hmm ... don't we lose the allocation-failure result here?
>>
>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
>> in khugepaged_do_scan().
>>
>> Now if allocation fails and nr_collapsed stays 0, we just return
>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
> 
> Ok I did the error propagation! I think I handled both of these cases
> you brought up pretty easily.

Thanks.

> However I don't know what to do in the following case: We successfully
> collapsed some portion of the PMD, but during that process, we also
> hit an allocation failure. Is it best to back off entirely? Or can we
> treat some forward progress as a sign we can continue trying collapses
> without sleeping?
> 
> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
> successful collapses as the returned value?

Thinking out loud, forward progress should win here, the allocation
failure only matters if we made no progress at all?

> This is what I currently have:
> done:
>      if (collapsed)
>          return SCAN_SUCCEED;
>      if (alloc_failed)
>          return SCAN_ALLOC_HUGE_PAGE_FAIL;

I'd go with this ordering :)

Cheers, Lance


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-02 15:44       ` Lance Yang
@ 2026-06-03  8:05         ` David Hildenbrand (Arm)
  2026-06-04 14:40           ` Lorenzo Stoakes
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-03  8:05 UTC (permalink / raw)
  To: Lance Yang, Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On 6/2/26 17:44, Lance Yang wrote:
> 
> 
> On 2026/6/2 18:58, Nico Pache wrote:
>> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>>>
>>>
>>> [...]
>>>
>>> Hmm ... don't we lose the allocation-failure result here?
>>>
>>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
>>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
>>> in khugepaged_do_scan().
>>>
>>> Now if allocation fails and nr_collapsed stays 0, we just return
>>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
>>
>> Ok I did the error propagation! I think I handled both of these cases
>> you brought up pretty easily.
> 
> Thanks.
> 
>> However I don't know what to do in the following case: We successfully
>> collapsed some portion of the PMD, but during that process, we also
>> hit an allocation failure. Is it best to back off entirely? Or can we
>> treat some forward progress as a sign we can continue trying collapses
>> without sleeping?
>>
>> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
>> successful collapses as the returned value?
> 
> Thinking out loud, forward progress should win here, the allocation
> failure only matters if we made no progress at all?

Agreed, in the first approach, forward progress makes sense.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-03  8:05         ` David Hildenbrand (Arm)
@ 2026-06-04 14:40           ` Lorenzo Stoakes
  0 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 14:40 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, Nico Pache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Wed, Jun 03, 2026 at 10:05:08AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 17:44, Lance Yang wrote:
> >
> >
> > On 2026/6/2 18:58, Nico Pache wrote:
> >> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
> >>>
> >>>
> >>> [...]
> >>>
> >>> Hmm ... don't we lose the allocation-failure result here?
> >>>
> >>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> >>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> >>> in khugepaged_do_scan().
> >>>
> >>> Now if allocation fails and nr_collapsed stays 0, we just return
> >>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
> >>
> >> Ok I did the error propagation! I think I handled both of these cases
> >> you brought up pretty easily.
> >
> > Thanks.
> >
> >> However I don't know what to do in the following case: We successfully
> >> collapsed some portion of the PMD, but during that process, we also
> >> hit an allocation failure. Is it best to back off entirely? Or can we
> >> treat some forward progress as a sign we can continue trying collapses
> >> without sleeping?
> >>
> >> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
> >> successful collapses as the returned value?
> >
> > Thinking out loud, forward progress should win here, the allocation
> > failure only matters if we made no progress at all?
>
> Agreed, in the first approach, forward progress makes sense.

Sounds sensible to me.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
  2026-05-25 14:15   ` Nico Pache
  2026-05-31  7:18   ` Lance Yang
@ 2026-06-01  8:11   ` David Hildenbrand (Arm)
  2026-06-01 12:40     ` Nico Pache
  2026-06-04 13:53     ` Lorenzo Stoakes
  2026-06-04 14:45   ` Lorenzo Stoakes
  3 siblings, 2 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01  8:11 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 5/22/26 17:00, Nico Pache wrote:

Finally time for the core piece :)

> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
> 
> Prior to this patch, PMD collapse had 3 main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> phase (mmap_write_lock).
> 
> To enable mTHP collapse we make the following changes:
> 
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
> 
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search, which
> avoids deep recursion on the limited kernel stack. The algorithm
> recursively splits the bitmap into smaller chunks to find the highest
> order mTHPs that satisfy the collapse criteria. We start by attempting the
> PMD order, then move on to consecutively lower orders (mTHP collapse).
> The stack maintains a pair of variables (offset, order), indicating the
> number of PTEs from the start of the PMD, and the order of the potential
> collapse candidate.
> 
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        satisfies the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
> 
> Below is a diagram representing the algorithm and stack items:
> 
>                             offset   mid_offset
>                             |        |
>                             |        |
>                             v        v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
> 			    <-------><------->
>                              order-1  order-1
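
For reference while reading the rest, the six steps above can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions,
not the patch's mthp_collapse(): it assumes 512 PTEs per PMD, all orders
enabled, max_ptes_none=0 (a range is eligible only if every PTE is present),
and that an eligible collapse always succeeds. struct range, count_present()
and scan_ranges() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PMD_ORDER	9	/* assumption: 512 PTEs per PMD */
#define NR_PTES		(1u << PMD_ORDER)
#define MIN_ORDER	2	/* KHUGEPAGED_MIN_MTHP_ORDER in the patch */
#define STACK_SIZE	(PMD_ORDER - MIN_ORDER + 1)

struct range {
	unsigned short offset;	/* first PTE index within the PMD range */
	unsigned char order;	/* log2 of the number of PTEs covered */
};

/* Count present PTEs in [offset, offset + nr); one flag per PTE. */
static unsigned int count_present(const bool *present, unsigned int offset,
				  unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = offset; i < offset + nr; i++)
		n += present[i];
	return n;
}

/*
 * Walk the PMD range top-down. Fills out[] (at least NR_PTES >> MIN_ORDER
 * entries) with the ranges that would be collapsed, returns their count,
 * and reports the deepest stack usage through *max_depth.
 */
static size_t scan_ranges(const bool *present, struct range *out,
			  unsigned int *max_depth)
{
	struct range stack[STACK_SIZE];
	unsigned int top = 0;
	size_t nr_out = 0;

	stack[top++] = (struct range){ 0, PMD_ORDER };		/* step 1 */
	*max_depth = top;
	while (top) {						/* step 6 */
		struct range r = stack[--top];			/* step 2 */
		unsigned int nr = 1u << r.order;

		/* steps 3-4: with max_ptes_none=0 every PTE must be present */
		if (count_present(present, r.offset, nr) == nr) {
			out[nr_out++] = r;
			continue;
		}
		if (r.order == MIN_ORDER)
			continue;	/* nothing smaller to try */
		/* step 5: push the right half first so the left pops first */
		assert(top + 2 <= STACK_SIZE);
		stack[top++] = (struct range){ r.offset + nr / 2, r.order - 1 };
		stack[top++] = (struct range){ r.offset, r.order - 1 };
		if (top > *max_depth)
			*max_depth = top;
	}
	return nr_out;
}
```

In this model the stack never holds more than PMD_ORDER - MIN_ORDER + 1
entries, which is the height + 1 bound the patch comment gives for
MTHP_STACK_SIZE.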


Reading this, it is unclear why exactly we need the stack.

Why can't you work with offset + cur_order?

Initially,

	offset = 0;
	cur_order = HPAGE_PMD_ORDER;

If collapse succeeded, advance to next range.
If collapse failed, try next smaller order, keeping offset unchanged.

	if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
		/* Try next smaller order. */
		cur_order = cur_order - 1;
	} else {
		/* Skip to next chunk. */
		offset += 1 << cur_order;
		cur_order = max_order_from_offset(offset);
	}

Of course, handling disabled orders. max_order_from_offset() is rather trivial
(natural buddy order, capped at HPAGE_PMD_ORDER).
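
To make that concrete, here is a rough userspace model of the loop. It is a
sketch only: it assumes 512 PTEs per PMD, all orders enabled, max_ptes_none=0
(a range is eligible only if fully populated) and that an eligible collapse
always succeeds. struct range and walk_ranges() are made-up names, and this
max_order_from_offset() is just one possible implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PMD_ORDER	9	/* assumption: 512 PTEs per PMD */
#define NR_PTES		(1u << PMD_ORDER)
#define MIN_ORDER	2	/* KHUGEPAGED_MIN_MTHP_ORDER */

struct range {
	unsigned short offset;	/* first PTE index within the PMD range */
	unsigned char order;	/* log2 of the number of PTEs covered */
};

/* Natural buddy order of @offset (its trailing zero bits), capped at PMD_ORDER. */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order = 0;

	if (!offset)
		return PMD_ORDER;
	while (!(offset & 1)) {
		offset >>= 1;
		order++;
	}
	return order < PMD_ORDER ? order : PMD_ORDER;
}

/*
 * Walk the PMD range with only (offset, cur_order) as state. Fills out[]
 * (at least NR_PTES >> MIN_ORDER entries) with the ranges that would be
 * collapsed and returns how many there are.
 */
static size_t walk_ranges(const bool *present, struct range *out)
{
	unsigned int offset = 0, cur_order = PMD_ORDER;
	size_t nr_out = 0;

	while (offset < NR_PTES) {
		unsigned int i, nr = 1u << cur_order, n = 0;
		bool failed;

		for (i = offset; i < offset + nr; i++)
			n += present[i];
		failed = n != nr;	/* max_ptes_none=0 model */
		if (!failed)
			out[nr_out++] = (struct range){ offset, cur_order };

		if (failed && cur_order > MIN_ORDER) {
			/* Try next smaller order at the same offset. */
			cur_order--;
		} else {
			/* Skip to next chunk at its natural order. */
			offset += nr;
			cur_order = max_order_from_offset(offset);
		}
	}
	return nr_out;
}
```

Because offset only ever advances by a multiple of 1 << MIN_ORDER, the
natural order computed for the next chunk never drops below MIN_ORDER, and
offset + (1 << cur_order) never runs past the end of the PMD range.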

What's the benefit of the stack?

> 
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
> 
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
> 
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
> 
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
> 
> Once we determine which mTHP sizes fit best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> 
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
> 
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 172 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>  
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.

I was confused there for a second why you mention ilog2, when it's really "We
cannot use HPAGE_PMD_ORDER.".

Best to simplify to:

"Note that we cannot use HPAGE_PMD_ORDER, because it is variable on some
architectures".

> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> @@ -110,6 +134,12 @@ struct collapse_control {
>  
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);

This should just be called something like "present_ptes"

> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];

This is really just a temporary bitmap used for collapse_mthp_count_present()
only. Either rename it, or better, avoid it completely.

>  };
>  
>  /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>  
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);

You really just want to count the number of set bits? You don't need a temporary
bitmap for that.

Assume you want to check an order-2 (4 bits), bitmap_weight_and() would check
all bits ...

I'd suggest starting simple here, and avoiding the temporary bitmap.

Can we simply use bitmap_weight_from(cc->mthp_bitmap, offset, nr_ptes)?
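
To illustrate the point, here is a minimal userspace sketch (not kernel code) that counts the set bits of only the range of interest, with no temporary mask. It assumes 512 PTEs per page table; count_present() and set_present() are made-up helpers for illustration and say nothing about the real bitmap_weight_from() signature.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NR_PTES		512	/* assumption: 4K pages, 512 PTEs per table */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS	((NR_PTES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical helper: mark PTE index i as present. */
static void set_present(unsigned long *bitmap, unsigned int i)
{
	assert(i < NR_PTES);
	bitmap[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

/*
 * Hypothetical helper: number of set bits in [offset, offset + nr).
 * Only the nr bits of the candidate range are inspected, so an order-2
 * check touches 4 bits rather than the whole 512-bit bitmap.
 */
static unsigned int count_present(const unsigned long *bitmap,
				  unsigned int offset, unsigned int nr)
{
	unsigned int i, weight = 0;

	assert(offset + nr <= NR_PTES);
	for (i = offset; i < offset + nr; i++)
		if (bitmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
			weight++;
	return weight;
}
```

With bits 64, 65 and 67 set, count_present(bitmap, 64, 4) inspects just those four positions and reports 3. A real helper would work on whole words rather than bit by bit; the loop is kept naive here for readability.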

> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, int referenced, int unmapped,
> +		struct collapse_control *cc, unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if ((BIT(order) - 1) & enabled_orders) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>  	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>  
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */

You should note here, that we re-verify in mthp_collapse().

But the question is, whether we should relocate the check completely into
mthp_collapse(), instead of conditionally duplicating it.

What speaks against always populating the bitmap and making the decision in
mthp_collapse()?

Sure, we might scan a page table a bit longer, but the code gets clearer ... and
I am not sure if scanning some more page table entries is really that critical here.


> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
>  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>  	if (!pte) {
>  		cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>  		cc->progress++;
>  
> -		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>  
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  	if (result == SCAN_SUCCEED) {
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> -		/* collapse_huge_page will return with the mmap_lock released */
> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> +					     unmapped, cc, enabled_orders);
> +		/* mmap_lock was released above, set lock_dropped */
>  		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;

As Lance says, this error handling likely needs some thought.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-01  8:11   ` David Hildenbrand (Arm)
@ 2026-06-01 12:40     ` Nico Pache
  2026-06-01 13:15       ` David Hildenbrand (Arm)
  2026-06-04 13:53     ` Lorenzo Stoakes
  1 sibling, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-01 12:40 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Usama Arif, usamaarif642
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

On Mon, Jun 1, 2026 at 2:11 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)

*music intensifies* :p

>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Prior to this patch, PMD collapse had 3 main phases: a lightweight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> > phase (mmap_write_lock).
> >
> > To enable mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search, which
> > avoids deep call chains on the limited kernel stack. The algorithm
> > repeatedly splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > by attempting the PMD order, then move on to consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),
> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >     2) pop the stack
> >     3) check if the number of set bits in that (offset,order) pair
> >        satisfies the max_ptes_none threshold for that order
> >     4) if yes, attempt collapse
> >     5) if no (or collapse fails), push two new stack items representing
> >        the left and right halves of the current bitmap range, at the
> >        next lower order
> >     6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> >                             offset   mid_offset
> >                             |        |
> >                             |        |
> >                             v        v
> >           ____________________________________
> >          |          PTE Page Table            |
> >          --------------------------------------
> >                           <-------><------->
> >                              order-1  order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.

So I looked into your items below. It seems logical, and I think it
works the same way; however, your method seems slightly harder to
understand due to all the edge cases and more error-prone to future
changes (the stack holds implicit knowledge of the offset/order that
must now be tracked in the edge cases).

Given the stack is 24 bytes, I'm not sure if the extra complexity is
worth saving that small amount of memory. Although we would also be
getting rid of (3?) functions, so both approaches have pros and cons.

I will implement a patch comparing your solution against mine and send
it here, then we can decide which approach is better.


>
> Why can't you work with offset + cur_order?
>
> Initially,
>
>         offset = 0;
>         cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
>         if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
>                 /* Try next smaller order. */
>                 cur_order = cur_order - 1;
>         } else {
>                 /* Skip to next chunk. */
>                 offset += 1 << cur_order;
>                 cur_order = max_order_from_offset(offset);
>         }
>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).
>
> What's the benefit of the stack?
>
> >
> > mTHP collapses reject regions containing swapped out or shared pages.
> > This is because adding new entries can lead to new none pages, and these
> > may lead to constant promotion into a higher order mTHP. A similar
> > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > introducing at least 2x the number of pages, and on a future scan will
> > satisfy the promotion condition once again. This issue is prevented via
> > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > restrictions above.
> >
> > We currently only support mTHP collapse for max_ptes_none values of 0
> > and HPAGE_PMD_NR - 1, resulting in the following behavior:
> >
> >     - max_ptes_none=0: Never introduce new empty pages during collapse
> >     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >       available mTHP order
> >
> > Any other max_ptes_none value will emit a warning and default mTHP
> > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > collapse.
> >
> > Once we determine which mTHP sizes fit best in that PMD range, a collapse
> > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
> >
> > We can also remove the check for is_khugepaged inside the PMD scan as
> > the collapse_max_ptes_none() function handles this logic now.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 172 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 64ceebc9d8a7..d3d7db8be26c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > +/*
> > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > + * size needed for a DFS on a binary tree is height + 1, where
> > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > + *
> > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>
> I was confused there for a second why you mention ilog2, when it's really "We
> cannot use HPAGE_PMD_ORDER.".
>
> Best to simplify to:
>
> "Note that we cannot use HPAGE_PMD_ORDER, because it is variable on some
> architectures".

OK, thank you. I can clear that up.
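
For reference, the "height + 1" bound from the comment can be checked with a small userspace simulation. This is only a sketch under stated assumptions: it assumes 4K pages (PMD order 9), pretends that no collapse ever succeeds (the worst case for stack depth), and max_stack_depth() and struct range are made-up names rather than anything from the patch.

```c
#include <assert.h>

#define PMD_ORDER	9	/* assumption: 4K pages, 512 PTEs */
#define MIN_MTHP_ORDER	2
#define STACK_SIZE	(PMD_ORDER - MIN_MTHP_ORDER + 1)

struct range {
	unsigned short offset;
	unsigned char order;
};

/*
 * Visit every (offset, order) candidate the way an iterative DFS would,
 * always descending to the two halves, and return the deepest the stack
 * ever got. *visited is set to the number of ranges popped.
 */
static int max_stack_depth(unsigned int *visited)
{
	struct range stack[STACK_SIZE];
	int size = 0, max = 0;

	*visited = 0;
	stack[size++] = (struct range){ 0, PMD_ORDER };

	while (size) {
		struct range r = stack[--size];

		(*visited)++;
		if (r.order > MIN_MTHP_ORDER) {
			unsigned short half = 1U << (r.order - 1);

			/* The pop above freed one slot; two pushes must fit. */
			assert(size + 2 <= STACK_SIZE);
			/* Push the right half first so the left is popped next. */
			stack[size++] = (struct range){ r.offset + half, r.order - 1 };
			stack[size++] = (struct range){ r.offset, r.order - 1 };
		}
		if (size > max)
			max = size;
	}
	return max;
}
```

Each level of descent pops one entry and pushes two, so the stack grows by one per level. With seven levels between order 9 and order 2, the deepest point is eight entries, which matches STACK_SIZE.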

>
> > + */
> > +#define MTHP_STACK_SIZE      (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > +
> > +/*
> > + * Defines a range of PTE entries in a PTE page table which are being
> > + * considered for mTHP collapse.
> > + *
> > + * @offset: the offset of the first PTE entry in a PMD range.
> > + * @order: the order of the PTE entries being considered for collapse.
> > + */
> > +struct mthp_range {
> > +     u16 offset;
> > +     u8 order;
> > +};
> > +
> >  struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -110,6 +134,12 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /* Each bit represents a single occupied (!none/zero) page. */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>
> This should just be called something like "present_ptes"

Yeah, not a bad idea.

>
> > +     /* A mask of the current range being considered for mTHP collapse. */
> > +     DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>
> This is really just a temporary bitmap used for collapse_mthp_count_present()
> only. Either rename it, or better, avoid it completely.

Yeah, when I first started this we didn't have bitmap_weight_from().
Thanks for the pointer to that; I no longer need a temp bitmap.

>
> >  };
> >
> >  /**
> > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >       return result;
> >  }
> >
> > +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > +                                  u16 offset, u8 order)
> > +{
> > +     const int size = *stack_size;
> > +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > +
> > +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > +     stack->order = order;
> > +     stack->offset = offset;
> > +     (*stack_size)++;
> > +}
> > +
> > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > +                                              int *stack_size)
> > +{
> > +     const int size = *stack_size;
> > +
> > +     VM_WARN_ON_ONCE(size <= 0);
> > +     (*stack_size)--;
> > +     return cc->mthp_bitmap_stack[size - 1];
> > +}
> > +
> > +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> > +                                             u16 offset, unsigned int nr_ptes)
> > +{
> > +     bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>
> You really just want to count the number of set bits? You don't need a temporary
> bitmap for that.
>
> Assume you want to check an order-2 (4 bits), bitmap_weight_and() would check
> all bits ...
>
> I'd suggest starting simple here, and avoiding the temporary bitmap.
>
> Can we simply use bitmap_weight_from(cc->mthp_bitmap, offset, nr_ptes)?

Yes! Thank you :)

>
> > +}
> > +
> > +/*
> > + * mthp_collapse() consumes the bitmap that is generated during
> > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > + *
> > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > + * variables (offset, order), indicating the number of PTEs from the start of
> > + * the PMD, and the order of the potential collapse candidate respectively. We
> > + * start at the PMD order and check if it is eligible for collapse; if not, we
> > + * add two entries to the stack at a lower order to represent the left and right
> > + * halves of the PTE page table we are examining.
> > + *
> > + *                         offset       mid_offset
> > + *                         |         |
> > + *                         |         |
> > + *                         v         v
> > + *      --------------------------------------
> > + *      |          cc->mthp_bitmap            |
> > + *      --------------------------------------
> > + *                         <-------><------->
> > + *                          order-1  order-1
> > + *
> > + * For each of these, we determine how many PTE entries are occupied in the
> > + * range of PTE entries we propose to collapse, then we compare this to a
> > + * threshold number of PTE entries which would need to be occupied for a
> > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > + *
> > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > + * mTHP.
> > + */
> > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > +             unsigned long address, int referenced, int unmapped,
> > +             struct collapse_control *cc, unsigned long enabled_orders)
> > +{
> > +     unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > +     int collapsed = 0, stack_size = 0;
> > +     unsigned long collapse_address;
> > +     struct mthp_range range;
> > +     u16 offset;
> > +     u8 order;
> > +
> > +     collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > +
> > +     while (stack_size) {
> > +             range = collapse_mthp_stack_pop(cc, &stack_size);
> > +             order = range.order;
> > +             offset = range.offset;
> > +             nr_ptes = 1UL << order;
> > +
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next_order;
> > +
> > +             max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > +
> > +             nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > +                                                            nr_ptes);
> > +
> > +             if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > +                     int ret;
> > +
> > +                     collapse_address = address + offset * PAGE_SIZE;
> > +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> > +                                              unmapped, cc, order);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += nr_ptes;
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next_order:
> > +             if ((BIT(order) - 1) & enabled_orders) {
> > +                     const u8 next_order = order - 1;
> > +                     const u16 mid_offset = offset + (nr_ptes / 2);
> > +
> > +                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > +                                              next_order);
> > +                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > +                                              next_order);
> > +             }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> >               bool *lock_dropped, struct collapse_control *cc)
> >  {
> > -     const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> > +     unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > +     enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> >       pmd_t *pmd;
> > -     pte_t *pte, *_pte;
> > -     int none_or_zero = 0, shared = 0, referenced = 0;
> > +     pte_t *pte, *_pte, pteval;
> > +     int i;
> > +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long addr;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >
> > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> > +
> > +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > +
> > +     /*
> > +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > +      * scan all pages to populate the bitmap for mTHP collapse.
> > +      */
>
> You should note here, that we re-verify in mthp_collapse().
>
> But the question is, whether we should relocate the check completely into
> mthp_collapse(), instead of conditionally duplicating it.
>
> What speaks against always populating the bitmap and making the decision in
> mthp_collapse()?
>
> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> I am not sure if scanning some more page table entries is really that critical here.

Someone asked me to preserve the legacy behavior (PMD only). The cost is
rather trivial, but if you set max_ptes_none=0, for example, and PMD
collapse is the only enabled order, we'd still have to do 511 iterations
for no reason rather than bailing immediately.

I'm OK with dropping it, but I think it's the correct approach (despite
the extra complexity). @Usama Arif brought up this point here:
https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
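
To put a number on the trade-off, here is a toy userspace model of the scan loop (a sketch, not the kernel code). It assumes 512 PTEs per table and that the lifted limit equals 511; scan_entries() is a hypothetical helper that only models the none/zero counting, not the other checks in collapse_scan_pmd().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PTES	512	/* assumption: 4K pages, 512 PTEs per table */

/*
 * Return how many entries get visited. The scan stops as soon as the
 * none/zero count exceeds max_ptes_none, loosely mirroring the
 * SCAN_EXCEED_NONE_PTE bail-out.
 */
static unsigned int scan_entries(const bool *present,
				 unsigned int max_ptes_none)
{
	unsigned int i, none = 0;

	for (i = 0; i < NR_PTES; i++) {
		if (!present[i] && ++none > max_ptes_none)
			return i + 1;
	}
	return NR_PTES;
}
```

For a table whose first entry is none, the PMD-only case with max_ptes_none=0 stops after one entry, while the lifted limit walks all 512. That is the extra work being weighed against the simpler code.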

>
>
> > +     if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > +             max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> > +
> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >       if (!pte) {
> >               cc->progress++;
> > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, addr += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             _pte = pte + i;
> > +             addr = start_addr + i * PAGE_SIZE;
> > +             pteval = ptep_get(_pte);
> > +
> >               cc->progress++;
> >
> > -             pte_t pteval = ptep_get(_pte);
> >               if (pte_none_or_zero(pteval)) {
> >                       if (++none_or_zero > max_ptes_none) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >
> > +             /* Set bit for occupied pages */
> > +             __set_bit(i, cc->mthp_bitmap);
> >               /*
> >                * Record which node the original page is from and save this
> >                * information to cc->node_load[].
> > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> > -             result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc, HPAGE_PMD_ORDER);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > +             nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > +                                          unmapped, cc, enabled_orders);
> > +             /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> > +             result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> As Lance says, this error handling likely needs some thought.

Yes I agree, I'm working on that now. I had a WIP patch for this some
time ago, but I never gave it the full thought it needed, and it fell
into the abyss.

Thanks :)
-- Nico

>
> --
> Cheers,
>
> David
>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-01 12:40     ` Nico Pache
@ 2026-06-01 13:15       ` David Hildenbrand (Arm)
  2026-06-02 17:23         ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 13:15 UTC (permalink / raw)
  To: Nico Pache, Usama Arif, usamaarif642
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe

>>
>> Reading this, it is unclear why exactly do we need the stack.
> 
> So I looked into your items below. It seems logical, and I think it
> works the same way; however, your method seems slightly harder to
> understand due to all the edge cases and more error-prone to future
> changes (the stack holds implicit knowledge of the offset/order that
> must now be tracked in the edge cases).
> 
> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> worth saving that small amount of memory. Although we would also be
> getting rid of (3?) functions, so both approaches have pros and cons.

I consider a simple forward loop over the offset ... less complexity compared to
a stack structure :)
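
A rough userspace sketch of such a forward loop follows (not the actual patch). It assumes 4K pages, treats a range as eligible only when every PTE is present (i.e. max_ptes_none=0), and assumes an eligible collapse always succeeds. forward_walk() and range_full() are made-up names; max_order_from_offset() is the natural buddy order capped at the PMD order, as described earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER	9	/* assumption: 4K pages */
#define MIN_ORDER	2
#define NR_PTES		(1U << PMD_ORDER)

/* Largest naturally aligned order at this offset, capped at PMD_ORDER. */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order = 0;

	if (!offset)
		return PMD_ORDER;
	while (!(offset & (1U << order)))
		order++;
	return order < PMD_ORDER ? order : PMD_ORDER;
}

/* True if every PTE in [offset, offset + nr) is present. */
static bool range_full(const bool *present, unsigned int offset,
		       unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		if (!present[offset + i])
			return false;
	return true;
}

/* Walk forward over the table; return the number of PTEs "collapsed". */
static unsigned int forward_walk(const bool *present,
				 unsigned long enabled_orders)
{
	unsigned int offset = 0, order = PMD_ORDER, collapsed = 0;

	while (offset < NR_PTES) {
		unsigned int nr = 1U << order;
		bool ok = (enabled_orders & (1UL << order)) &&
			  range_full(present, offset, nr);

		if (ok)
			collapsed += nr;

		if (!ok && order > MIN_ORDER) {
			order--;	/* same offset, next smaller order */
		} else {
			offset += nr;	/* skip to the next chunk */
			order = max_order_from_offset(offset);
		}
	}
	return collapsed;
}
```

With PTEs 0-15 and 64-71 present and all orders from 2 to 9 enabled, the walk collapses one order-4 range and one order-3 range, 24 PTEs in total. Disabled orders are simply treated as ineligible, so the loop falls through to the next smaller order or skips ahead.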

> 
> I will implement a patch comparing your solution against mine and send
> it here, then we can decide which approach is better.

Right, throw it over the fence and I'll see how to improve it further.

[...]

>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
>>>       nodes_clear(cc->alloc_nmask);
>>> +
>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>>> +
>>> +     /*
>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>>> +      * scan all pages to populate the bitmap for mTHP collapse.
>>> +      */
>>
>> You should note here, that we re-verify in mthp_collapse().
>>
>> But the question is, whether we should relocate the check completely into
>> mthp_collapse(), instead of conditionally duplicating it.
>>
>> What speaks against always populating the bitmap and making the decision in
>> mthp_collapse()?
>>
>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
>> I am not sure if scanning some more page table entries is really that critical here.
> 
> Someone asked me to preserve the legacy behavior (PMD only). Although
> rather trivial, if you set max_ptes_none=0 for example, we'd still
> have to do 511 iterations for no reason if PMD collapse is the only
> enabled order rather than bailing immediately.
> 
> I'm ok with dropping it, but I think its the correct approach (despite
> the extra complexity). @Usama Arif brought up this point here
> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/

We talk about regressions, but I am not sure if we care about scanning speed
within a page table that much?

After all, we locked it and already read some entries.

Having the same check in two places to optimize for PMD order might feel like
a good optimization right now, but isn't it likely to become irrelevant in the
near future?

Anyhow, won't push back, as long as we document why we are special casing things
here.

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-01 13:15       ` David Hildenbrand (Arm)
@ 2026-06-02 17:23         ` Nico Pache
  2026-06-02 17:26           ` Nico Pache
                             ` (3 more replies)
  0 siblings, 4 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-02 17:23 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642



On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
>>>
>>> Reading this, it is unclear why exactly do we need the stack.
>>
>> So I looked into your items below. It seems logical, and I think it
>> works the same way; however, your method seems slightly harder to
>> understand due to all the edge cases and more error-prone to future
>> changes (the stack holds implicit knowledge of the offset/order that
>> must now be tracked in the edge cases).
>>
>> Given the stack is 24 bytes, I'm not sure if the extra complexity is
>> worth saving that small amount of memory. Although we would also be
>> getting rid of (3?) functions, so both approaches have pros and cons.
> 
> I consider a simple forward loop over the offset ... less complexity compared to
> a stack structure :)
> 
>>
>> I will implement a patch comparing your solution against mine and send
>> it here, then we can decide which approach is better.
> 
> Right, throw it over the fence and I'll see how to improve it further.

OK, here's what the diff looks like on top of my v19.

You can access the tree here for easier review:
https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads

So far I have no problem with this approach; it turned out cleaner than I
thought. I did some light testing and will put it through the wringer more
thoroughly tomorrow.


From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
From: Nico Pache <npache@redhat.com>
Date: Tue, 2 Jun 2026 10:26:18 -0600
Subject: [PATCH] convert from stack to forward loop

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 96 ++++++++-----------------------------------------
 1 file changed, 15 insertions(+), 81 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 498eba009751..6de935e76ceb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
 #define KHUGEPAGED_MIN_MTHP_ORDER	2
-/*
- * mthp_collapse() does an iterative DFS over a binary tree, from
- * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
- * size needed for a DFS on a binary tree is height + 1, where
- * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
- *
- * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
- * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
- */
-#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
-
-/*
- * Defines a range of PTE entries in a PTE page table which are being
- * considered for mTHP collapse.
- *
- * @offset: the offset of the first PTE entry in a PMD range.
- * @order: the order of the PTE entries being considered for collapse.
- */
-struct mthp_range {
-	u16 offset;
-	u8 order;
-};
 
 struct collapse_control {
 	bool is_khugepaged;
@@ -137,7 +115,6 @@ struct collapse_control {
 
 	/* Each bit represents a single occupied (!none/zero) page. */
 	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
-	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
 };
 
 /**
@@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	return result;
 }
 
-static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
-				     u16 offset, u8 order)
-{
-	const int size = *stack_size;
-	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
-
-	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
-	stack->order = order;
-	stack->offset = offset;
-	(*stack_size)++;
-}
-
-static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
-						 int *stack_size)
-{
-	const int size = *stack_size;
-
-	VM_WARN_ON_ONCE(size <= 0);
-	(*stack_size)--;
-	return cc->mthp_bitmap_stack[size - 1];
-}
-
 /*
  * mthp_collapse() consumes the bitmap that is generated during
  * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
  *
  * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
- * page. A stack structure cc->mthp_bitmap_stack is used to check different
- * regions of the bitmap for collapse eligibility. The stack maintains a pair
- * of variables (offset, order), indicating the number of PTEs from the start
- * of the PMD, and the order of the potential collapse candidate respectively.
- * We start at the PMD order and check if it is eligible for collapse; if not,
- * we add two entries to the stack at a lower order to represent the left and
- * right halves of the PTE page table we are examining.
- *
- *                         offset       mid_offset
- *                         |         |
- *                         |         |
- *                         v         v
- *      --------------------------------------
- *      |       cc->mthp_present_ptes         |
- *      --------------------------------------
- *                         <-------><------->
- *                          order-1  order-1
+ * page. We start at the PMD order and check if it is eligible for collapse;
+ * if not, we check the left and right halves of the PTE page table we are
+ * examining at a lower order.
  *
  * For each of these, we determine how many PTE entries are occupied in the
  * range of PTE entries we propose to collapse, then we compare this to a
@@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
 {
 	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
 	enum scan_result last_result = SCAN_FAIL;
-	int collapsed = 0, stack_size = 0;
+	int collapsed = 0;
 	bool alloc_failed = false;
 	unsigned long collapse_address;
-	struct mthp_range range;
-	u16 offset;
-	u8 order;
+	unsigned int offset = 0;
+	unsigned int order = HPAGE_PMD_ORDER;
 
-	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
 
-	while (stack_size) {
-		range = collapse_mthp_stack_pop(cc, &stack_size);
-		order = range.order;
-		offset = range.offset;
+	while (offset < HPAGE_PMD_NR) {
 		nr_ptes = 1UL << order;
 
 		if (!test_bit(order, &enabled_orders))
 			goto next_order;
 
 		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
-
 		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
 						      offset + nr_ptes);
 
@@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
 				collapsed += nr_ptes;
 				fallthrough;
 			case SCAN_PTE_MAPPED_HUGEPAGE:
-				continue;
+				goto next_offset;
 			/* Cases where lower orders might still succeed */
 			case SCAN_ALLOC_HUGE_PAGE_FAIL:
 				alloc_failed = true;
@@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
 		}
 
 next_order:
-		if ((BIT(order) - 1) & enabled_orders) {
-			const u8 next_order = order - 1;
-			const u16 mid_offset = offset + (nr_ptes / 2);
-
-			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
-						 next_order);
-			collapse_mthp_stack_push(cc, &stack_size, offset,
-						 next_order);
+		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
+			(BIT(order) - 1) & enabled_orders) {
+			order = order - 1;
+			continue;
 		}
+next_offset:
+		offset += nr_ptes;
+		order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
 	}
 done:
 	if (collapsed)
-- 
2.54.0
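
For readers who want to see which candidates the forward loop visits, here is
a minimal user-space sketch of the same traversal. It is not the kernel code:
PMD_ORDER (assuming 4K pages), struct visit and walk_candidates() are
hypothetical names, every collapse attempt is treated as a failure so the walk
always descends, and __builtin_ctz() (a GCC/Clang builtin) stands in for
__ffs().

```c
#include <assert.h>
#include <stddef.h>

#define PMD_ORDER	9	/* assumption: 4K pages, 512 PTEs per PMD */
#define PMD_NR		(1U << PMD_ORDER)
#define MIN_MTHP_ORDER	2

struct visit {
	unsigned int offset;
	unsigned int order;
};

/*
 * Record every (offset, order) pair the forward loop would attempt.
 * Each attempt is treated as a failure, so the walk always falls back
 * to the next lower order before advancing the offset.
 */
static size_t walk_candidates(unsigned long enabled_orders,
			      struct visit *out, size_t max)
{
	unsigned int offset = 0, order = PMD_ORDER;
	size_t n = 0;

	while (offset < PMD_NR) {
		unsigned int nr_ptes = 1U << order;

		if (enabled_orders & (1UL << order)) {
			assert(n < max);
			out[n].offset = offset;
			out[n].order = order;
			n++;
		}

		/* Retry the same offset at a lower order if one is enabled. */
		if (order > MIN_MTHP_ORDER &&
		    (((1UL << order) - 1) & enabled_orders)) {
			order--;
			continue;
		}

		/* Advance; the alignment of the new offset bounds the order. */
		offset += nr_ptes;
		order = (unsigned int)__builtin_ctz(offset);
		if (order > PMD_ORDER)
			order = PMD_ORDER;
	}
	return n;
}
```

With only orders 9 and 8 enabled, the sketch visits (0, 9), (0, 8) and
(256, 8), the same order in which the stack-based version would pop them.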



> 
> [...]
> 
>>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
>>>>       nodes_clear(cc->alloc_nmask);
>>>> +
>>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>>>> +
>>>> +     /*
>>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>>>> +      * scan all pages to populate the bitmap for mTHP collapse.
>>>> +      */
>>>
>>> You should note here, that we re-verify in mthp_collapse().
>>>
>>> But the question is, whether we should relocate the check completely into
>>> mthp_collapse(), instead of conditionally duplicating it.
>>>
>>> What speaks against always populating the bitmap and making the decision in
>>> mthp_collapse()?
>>>
>>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
>>> I am not sure if scanning some more page table entries is really that critical here.
>>
>> Someone asked me to preserve the legacy behavior (PMD only). Although
>> rather trivial, if you set max_ptes_none=0 for example, we'd still
>> have to do 511 iterations for no reason if PMD collapse is the only
>> enabled order rather than bailing immediately.
>>
>> I'm ok with dropping it, but I think its the correct approach (despite
>> the extra complexity). @Usama Arif brought up this point here
>> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> 
> We talk about regressions, but I am not sure if we care about scanning speed
> within a page table that much?
> 
> After all, we locked it and already read some entries.
> 
> Having the same check at two places to optimize for PMD order might right now
> feel like a good optimization, but likely an irrelevant one in a near future?
> 
> Anyhow, won't push back, as long as we document why we are special casing things
> here.
> 


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-02 17:23         ` Nico Pache
@ 2026-06-02 17:26           ` Nico Pache
  2026-06-03  9:55           ` David Hildenbrand (Arm)
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-02 17:26 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642

On Tue, Jun 2, 2026 at 11:22 AM Nico Pache <npache@redhat.com> wrote:
>
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok heres what the diff looks like on top of my V19.
>
> you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
>
> So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.

Not sure why this didn't send with the proper encoding; I guess my email
setup is still a little screwed up.



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-02 17:23         ` Nico Pache
  2026-06-02 17:26           ` Nico Pache
@ 2026-06-03  9:55           ` David Hildenbrand (Arm)
  2026-06-03 10:00           ` David Hildenbrand (Arm)
  2026-06-04 14:14           ` Lorenzo Stoakes
  3 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-03  9:55 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642

On 6/2/26 19:23, Nico Pache wrote:
> 
> 
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
>>>
>>> So I looked into your items below. It seems logical, and I think it
>>> works the same way; however, your method seems slightly harder to
>>> understand due to all the edge cases and more error-prone to future
>>> changes (the stack holds implicit knowledge of the offset/order that
>>> must now be tracked in the edge cases).
>>>
>>> Given the stack is 24 bytes, I'm not sure if the extra complexity is
>>> worth saving that small amount of memory. Although we would also be
>>> getting rid of (3?) functions, so both approaches have pros and cons.
>>
>> I consider a simple forward loop over the offset ... less complexity compared to
>> a stack structure :)
>>
>>>
>>> I will implement a patch comparing your solution against mine and send
>>> it here, then we can decide which approach is better.
>>
>> Right, throw it over the fence and I'll see how to improve it further.
> 
> Ok heres what the diff looks like on top of my V19. 
> 
> you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
> 
> So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow. 

It's very clean.

Almost too nice to be true ;)

[...]

>  	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>  	enum scan_result last_result = SCAN_FAIL;
> -	int collapsed = 0, stack_size = 0;
> +	int collapsed = 0;
>  	bool alloc_failed = false;
>  	unsigned long collapse_address;
> -	struct mthp_range range;
> -	u16 offset;
> -	u8 order;
> +	unsigned int offset = 0;
> +	unsigned int order = HPAGE_PMD_ORDER;


In include/linux/huge_mm.h we have

	highest_order()

and

	next_order()

They essentially allow you to get rid of the test_bit() and just jump to the
next enabled order right away.

I assume with only a handful of enabled_orders, that might be much more efficient.

I tried to optimize it and ended up with the following, which is completely untested.

I think it might make sense to defer that and start with the simple approach you have.

I do wonder, though, about the last hunk below: should we bail out early if
enabled_orders is suddenly 0?
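
As a rough illustration of the iteration pattern these helpers allow, here is
a user-space sketch. The helper bodies are assumptions modeled on
include/linux/huge_mm.h; fls_long_sketch() and list_orders() are hypothetical
names, and __builtin_clzl() (a GCC/Clang builtin) stands in for the kernel's
fls_long().

```c
#include <assert.h>
#include <limits.h>

/*
 * User-space stand-in for fls_long(): 1-based index of the most
 * significant set bit, or 0 when no bit is set.
 */
static int fls_long_sketch(unsigned long x)
{
	return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clzl(x) : 0;
}

static int highest_order(unsigned long orders)
{
	return fls_long_sketch(orders) - 1;
}

static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~(1UL << prev);
	return highest_order(*orders);
}

/* Collect the enabled orders from highest to lowest; returns the count. */
static int list_orders(unsigned long orders, int *out, int max)
{
	int order = highest_order(orders);
	int n = 0;

	/* Loop on the mask, not on the order: highest_order(0) is -1. */
	while (orders) {
		assert(n < max);
		out[n++] = order;
		order = next_order(&orders, order);
	}
	return n;
}
```

Only set bits are ever visited, so a mask with three enabled orders costs
three iterations instead of one test_bit() per order between PMD order and
the minimum.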



From 0d8ff955b3071f354b7fc9b627820fa374fa99dc Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 3 Jun 2026 11:52:44 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
 include/linux/huge_mm.h |   5 ++
 mm/khugepaged.c         | 132 ++++++++++++++++++++++------------------
 2 files changed, 78 insertions(+), 59 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 48496f09909b..099318bc1181 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -205,6 +205,11 @@ static inline int highest_order(unsigned long orders)
 	return fls_long(orders) - 1;
 }
 
+static inline int smallest_order(unsigned long orders)
+{
+	return __ffs(orders);
+}
+
 static inline int next_order(unsigned long *orders, int prev)
 {
 	*orders &= ~BIT(prev);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6de935e76ceb..49be9d1a88cb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -99,8 +99,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
-#define KHUGEPAGED_MIN_MTHP_ORDER	2
-
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -1454,76 +1452,86 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
  */
 static enum scan_result mthp_collapse(struct mm_struct *mm,
 		unsigned long address, int referenced, int unmapped,
-		struct collapse_control *cc, unsigned long enabled_orders)
+		struct collapse_control *cc, const unsigned long enabled_orders)
 {
-	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
 	enum scan_result last_result = SCAN_FAIL;
 	int collapsed = 0;
 	bool alloc_failed = false;
 	unsigned long collapse_address;
 	unsigned int offset = 0;
-	unsigned int order = HPAGE_PMD_ORDER;
 
+	/* We cannot collapse anon folios to order-1 or order-0. */
+	VM_WARN_ON_ONCE(!enabled_orders || (enabled_orders & 0x3));
 
 	while (offset < HPAGE_PMD_NR) {
-		nr_ptes = 1UL << order;
-
-		if (!test_bit(order, &enabled_orders))
-			goto next_order;
-
-		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
-		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
-						      offset + nr_ptes);
-
-		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
-			enum scan_result ret;
-
-			collapse_address = address + offset * PAGE_SIZE;
-			ret = collapse_huge_page(mm, collapse_address, referenced,
-						 unmapped, cc, order);
-
-			switch (ret) {
-			/* Cases where we continue to next collapse candidate */
-			case SCAN_SUCCEED:
-				collapsed += nr_ptes;
-				fallthrough;
-			case SCAN_PTE_MAPPED_HUGEPAGE:
-				goto next_offset;
-			/* Cases where lower orders might still succeed */
-			case SCAN_ALLOC_HUGE_PAGE_FAIL:
-				alloc_failed = true;
-				fallthrough;
-			case SCAN_LACK_REFERENCED_PAGE:
-			case SCAN_EXCEED_NONE_PTE:
-			case SCAN_EXCEED_SWAP_PTE:
-			case SCAN_EXCEED_SHARED_PTE:
-			case SCAN_PAGE_LOCK:
-			case SCAN_PAGE_COUNT:
-			case SCAN_PAGE_NULL:
-			case SCAN_DEL_PAGE_LRU:
-			case SCAN_PTE_NON_PRESENT:
-			case SCAN_PTE_UFFD_WP:
-			case SCAN_PAGE_LAZYFREE:
-				last_result = ret;
-				goto next_order;
-			/* Cases where no further collapse is possible */
-			case SCAN_PMD_MAPPED:
-				fallthrough;
-			default:
-				last_result = ret;
-				goto done;
+		/*
+		 * We can only collapse to a maximum order for a given offset.
+		 * So ignore all orders that do not apply to the current
+		 * offset, then see if any order to collapse to remains.
+		 */
+		unsigned long orders = enabled_orders & GENMASK(__ffs(offset), 0);
+		unsigned int order = highest_order(orders);
+
+		while (order) {
+			const unsigned int nr_ptes = 1UL << order;
+			unsigned int nr_occupied_ptes, max_ptes_none;
+
+			max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
+			nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
+							      offset + nr_ptes);
+
+			if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
+				enum scan_result ret;
+
+				collapse_address = address + offset * PAGE_SIZE;
+				ret = collapse_huge_page(mm, collapse_address, referenced,
+							 unmapped, cc, order);
+
+				switch (ret) {
+				/* Cases where we continue to next collapse candidate */
+				case SCAN_SUCCEED:
+					collapsed += nr_ptes;
+					fallthrough;
+				case SCAN_PTE_MAPPED_HUGEPAGE:
+					goto next_offset;
+				/* Cases where lower orders might still succeed */
+				case SCAN_ALLOC_HUGE_PAGE_FAIL:
+					alloc_failed = true;
+					fallthrough;
+				case SCAN_LACK_REFERENCED_PAGE:
+				case SCAN_EXCEED_NONE_PTE:
+				case SCAN_EXCEED_SWAP_PTE:
+				case SCAN_EXCEED_SHARED_PTE:
+				case SCAN_PAGE_LOCK:
+				case SCAN_PAGE_COUNT:
+				case SCAN_PAGE_NULL:
+				case SCAN_DEL_PAGE_LRU:
+				case SCAN_PTE_NON_PRESENT:
+				case SCAN_PTE_UFFD_WP:
+				case SCAN_PAGE_LAZYFREE:
+					last_result = ret;
+					break;
+				/* Cases where no further collapse is possible */
+				case SCAN_PMD_MAPPED:
+					fallthrough;
+				default:
+					last_result = ret;
+					goto done;
+				}
 			}
-		}
 
-next_order:
-		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
-			(BIT(order) - 1) & enabled_orders) {
-			order = order - 1;
-			continue;
+			order = next_order(&orders, order);
 		}
+
 next_offset:
-		offset += nr_ptes;
-		order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
+		/*
+		 * Continue with the next collapse candidate. If we do not
+		 * have an order, skip to the next smallest mTHP we can collapse to.
+		 */
+		if (order)
+			offset += 1UL << order;
+		else
+			offset = ALIGN(offset + 1, 1UL << smallest_order(enabled_orders));
 	}
 done:
 	if (collapsed)
@@ -1567,6 +1575,12 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 
 	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
 
+	if (unlikely(!enabled_orders)) {
+		cc->progress++;
+		result = SCAN_SUCCEED;
+		goto out;
+	}
+
 	/*
 	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
 	 * scan all pages to populate the bitmap for mTHP collapse.
-- 
2.43.0


-- 
Cheers,

David

^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-02 17:23         ` Nico Pache
  2026-06-02 17:26           ` Nico Pache
  2026-06-03  9:55           ` David Hildenbrand (Arm)
@ 2026-06-03 10:00           ` David Hildenbrand (Arm)
  2026-06-03 12:16             ` Nico Pache
  2026-06-04 14:14           ` Lorenzo Stoakes
  3 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-03 10:00 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642


>  next_order:
> -		if ((BIT(order) - 1) & enabled_orders) {
> -			const u8 next_order = order - 1;
> -			const u16 mid_offset = offset + (nr_ptes / 2);
> -
> -			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> -						 next_order);
> -			collapse_mthp_stack_push(cc, &stack_size, offset,
> -						 next_order);
> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> +			(BIT(order) - 1) & enabled_orders) {

Why not a test_bit() ?


But, wouldn't you want to skip orders that are not enabled and try with the next
smaller one in any case before you advance the offset?

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-03 10:00           ` David Hildenbrand (Arm)
@ 2026-06-03 12:16             ` Nico Pache
  2026-06-03 12:27               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-03 12:16 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642

On Wed, Jun 3, 2026 at 4:01 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
>
> >  next_order:
> > -             if ((BIT(order) - 1) & enabled_orders) {
> > -                     const u8 next_order = order - 1;
> > -                     const u16 mid_offset = offset + (nr_ptes / 2);
> > -
> > -                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > -                                              next_order);
> > -                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > -                                              next_order);
> > +             if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> > +                     (BIT(order) - 1) & enabled_orders) {
>
> Why not a test_bit() ?

The test bit is at the top of the loop. This adds an exit if the lower
orders are all disabled or we hit the last order.

>
>
> But, wouldn't you want to skip orders that are not enabled and try with the next
> smaller one in any case before you advance the offset?

We are currently iterating through each order (not skipping them).
There may be optimizations to avoid iterating through every order
(like your changes suggest), but currently, every collapse, whether it
succeeds or fails at the bottom order, must also iterate the offset.

lmk if that makes sense!
-- Nico

>
> --
> Cheers,
>
> David
>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-03 12:16             ` Nico Pache
@ 2026-06-03 12:27               ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-03 12:27 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642

On 6/3/26 14:16, Nico Pache wrote:
> On Wed, Jun 3, 2026 at 4:01 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>>
>>>  next_order:
>>> -             if ((BIT(order) - 1) & enabled_orders) {
>>> -                     const u8 next_order = order - 1;
>>> -                     const u16 mid_offset = offset + (nr_ptes / 2);
>>> -
>>> -                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
>>> -                                              next_order);
>>> -                     collapse_mthp_stack_push(cc, &stack_size, offset,
>>> -                                              next_order);
>>> +             if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
>>> +                     (BIT(order) - 1) & enabled_orders) {
>>
>> Why not a test_bit() ?
> 
> The test bit is at the top of the loop. This adds an exit if the lower
> orders are all disabled or we hit the last order.

Ah, now I understand what you want to do here.

I guess you should add a () around the & and maybe add a comment.

And likely using GENMASK is clearer?

/*
 * Continue with the next smaller order if there is still
 * any smaller order enabled.
 */
if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
    (enabled_orders & GENMASK(order - 1, 0))) {
	...
}
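
As a quick sanity check that the two spellings test the same thing, here is a
minimal userspace sketch. BIT() and GENMASK() below are simplified local
stand-ins for the kernel macros (assumption: order >= 1 and unsigned long
masks), and the two has_smaller_enabled_*() helpers are hypothetical names used
only for illustration, not code from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-ins for the kernel's BIT() and GENMASK(). */
#define BIT(n)		(1UL << (n))
#define GENMASK(h, l) \
	((~0UL << (l)) & (~0UL >> (8 * sizeof(unsigned long) - 1 - (h))))

/* Spelling used in the patch: mask of all bits below 'order'. */
static bool has_smaller_enabled_bit(unsigned long enabled_orders,
				    unsigned int order)
{
	return ((BIT(order) - 1) & enabled_orders) != 0;
}

/* Suggested spelling: explicit bit range [order - 1, 0]. */
static bool has_smaller_enabled_genmask(unsigned long enabled_orders,
					unsigned int order)
{
	assert(order >= 1);	/* GENMASK(order - 1, 0) needs order >= 1 */
	return (enabled_orders & GENMASK(order - 1, 0)) != 0;
}
```

Both helpers should agree for every order >= 1; the GENMASK form just makes the
"any smaller order" intent visible at the call site.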


> 
>>
>>
>> But, wouldn't you want to skip orders that are not enabled and try with the next
>> smaller one in any case before you advance the offset?
> 
> We are currently iterating through each order (not skipping them).
> There may be optimizations to avoid iterating through every order
> (like your changes suggest), but currently, every collapse, whether it
> succeeds or fails at the bottom order, must also iterate the offset.

Right, we currently miss opportunities to just skip; we can optimize that later.
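
For illustration only, one way such a skip could look is to jump straight to
the largest enabled order below the current one instead of decrementing one
order at a time. This is a userspace sketch under stated assumptions:
next_enabled_order_below() is a hypothetical helper name, not the series'
implementation, and __builtin_clzl() stands in for the kernel's __fls().

```c
#include <assert.h>
#include <limits.h>

/*
 * Return the largest enabled order strictly below 'order', or -1 if no
 * smaller order is enabled. Hypothetical helper, not part of the posted
 * series.
 */
static int next_enabled_order_below(unsigned long enabled_orders,
				    unsigned int order)
{
	unsigned long lower;

	assert(order < sizeof(unsigned long) * CHAR_BIT);

	/* Keep only the bits for orders smaller than 'order'. */
	lower = enabled_orders & ((1UL << order) - 1);
	if (!lower)
		return -1;

	/* Index of the highest set bit that remains. */
	return (int)(sizeof(unsigned long) * CHAR_BIT - 1) -
	       __builtin_clzl(lower);
}
```

With orders 9, 5, 3 and 2 enabled, a failed order-5 attempt would go directly
to order 3 rather than first visiting the disabled order 4.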


-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-02 17:23         ` Nico Pache
                             ` (2 preceding siblings ...)
  2026-06-03 10:00           ` David Hildenbrand (Arm)
@ 2026-06-04 14:14           ` Lorenzo Stoakes
  2026-06-04 14:19             ` Lorenzo Stoakes
  3 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 14:14 UTC (permalink / raw)
  To: Nico Pache
  Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, vbabka, vishal.moola, wangkefeng.wang,
	will, willy, yang, ying.huang, ziy, zokeefe, Usama Arif,
	usamaarif642

On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote:
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok, here's what the diff looks like on top of my V19.
>
> You can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
>
> So far I have no problem with this approach; it appeared cleaner than I thought. Did some light testing. Gonna put it through the wringer some more tomorrow.
>
>
> From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> From: Nico Pache <npache@redhat.com>
> Date: Tue, 2 Jun 2026 10:26:18 -0600
> Subject: [PATCH] convert from stack to forward loop
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 96 ++++++++-----------------------------------------
>  1 file changed, 15 insertions(+), 81 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 498eba009751..6de935e76ceb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
>  #define KHUGEPAGED_MIN_MTHP_ORDER	2
> -/*
> - * mthp_collapse() does an iterative DFS over a binary tree, from
> - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> - * size needed for a DFS on a binary tree is height + 1, where
> - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> - *
> - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> - */
> -#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> -
> -/*
> - * Defines a range of PTE entries in a PTE page table which are being
> - * considered for mTHP collapse.
> - *
> - * @offset: the offset of the first PTE entry in a PMD range.
> - * @order: the order of the PTE entries being considered for collapse.
> - */
> -struct mthp_range {
> -	u16 offset;
> -	u8 order;
> -};
>
>  struct collapse_control {
>  	bool is_khugepaged;
> @@ -137,7 +115,6 @@ struct collapse_control {
>
>  	/* Each bit represents a single occupied (!none/zero) page. */
>  	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> -	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>  };
>
>  /**
> @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>
> -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> -				     u16 offset, u8 order)
> -{
> -	const int size = *stack_size;
> -	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> -
> -	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> -	stack->order = order;
> -	stack->offset = offset;
> -	(*stack_size)++;
> -}
> -
> -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> -						 int *stack_size)
> -{
> -	const int size = *stack_size;
> -
> -	VM_WARN_ON_ONCE(size <= 0);
> -	(*stack_size)--;
> -	return cc->mthp_bitmap_stack[size - 1];
> -}
> -
>  /*
>   * mthp_collapse() consumes the bitmap that is generated during
>   * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>   *
>   * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> - * of variables (offset, order), indicating the number of PTEs from the start
> - * of the PMD, and the order of the potential collapse candidate respectively.
> - * We start at the PMD order and check if it is eligible for collapse; if not,
> - * we add two entries to the stack at a lower order to represent the left and
> - * right halves of the PTE page table we are examining.
> - *
> - *                         offset       mid_offset
> - *                         |         |
> - *                         |         |
> - *                         v         v
> - *      --------------------------------------
> - *      |       cc->mthp_present_ptes         |
> - *      --------------------------------------
> - *                         <-------><------->
> - *                          order-1  order-1
> + * page. We start at the PMD order and check if it is eligible for collapse;
> + * if not, we check the left and right halves of the PTE page table we are
> + * examining at a lower order.

Yeah this is not good enough, sorry. Before, there was some kind of explanation
of the algorithm; just because you can explain the _code_ more simply, that's
not very useful.

I had to sit down and spend quite a bit of time to figure out how the actual
output looks so I think that should be explained.

>   *
>   * For each of these, we determine how many PTE entries are occupied in the
>   * range of PTE entries we propose to collapse, then we compare this to a
> @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>  {
>  	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>  	enum scan_result last_result = SCAN_FAIL;
> -	int collapsed = 0, stack_size = 0;
> +	int collapsed = 0;
>  	bool alloc_failed = false;
>  	unsigned long collapse_address;
> -	struct mthp_range range;
> -	u16 offset;
> -	u8 order;
> +	unsigned int offset = 0;
> +	unsigned int order = HPAGE_PMD_ORDER;
>
> -	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>
> -	while (stack_size) {
> -		range = collapse_mthp_stack_pop(cc, &stack_size);
> -		order = range.order;
> -		offset = range.offset;
> +	while (offset < HPAGE_PMD_NR) {
>  		nr_ptes = 1UL << order;
>
>  		if (!test_bit(order, &enabled_orders))
>  			goto next_order;
>
>  		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> -
>  		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>  						      offset + nr_ptes);
>
> @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>  				collapsed += nr_ptes;
>  				fallthrough;
>  			case SCAN_PTE_MAPPED_HUGEPAGE:
> -				continue;
> +				goto next_offset;
>  			/* Cases where lower orders might still succeed */
>  			case SCAN_ALLOC_HUGE_PAGE_FAIL:
>  				alloc_failed = true;
> @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>  		}
>

This obviously needs some comments describing what you're doing here. I think
David said so too.

>  next_order:
> -		if ((BIT(order) - 1) & enabled_orders) {
> -			const u8 next_order = order - 1;
> -			const u16 mid_offset = offset + (nr_ptes / 2);
> -
> -			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> -						 next_order);
> -			collapse_mthp_stack_push(cc, &stack_size, offset,
> -						 next_order);
> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> +			(BIT(order) - 1) & enabled_orders) {

Wait, if I disable an order this changes the way we get mTHP doesn't it?

Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get:

9->8->7->6->5->5->6->5->5->7

And we simply can't get order-3 no?

This seems broken doesn't it? Maybe I'm missing something?


> +			order = order - 1;

order--?

> +			continue;
>  		}
> +next_offset:
> +		offset += nr_ptes;
> +		order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);

Also, in the case where the enabled order check above skips the order--,
couldn't we have offset=0 here and end up just looping around checking from (0,
HPAGE_PMD_ORDER) again? That also seems broken?

Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined
behaviour no?

https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5
Says as much?

I guess we're assuming we're not going to get to 0 here, but that could do with
a comment or a VM_WARN_ON_ONCE() at least.

Also why aren't we making this a function?

static inline unsigned int max_order_from_offset(unsigned int offset)
{
	if (!offset)
		return HPAGE_PMD_ORDER;

	return __ffs(offset);
}

Though __ffs() works on unsigned long... probably... OK?
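
A userspace sketch of that helper, to make the offset-0 case explicit. The
assumptions here: HPAGE_PMD_ORDER is hard-coded to 9 (4K base pages, 2M PMD),
__builtin_ctzl() stands in for the kernel's __ffs(), and the cap at
HPAGE_PMD_ORDER is carried over from the min_t() in the patch.

```c
#include <assert.h>

#define HPAGE_PMD_ORDER	9	/* assumption: 4K base pages, 2M PMD */

/*
 * Largest order whose natural alignment 'offset' satisfies, capped at
 * HPAGE_PMD_ORDER. Offset 0 is aligned to every order, and counting the
 * trailing zeros of 0 is undefined, so it is handled explicitly.
 */
static unsigned int max_order_from_offset(unsigned long offset)
{
	unsigned int order;

	if (!offset)
		return HPAGE_PMD_ORDER;

	/* __builtin_ctzl() stands in for the kernel's __ffs(). */
	order = (unsigned int)__builtin_ctzl(offset);
	assert(order < 8 * sizeof(unsigned long));

	return order < HPAGE_PMD_ORDER ? order : HPAGE_PMD_ORDER;
}
```

Taking unsigned long as the parameter type sidesteps the unsigned int versus
unsigned long question, since the argument is widened before the bit scan.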

>  	}
>  done:
>  	if (collapsed)
> --
> 2.54.0
>
>
>
> >
> > [...]
> >
> >>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
> >>>>       nodes_clear(cc->alloc_nmask);
> >>>> +
> >>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> >>>> +
> >>>> +     /*
> >>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >>>> +      * scan all pages to populate the bitmap for mTHP collapse.
> >>>> +      */
> >>>
> >>> You should note here, that we re-verify in mthp_collapse().
> >>>
> >>> But the question is, whether we should relocate the check completely into
> >>> mthp_collapse(), instead of conditionally duplicating it.
> >>>
> >>> What speaks against always populating the bitmap and making the decision in
> >>> mthp_collapse()?
> >>>
> >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> >>> I am not sure if scanning some more page table entries is really that critical here.
> >>
> >> Someone asked me to preserve the legacy behavior (PMD only). Although
> >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> >> have to do 511 iterations for no reason if PMD collapse is the only
> >> enabled order rather than bailing immediately.
> >>
> >> I'm ok with dropping it, but I think its the correct approach (despite
> >> the extra complexity). @Usama Arif brought up this point here
> >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> >
> > We talk about regressions, but I am not sure if we care about scanning speed
> > within a page table that much?
> >
> > After all, we locked it and already read some entries.
> >
> > Having the same check at two places to optimize for PMD order might right now
> > feel like a good optimization, but likely an irrelevant one in a near future?
> >
> > Anyhow, won't push back, as long as we document why we are special casing things
> > here.
> >
>

Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-04 14:14           ` Lorenzo Stoakes
@ 2026-06-04 14:19             ` Lorenzo Stoakes
  0 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 14:19 UTC (permalink / raw)
  To: Nico Pache
  Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, vbabka, vishal.moola, wangkefeng.wang,
	will, willy, yang, ying.huang, ziy, zokeefe, Usama Arif,
	usamaarif642

On Thu, Jun 04, 2026 at 03:14:59PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote:
> >
> >
> > On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> > >>>
> > >>> Reading this, it is unclear why exactly do we need the stack.
> > >>
> > >> So I looked into your items below. It seems logical, and I think it
> > >> works the same way; however, your method seems slightly harder to
> > >> understand due to all the edge cases and more error-prone to future
> > >> changes (the stack holds implicit knowledge of the offset/order that
> > >> must now be tracked in the edge cases).
> > >>
> > >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> > >> worth saving that small amount of memory. Although we would also be
> > >> getting rid of (3?) functions, so both approaches have pros and cons.
> > >
> > > I consider a simple forward loop over the offset ... less complexity compared to
> > > a stack structure :)
> > >
> > >>
> > >> I will implement a patch comparing your solution against mine and send
> > >> it here, then we can decide which approach is better.
> > >
> > > Right, throw it over the fence and I'll see how to improve it further.
> >
> > Ok, here's what the diff looks like on top of my V19.
> >
> > You can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
> >
> > So far I have no problem with this approach; it appeared cleaner than I thought. Did some light testing. Gonna put it through the wringer some more tomorrow.
> >
> >
> > From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> > From: Nico Pache <npache@redhat.com>
> > Date: Tue, 2 Jun 2026 10:26:18 -0600
> > Subject: [PATCH] convert from stack to forward loop
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 96 ++++++++-----------------------------------------
> >  1 file changed, 15 insertions(+), 81 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 498eba009751..6de935e76ceb 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> >  #define KHUGEPAGED_MIN_MTHP_ORDER	2
> > -/*
> > - * mthp_collapse() does an iterative DFS over a binary tree, from
> > - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > - * size needed for a DFS on a binary tree is height + 1, where
> > - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > - *
> > - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > - */
> > -#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > -
> > -/*
> > - * Defines a range of PTE entries in a PTE page table which are being
> > - * considered for mTHP collapse.
> > - *
> > - * @offset: the offset of the first PTE entry in a PMD range.
> > - * @order: the order of the PTE entries being considered for collapse.
> > - */
> > -struct mthp_range {
> > -	u16 offset;
> > -	u8 order;
> > -};
> >
> >  struct collapse_control {
> >  	bool is_khugepaged;
> > @@ -137,7 +115,6 @@ struct collapse_control {
> >
> >  	/* Each bit represents a single occupied (!none/zero) page. */
> >  	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> > -	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> >  };
> >
> >  /**
> > @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >  	return result;
> >  }
> >
> > -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > -				     u16 offset, u8 order)
> > -{
> > -	const int size = *stack_size;
> > -	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > -
> > -	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > -	stack->order = order;
> > -	stack->offset = offset;
> > -	(*stack_size)++;
> > -}
> > -
> > -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > -						 int *stack_size)
> > -{
> > -	const int size = *stack_size;
> > -
> > -	VM_WARN_ON_ONCE(size <= 0);
> > -	(*stack_size)--;
> > -	return cc->mthp_bitmap_stack[size - 1];
> > -}
> > -
> >  /*
> >   * mthp_collapse() consumes the bitmap that is generated during
> >   * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> >   *
> >   * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> > - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> > - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> > - * of variables (offset, order), indicating the number of PTEs from the start
> > - * of the PMD, and the order of the potential collapse candidate respectively.
> > - * We start at the PMD order and check if it is eligible for collapse; if not,
> > - * we add two entries to the stack at a lower order to represent the left and
> > - * right halves of the PTE page table we are examining.
> > - *
> > - *                         offset       mid_offset
> > - *                         |         |
> > - *                         |         |
> > - *                         v         v
> > - *      --------------------------------------
> > - *      |       cc->mthp_present_ptes         |
> > - *      --------------------------------------
> > - *                         <-------><------->
> > - *                          order-1  order-1
> > + * page. We start at the PMD order and check if it is eligible for collapse;
> > + * if not, we check the left and right halves of the PTE page table we are
> > + * examining at a lower order.
>
> Yeah this is not good enough, sorry. Before, there was some kind of explanation
> of the algorithm; just because you can explain the _code_ more simply, that's
> not very useful.
>
> I had to sit down and spend quite a bit of time to figure out how the actual
> output looks so I think that should be explained.
>
> >   *
> >   * For each of these, we determine how many PTE entries are occupied in the
> >   * range of PTE entries we propose to collapse, then we compare this to a
> > @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> >  {
> >  	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> >  	enum scan_result last_result = SCAN_FAIL;
> > -	int collapsed = 0, stack_size = 0;
> > +	int collapsed = 0;
> >  	bool alloc_failed = false;
> >  	unsigned long collapse_address;
> > -	struct mthp_range range;
> > -	u16 offset;
> > -	u8 order;
> > +	unsigned int offset = 0;
> > +	unsigned int order = HPAGE_PMD_ORDER;
> >
> > -	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >
> > -	while (stack_size) {
> > -		range = collapse_mthp_stack_pop(cc, &stack_size);
> > -		order = range.order;
> > -		offset = range.offset;
> > +	while (offset < HPAGE_PMD_NR) {
> >  		nr_ptes = 1UL << order;
> >
> >  		if (!test_bit(order, &enabled_orders))
> >  			goto next_order;
> >
> >  		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> > -
> >  		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> >  						      offset + nr_ptes);
> >
> > @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> >  				collapsed += nr_ptes;
> >  				fallthrough;
> >  			case SCAN_PTE_MAPPED_HUGEPAGE:
> > -				continue;
> > +				goto next_offset;
> >  			/* Cases where lower orders might still succeed */
> >  			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> >  				alloc_failed = true;
> > @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> >  		}
> >
>
> This obviously needs some comments describing what you're doing here. I think
> David said so too.
>
> >  next_order:
> > -		if ((BIT(order) - 1) & enabled_orders) {
> > -			const u8 next_order = order - 1;
> > -			const u16 mid_offset = offset + (nr_ptes / 2);
> > -
> > -			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > -						 next_order);
> > -			collapse_mthp_stack_push(cc, &stack_size, offset,
> > -						 next_order);
> > +		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> > +			(BIT(order) - 1) & enabled_orders) {
>
> Wait, if I disable an order this changes the way we get mTHP doesn't it?
>
> Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get:
>
> 9->8->7->6->5->5->6->5->5->7
>
> And we simply can't get order-3 no?
>
> This seems broken doesn't it? Maybe I'm missing something?

OK it's the way this is written, very confusing. I do not know why you are
writing this code in this 'compressed' way.

((1 << order) - 1) & enabled_orders is to see if there are _any others_ to check.

>
>
> > +			order = order - 1;
>
> order--?
>
> > +			continue;
> >  		}
> > +next_offset:
> > +		offset += nr_ptes;
> > +		order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
>
> Also, in the case where the enabled order check above skips the order--,
> couldn't we have offset=0 here and end up just looping around checking from (0,
> HPAGE_PMD_ORDER) again? That also seems broken?

Yeah sorry the offset += nr_ptes fixes that anyway.

And the fact it's a mask check above makes this OK.

So the logic seems probably fine but it needs to be clearer.
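
To make the traversal order concrete, here is a userspace sketch that walks a
PMD range the way the forward loop in the diff does, pretending every collapse
attempt fails so that all candidates are visited. Assumptions: the constants
are hard-coded for 4K base pages, __builtin_ctz() stands in for __ffs(), and
struct attempt and simulate_forward_loop() are hypothetical names for this
illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define HPAGE_PMD_ORDER			9	/* assumption: 4K pages, 2M PMD */
#define HPAGE_PMD_NR			(1U << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MIN_MTHP_ORDER	2

struct attempt {
	unsigned int offset;
	unsigned int order;
};

/*
 * Walk the PMD range like the forward loop, assuming every collapse attempt
 * fails. Records up to 'max' attempts in 'log' and returns the total number
 * of attempts made.
 */
static size_t simulate_forward_loop(unsigned long enabled_orders,
				    struct attempt *log, size_t max)
{
	unsigned int offset = 0, order = HPAGE_PMD_ORDER;
	size_t n = 0;

	assert(log || !max);

	while (offset < HPAGE_PMD_NR) {
		unsigned int nr_ptes = 1U << order;

		/* Only enabled orders are actually attempted. */
		if (enabled_orders & (1UL << order)) {
			if (n < max)
				log[n] = (struct attempt){ offset, order };
			n++;
		}

		/* Any smaller order still enabled? Retry the same offset. */
		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
		    (enabled_orders & ((1UL << order) - 1))) {
			order--;
			continue;
		}

		/* offset is non-zero from here on, so ctz is well defined. */
		offset += nr_ptes;
		order = (unsigned int)__builtin_ctz(offset);
		if (order > HPAGE_PMD_ORDER)
			order = HPAGE_PMD_ORDER;
	}
	return n;
}
```

With order-4 disabled and orders 9, 8, 7, 6, 5, 3 and 2 enabled, the attempts
at offset 0 in this sketch run 9, 8, 7, 6, 5, 3, 2, so order-3 is still
reached: the disabled order is stepped over rather than ending the descent.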

>
> Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined
> behaviour no?
>
> https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5
> Says as much?
>
> I guess we're assuming we're not going to get to 0 here, but that could do with
> a comment or a VM_WARN_ON_ONCE() at least.
>
> Also why aren't we making this a function?
>
> static inline unsigned int max_order_from_offset(unsigned int offset)
> {
> 	if (!offset)
> 		return HPAGE_PMD_ORDER;
>
> 	return __ffs(offset);
> }
>
> Though __ffs() works on unsigned long... probably... OK?
>
> >  	}
> >  done:
> >  	if (collapsed)
> > --
> > 2.54.0
> >
> >
> >
> > >
> > > [...]
> > >
> > >>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > >>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
> > >>>>       nodes_clear(cc->alloc_nmask);
> > >>>> +
> > >>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > >>>> +
> > >>>> +     /*
> > >>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > >>>> +      * scan all pages to populate the bitmap for mTHP collapse.
> > >>>> +      */
> > >>>
> > >>> You should note here, that we re-verify in mthp_collapse().
> > >>>
> > >>> But the question is, whether we should relocate the check completely into
> > >>> mthp_collapse(), instead of conditionally duplicating it.
> > >>>
> > >>> What speaks against always populating the bitmap and making the decision in
> > >>> mthp_collapse()?
> > >>>
> > >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> > >>> I am not sure if scanning some more page table entries is really that critical here.
> > >>
> > >> Someone asked me to preserve the legacy behavior (PMD only). Although
> > >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> > >> have to do 511 iterations for no reason if PMD collapse is the only
> > >> enabled order rather than bailing immediately.
> > >>
> > >> I'm ok with dropping it, but I think its the correct approach (despite
> > >> the extra complexity). @Usama Arif brought up this point here
> > >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> > >
> > > We talk about regressions, but I am not sure if we care about scanning speed
> > > within a page table that much?
> > >
> > > After all, we locked it and already read some entries.
> > >
> > > Having the same check at two places to optimize for PMD order might right now
> > > feel like a good optimization, but likely an irrelevant one in a near future?
> > >
> > > Anyhow, won't push back, as long as we document why we are special casing things
> > > here.
> > >
> >
>
> Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-01  8:11   ` David Hildenbrand (Arm)
  2026-06-01 12:40     ` Nico Pache
@ 2026-06-04 13:53     ` Lorenzo Stoakes
  2026-06-04 13:59       ` Lorenzo Stoakes
  1 sibling, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 13:53 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

(Checking the algorithm here)

On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)
>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > To enable mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search. This also
> > prevents a traditional recursive approach when the kernel stack struct is
> > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > by attempting the PMD order, then move on to consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),

This is inaccurate: the order is only consecutively smaller until you hit the
smallest one, then it starts bumping around:
2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.

More like consecutively smaller, then always trying for the smallest possible
fit?

Would be good to describe why we do this, presumably to get a best _fit_?

> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >     2) pop the stack
> >     3) check if the number of set bits in that (offset,order) pair
> >        satisfies the max_ptes_none threshold for that order
> >     4) if yes, attempt collapse
> >     5) if no (or collapse fails), push two new stack items representing
> >        the left and right halves of the current bitmap range, at the
> >        next lower order

I notice the ordering is wrong here: you actually push the mid_offset first, then
the offset (i.e. 'right', then 'left'):

			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
						 next_order);
			collapse_mthp_stack_push(cc, &stack_size, offset,
						 next_order);

So that way you are popping the 'left' first then the 'right'.

So seems you'll get:

stack={0, 9}

Pop (0, order=9):

	|----------------------------------------|
	|########################################|
	|----------------------------------------|

stack={256, 8}, {0, 8}

Pop (0, order=8):

	|--------------------|-------------------|
	|####################|                   |
	|--------------------|-------------------|


stack={256, 8}, {128, 7}, {0, 7}

Pop (0, order=7):

	|----------|-----------------------------|
	|##########|                             |
	|----------|-----------------------------|

stack={256, 8}, {128, 7}, {64, 6}, {0, 6}

Pop (0, order=6):

	|----|-----------------------------------|
	|####|                                   |
	|----|-----------------------------------|

...

stack={256, 8}, ..., { 8, 3 }, {0, 2}

Pop (0, order=2):

	|-|--------------------------------------|
	|#|                                      |
	|-|--------------------------------------|

Then finally :) we get to the non-zero offsets :)

stack={256, 8}, ..., {8, 3}, {4, 2}

Pop (4, order=2):

	|-|-|------------------------------------|
	| |#|                                    |
	|-|-|------------------------------------|

stack={256, 8}, ..., { 12, 2 }, {8, 3}

Pop (8, order=3):

	|---|--|---------------------------------|
	|   |##|                                 |
	|---|--|---------------------------------|

stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}

Pop (8, order=2):

	|---|-|----------------------------------|
	|   |#|                                  |
	|---|-|----------------------------------|

etc.


It seems to me that you're going to keep iterating down until you match an mTHP
when a larger mTHP could have been had?

So we're going:

order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2

I guess the point is to avoid only ever trying for the largest possible order.

I guess if we did only try for the largest, then we'd get at most 2 of the
largest possible and then exhaust the whole PMD, should a PMD-sized entry not be
possible.

> >     6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> >                             offset   mid_offset
> >                             |        |
> >                             |        |
> >                             v        v
> >           ____________________________________
> >          |          PTE Page Table            |
> >          --------------------------------------
> > 			    <-------><------->
> >                              order-1  order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.
>
> Why can't you work with offset + cur_order?
>
> Initially,
>
> 	offset = 0;
> 	cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
> 	if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> 		/* Try next smaller order. */
> 		cur_order = cur_order - 1;

OK this matches the stack for the 0 offset entries...

> 	} else {
> 		/* Skip to next chunk. */
> 		offset += 1 << cur_order;
> 		cur_order = max_order_from_offset(offset);

Then 1 << 2 -> 4 so go to offset=4.

max_order_from_offset(4) = 2, so (4, order=2), same as above.

Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3

And, yeah this seems equivalent.

> 	}

>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).

Something like?

static unsigned long max_order_from_offset(unsigned long offset)
{
	if (!offset)
		return HPAGE_PMD_ORDER;

	return ilog2(offset);
}

>
> What's the benefit of the stack?

Yeah it seems equivalent. Good idea!
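
To back up the "seems equivalent" claim, here is a minimal userspace sketch that records the (offset, order) pairs visited by the stack walk and by the offset + cur_order walk, then compares them. It is not the kernel code: it assumes 512 PTEs per PMD, assumes every order from 2 to 9 is enabled (disabled orders are not modelled), takes the lowest set bit of the offset as the natural buddy order, and uses a hypothetical `attempt` callback in place of the threshold check plus collapse_huge_page().

```c
#include <assert.h>

#define PMD_ORDER	9	/* assumption: 512 PTEs per PMD (4 KiB pages) */
#define MIN_ORDER	2	/* mirrors KHUGEPAGED_MIN_MTHP_ORDER */
#define STACK_SIZE	(PMD_ORDER - MIN_ORDER + 1)
#define MAX_VISITS	256	/* at most 255 (offset, order) pairs exist */

struct range {
	unsigned int offset;
	unsigned int order;
};

/* Hypothetical stand-in for "threshold met and collapse succeeded". */
typedef int (*attempt_fn)(unsigned int offset, unsigned int order);

static int never(unsigned int offset, unsigned int order)
{
	(void)offset;
	(void)order;
	return 0;
}

static int order4_only(unsigned int offset, unsigned int order)
{
	(void)offset;
	return order == 4;
}

/* Natural buddy order of an offset: lowest set bit, capped at PMD order. */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order;

	if (!offset)
		return PMD_ORDER;
	order = (unsigned int)__builtin_ctz(offset);
	return order < PMD_ORDER ? order : PMD_ORDER;
}

/* Stack walk as in the patch: push the right half, then the left half. */
static int walk_stack(attempt_fn attempt, struct range *out)
{
	struct range stack[STACK_SIZE];
	int top = 0, n = 0;

	stack[top++] = (struct range){ 0, PMD_ORDER };
	while (top) {
		struct range r = stack[--top];

		out[n++] = r;
		if (attempt(r.offset, r.order) || r.order == MIN_ORDER)
			continue;
		assert(top + 2 <= STACK_SIZE);
		stack[top++] = (struct range){ r.offset + (1u << (r.order - 1)),
					       r.order - 1 };
		stack[top++] = (struct range){ r.offset, r.order - 1 };
	}
	return n;
}

/* Stack-free walk: shrink the order on failure, otherwise advance. */
static int walk_linear(attempt_fn attempt, struct range *out)
{
	unsigned int offset = 0, order = PMD_ORDER;
	int n = 0;

	while (offset < (1u << PMD_ORDER)) {
		out[n++] = (struct range){ offset, order };
		if (!attempt(offset, order) && order > MIN_ORDER) {
			order--;
		} else {
			offset += 1u << order;
			order = max_order_from_offset(offset);
		}
	}
	return n;
}

/* Returns 1 if both walks visit the same (offset, order) sequence. */
static int same_walk(attempt_fn attempt)
{
	struct range a[MAX_VISITS], b[MAX_VISITS];
	int na = walk_stack(attempt, a), nb = walk_linear(attempt, b), i;

	if (na != nb)
		return 0;
	for (i = 0; i < na; i++)
		if (a[i].offset != b[i].offset || a[i].order != b[i].order)
			return 0;
	return 1;
}
```

Calling same_walk(never) covers the case where nothing collapses, and same_walk(order4_only) covers a mix of failures and successes. This only compares visiting order under the stated assumptions, so it says nothing about locking or about how disabled orders should be skipped.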

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-04 13:53     ` Lorenzo Stoakes
@ 2026-06-04 13:59       ` Lorenzo Stoakes
  0 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 13:59 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Thu, Jun 04, 2026 at 02:53:39PM +0100, Lorenzo Stoakes wrote:
> (Checking the algorithm here)
>
> On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> > On 5/22/26 17:00, Nico Pache wrote:
> >
> > Finally time for the core piece :)
> >
> > > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > > main scanning logic using a bitmap to track occupied pages and a stack
> > > structure that allows us to find optimal collapse sizes.
> > >
> > > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > > scanning phase (mmap_read_lock) that determines a potential PMD
> > > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > > phase (mmap_write_lock).
> > >
> > > To enable mTHP collapse we make the following changes:
> > >
> > > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > > orders are enabled, we remove the restriction of max_ptes_none during the
> > > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > > have scanned the full PMD range and updated the bitmap to track occupied
> > > pages, we use the bitmap to find the optimal mTHP size.
> > >
> > > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > > and determine the best eligible order for the collapse. A stack structure
> > > is used instead of traditional recursion to manage the search. This also
> > > prevents a traditional recursive approach when the kernel stack struct is
> > > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > > find the highest order mTHPs that satisfy the collapse criteria. We start
> > > by attempting the PMD order, then move on to consecutively lower orders
> > > (mTHP collapse). The stack maintains a pair of variables (offset, order),
>
> This is inaccurate, it's only consecutively smaller until you hit smallest then
> it starts bumping around 2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.
>
> More like consecutively smaller, then always trying for the smallest possible
> fit?
>
> Would be good to describe why we do this, presumably to get a best _fit_?
>
> > > indicating the number of PTEs from the start of the PMD, and the order of
> > > the potential collapse candidate.
> > >
> > > The algorithm for consuming the bitmap works as such:
> > >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> > >     2) pop the stack
> > >     3) check if the number of set bits in that (offset,order) pair
> > >        satisfies the max_ptes_none threshold for that order
> > >     4) if yes, attempt collapse
> > >     5) if no (or collapse fails), push two new stack items representing
> > >        the left and right halves of the current bitmap range, at the
> > >        next lower order
>
> I notice the ordering is wrong here, you actually push the mid_offset first then
> the offset (e.g. 'right', then 'left'):
>
> 			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> 						 next_order);
> 			collapse_mthp_stack_push(cc, &stack_size, offset,
> 						 next_order);
>
> So that way you are popping the 'left' first then the 'right'.
>
> So seems you'll get:
>
> stack={0, 9}
>
> Pop (0, order=9):
>
> 	|----------------------------------------|
> 	|########################################|
> 	|----------------------------------------|
>
> stack={256, 8}, {0, 8}
>
> Pop (0, order=8):
>
> 	|--------------------|-------------------|
> 	|####################|                   |
> 	|--------------------|-------------------|
>
>
> stack={256, 8}, {128, 7}, {0, 7}
>
> Pop (0, order=7):
>
> 	|----------|-----------------------------|
> 	|##########|                             |
> 	|----------|-----------------------------|
>
> stack={256, 8}, {128, 7}, {64, 6}, {0, 6}
>
> Pop (0, order=6):
>
> 	|----|-----------------------------------|
> 	|####|                                   |
> 	|----|-----------------------------------|
>
> ...
>
> stack={256, 8}, ..., { 8, 3 }, {0, 2}
>
> Pop (0, order=2):
>
> 	|-|--------------------------------------|
> 	|#|                                      |
> 	|-|--------------------------------------|
>
> Then finally :) we get the offsets :)
>
> stack={256, 8}, ..., {8, 3}, {4, 2}
>
> Pop (4, order=2):
>
> 	|-|-|------------------------------------|
> 	| |#|                                    |
> 	|-|-|------------------------------------|
>
> stack={256, 8}, ..., { 12, 2 }, {8, 3}

(There shouldn't be a {12, 2} there :)

> Pop (8, order=3):
>
> 	|---|--|---------------------------------|
> 	|   |##|                                 |
> 	|---|--|---------------------------------|
>
> stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}

({12, 2} shouldn't be duplicated there :)

>
> Pop (8, order=2):
>
> 	|---|-|----------------------------------|
> 	|   |#|                                  |
> 	|---|-|----------------------------------|
>
> etc.
>
>
> It seems to me that you're going to keep iterating down until you match an mTHP
> when a larger mTHP could have been had?
>
> So we're going:
>
> order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2
>
> I guess the point is to avoid only getting the largest possible
>
>
>
>
> I guess if we did try to get the largest then we'd only get 2 of the largest
> possible then exhaust the whole PMD, should a PMD-sized entry not be possible.
>
> > >     6) repeat at step (2) until stack is empty.
> > >
> > > Below is a diagram representing the algorithm and stack items:
> > >
> > >                             offset   mid_offset
> > >                             |        |
> > >                             |        |
> > >                             v        v
> > >           ____________________________________
> > >          |          PTE Page Table            |
> > >          --------------------------------------
> > > 			    <-------><------->
> > >                              order-1  order-1
> >
> >
> > Reading this, it is unclear why exactly do we need the stack.
> >
> > Why can't you work with offset + cur_order?
> >
> > Initially,
> >
> > 	offset = 0;
> > 	cur_order = HPAGE_PMD_ORDER;
> >
> > If collapse succeeded, advance to next range.
> > If collapse failed, try next smaller order, keeping offset unchanged.
> >
> > 	if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> > 		/* Try next smaller order. */
> > 		cur_order = cur_order - 1;
>
> OK this matches the stack for the 0 offset entries...
>
> > 	} else {
> > 		/* Skip to next chunk. */
> > 		offset += 1 << cur_order;
> > 		cur_order = max_order_from_offset(offset);
>
> Then 1 << 2 -> 4 so go to offset=4.
>
> max_order_from_offset(4) = 2, so (4, order=2), same as above.
>
> Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3
>
> And, yeah this seems equivalent.
>
> > 	}
>
> >
> > Of course, handling disabled orders. max_order_from_offset() is rather trivial
> > (natural buddy order, capped at HPAGE_PMD_ORDER).
>
> Something like?
>
> static unsigned long max_order_from_offset(unsigned long offset)
> {
> 	if (!offset)
> 	   return HPAGE_PMD_ORDER;
>
> 	return ilog2(offset);

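Oops, we need the least significant set bit here, so this should use ffs
(__ffs() gives the 0-based bit index) rather than ilog2().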

> }
>
> >
> > What's the benefit of the stack?
>
> Yeah it seems equivalent. Good idea!
>
> Thanks, Lorenzo
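
On the ilog2() versus ffs point above, a small userspace sketch of why the lowest set bit is the one that matters. It assumes 512 PTEs per PMD and a 32-bit unsigned int, and uses the GCC/Clang builtins __builtin_clz() and __builtin_ctz() as stand-ins for the kernel helpers; the function names are made up for illustration.

```c
#include <assert.h>

#define PMD_ORDER	9	/* assumption: 512 PTEs per PMD */

/* floor(log2(offset)): position of the highest set bit, like ilog2(). */
static unsigned int order_from_msb(unsigned int offset)
{
	assert(offset != 0);
	return 31u - (unsigned int)__builtin_clz(offset);
}

/* Position of the lowest set bit (0-based), capped at the PMD order. */
static unsigned int order_from_lsb(unsigned int offset)
{
	unsigned int order;

	if (!offset)
		return PMD_ORDER;
	order = (unsigned int)__builtin_ctz(offset);
	return order < PMD_ORDER ? order : PMD_ORDER;
}

/* A range of 2^order PTEs is only valid if it starts 2^order-aligned. */
static int range_is_aligned(unsigned int offset, unsigned int order)
{
	return (offset & ((1u << order) - 1)) == 0;
}
```

For power-of-two offsets such as 4 and 8 both helpers agree, which is why the walk-through above still looked right. They diverge at offset 12: the highest set bit gives order 3, which would describe a misaligned 8-PTE range starting at 12, while the lowest set bit gives order 2.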

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
                     ` (2 preceding siblings ...)
  2026-06-01  8:11   ` David Hildenbrand (Arm)
@ 2026-06-04 14:45   ` Lorenzo Stoakes
  2026-06-05 11:07     ` Nico Pache
  3 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Previous to this patch, PMD collapse had 3 main phases, a light weight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> phase (mmap_write_lock).
>
> To enable mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. This also
> prevents a traditional recursive approach when the kernel stack struct is
> limited. The algorithm recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        satisfies the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
>                             offset   mid_offset
>                             |        |
>                             |        |
>                             v        v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
> 			    <-------><------->
>                              order-1  order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
>
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
>
> Once we determine which mTHP sizes fit best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 172 insertions(+), 9 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>
> @@ -110,6 +134,12 @@ struct collapse_control {
>
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>  };
>
>  /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, int referenced, int unmapped,
> +		struct collapse_control *cc, unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if ((BIT(order) - 1) & enabled_orders) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>  	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;

Hmm, this is a bit odd, what if the user set max_ptes_none = 0?

I assume we handle the 0/511 thing elsewhere?
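
As far as I can tell from the mthp_collapse() hunk above, the per-range check (nr_occupied_ptes >= nr_ptes - max_ptes_none) is what enforces the limit once the scan itself no longer does. A minimal userspace sketch of that check, assuming a 512-bit bitmap stored as eight 64-bit words and taking max_ptes_none as a plain parameter (how collapse_max_ptes_none() derives it per order is not modelled here; the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

#define NR_PTES	512	/* assumption: 512 PTEs per PMD */

/* Count occupied PTEs in [offset, offset + nr) of the bitmap. */
static unsigned int count_present(const uint64_t bitmap[NR_PTES / 64],
				  unsigned int offset, unsigned int nr)
{
	unsigned int i, n = 0;

	assert(offset + nr <= NR_PTES);
	for (i = offset; i < offset + nr; i++)
		n += (unsigned int)((bitmap[i / 64] >> (i % 64)) & 1);
	return n;
}

/* Same comparison as in the patch, applied to one (offset, order) range. */
static int range_eligible(const uint64_t bitmap[NR_PTES / 64],
			  unsigned int offset, unsigned int order,
			  unsigned int max_ptes_none)
{
	const unsigned int nr = 1u << order;

	/* Assumed precondition; otherwise the subtraction would wrap. */
	assert(max_ptes_none <= nr);
	return count_present(bitmap, offset, nr) >= nr - max_ptes_none;
}
```

With max_ptes_none=0 only fully occupied ranges pass, regardless of how many none/zero PTEs the scan tolerated while filling the bitmap. With 16 occupied PTEs at the start of the PMD, order 4 at offset 0 is eligible but order 5 is not.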

> +
>  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>  	if (!pte) {
>  		cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>  		cc->progress++;
>
> -		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  	if (result == SCAN_SUCCEED) {
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> -		/* collapse_huge_page will return with the mmap_lock released */
> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> +					     unmapped, cc, enabled_orders);

I guess mthp_collapse() also does PMD collapse if only PMD is enabled?

It feels like this name is a bit confusing then :)

But I guess we can do a follow up to think of a better name possibly.

> +		/* mmap_lock was released above, set lock_dropped */
>  		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.54.0
>

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-04 14:45   ` Lorenzo Stoakes
@ 2026-06-05 11:07     ` Nico Pache
  2026-06-05 11:08       ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-06-05 11:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Thu, Jun 4, 2026 at 8:45 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > To enable mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement mthp_collapse() to perform a binary subdivision of the bitmap
> > and determine the best eligible order for the collapse. An explicit stack
> > is used instead of recursion to manage the search, which avoids deep
> > call chains on the limited kernel stack. The algorithm repeatedly splits
> > the bitmap into smaller chunks to find the highest order mTHPs that
> > satisfy the collapse criteria. We start by attempting the PMD order, then
> > move on to consecutively lower orders (mTHP collapse). The stack
> > maintains a pair of variables (offset, order), indicating the number of
> > PTEs from the start of the PMD, and the order of the potential collapse
> > candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >     2) pop the stack
> >     3) check if the number of set bits in that (offset,order) pair
> >        satisfies the max_ptes_none threshold for that order
> >     4) if yes, attempt collapse
> >     5) if no (or collapse fails), push two new stack items representing
> >        the left and right halves of the current bitmap range, at the
> >        next lower order
> >     6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> >                             offset   mid_offset
> >                             |        |
> >                             |        |
> >                             v        v
> >           ____________________________________
> >          |          PTE Page Table            |
> >          --------------------------------------
> >                           <-------><------->
> >                              order-1  order-1
> >
> > mTHP collapses reject regions containing swapped out or shared pages.
> > This is because adding new entries can lead to new none pages, and these
> > may lead to constant promotion into a higher order mTHP. A similar
> > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > introducing at least 2x the number of pages, and on a future scan will
> > satisfy the promotion condition once again. This issue is prevented via
> > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > restrictions above.
> >
> > We currently only support mTHP collapse for max_ptes_none values of 0
> > and HPAGE_PMD_NR - 1, resulting in the following behavior:
> >
> >     - max_ptes_none=0: Never introduce new empty pages during collapse
> >     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >       available mTHP order
> >
> > Any other max_ptes_none value will emit a warning and default mTHP
> > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > collapse.
> >
> > Once we determine which mTHP sizes fit best in that PMD range, a
> > collapse is attempted. A minimum collapse order of 2 is used as this is
> > the lowest order supported by anon memory as defined by
> > THP_ORDERS_ALL_ANON.
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
> >
> > We can also remove the check for is_khugepaged inside the PMD scan as
> > the collapse_max_ptes_none() function handles this logic now.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 172 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 64ceebc9d8a7..d3d7db8be26c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > +/*
> > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > + * size needed for a DFS on a binary tree is height + 1, where
> > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > + *
> > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > + */
> > +#define MTHP_STACK_SIZE      (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > +
> > +/*
> > + * Defines a range of PTE entries in a PTE page table which are being
> > + * considered for mTHP collapse.
> > + *
> > + * @offset: the offset of the first PTE entry in a PMD range.
> > + * @order: the order of the PTE entries being considered for collapse.
> > + */
> > +struct mthp_range {
> > +     u16 offset;
> > +     u8 order;
> > +};
> > +
> >  struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -110,6 +134,12 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /* Each bit represents a single occupied (!none/zero) page. */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> > +     /* A mask of the current range being considered for mTHP collapse. */
> > +     DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> >  };
> >
> >  /**
> > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >       return result;
> >  }
> >
> > +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > +                                  u16 offset, u8 order)
> > +{
> > +     const int size = *stack_size;
> > +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > +
> > +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > +     stack->order = order;
> > +     stack->offset = offset;
> > +     (*stack_size)++;
> > +}
> > +
> > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > +                                              int *stack_size)
> > +{
> > +     const int size = *stack_size;
> > +
> > +     VM_WARN_ON_ONCE(size <= 0);
> > +     (*stack_size)--;
> > +     return cc->mthp_bitmap_stack[size - 1];
> > +}
> > +
> > +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> > +                                             u16 offset, unsigned int nr_ptes)
> > +{
> > +     bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +}
> > +
> > +/*
> > + * mthp_collapse() consumes the bitmap that is generated during
> > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > + *
> > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > + * variables (offset, order), indicating the number of PTEs from the start of
> > + * the PMD, and the order of the potential collapse candidate respectively. We
> > + * start at the PMD order and check if it is eligible for collapse; if not, we
> > + * add two entries to the stack at a lower order to represent the left and right
> > + * halves of the PTE page table we are examining.
> > + *
> > + *                         offset       mid_offset
> > + *                         |         |
> > + *                         |         |
> > + *                         v         v
> > + *      --------------------------------------
> > + *      |          cc->mthp_bitmap            |
> > + *      --------------------------------------
> > + *                         <-------><------->
> > + *                          order-1  order-1
> > + *
> > + * For each of these, we determine how many PTE entries are occupied in the
> > + * range of PTE entries we propose to collapse, then we compare this to a
> > + * threshold number of PTE entries which would need to be occupied for a
> > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > + *
> > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > + * mTHP.
> > + */
> > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > +             unsigned long address, int referenced, int unmapped,
> > +             struct collapse_control *cc, unsigned long enabled_orders)
> > +{
> > +     unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > +     int collapsed = 0, stack_size = 0;
> > +     unsigned long collapse_address;
> > +     struct mthp_range range;
> > +     u16 offset;
> > +     u8 order;
> > +
> > +     collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > +
> > +     while (stack_size) {
> > +             range = collapse_mthp_stack_pop(cc, &stack_size);
> > +             order = range.order;
> > +             offset = range.offset;
> > +             nr_ptes = 1UL << order;
> > +
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next_order;
> > +
> > +             max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > +
> > +             nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > +                                                            nr_ptes);
> > +
> > +             if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > +                     int ret;
> > +
> > +                     collapse_address = address + offset * PAGE_SIZE;
> > +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> > +                                              unmapped, cc, order);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += nr_ptes;
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next_order:
> > +             if ((BIT(order) - 1) & enabled_orders) {
> > +                     const u8 next_order = order - 1;
> > +                     const u16 mid_offset = offset + (nr_ptes / 2);
> > +
> > +                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > +                                              next_order);
> > +                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > +                                              next_order);
> > +             }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> >               bool *lock_dropped, struct collapse_control *cc)
> >  {
> > -     const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> > +     unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > +     enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> >       pmd_t *pmd;
> > -     pte_t *pte, *_pte;
> > -     int none_or_zero = 0, shared = 0, referenced = 0;
> > +     pte_t *pte, *_pte, pteval;
> > +     int i;
> > +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long addr;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >
> > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> > +
> > +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > +
> > +     /*
> > +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > +      * scan all pages to populate the bitmap for mTHP collapse.
> > +      */
> > +     if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > +             max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>
> Hmm, this is a bit odd, what if the user set max_ptes_none = 0?

We'd still want to scan the full PMD to populate the bitmap. That way
we can find the smaller orders that contain 0 none/zero PTEs.
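
To illustrate the point above, here is a minimal userspace sketch (not the
kernel code) of walking a fully populated bitmap with max_ptes_none=0. It
assumes a hypothetical 16-entry range (order 4) instead of 512 PTEs, treats
every order as enabled, and pretends each collapse succeeds. All sketch_*
and SKETCH_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_PMD_ORDER 4	/* hypothetical: 16 entries, not 512 */
#define SKETCH_MIN_ORDER 2	/* mirrors KHUGEPAGED_MIN_MTHP_ORDER */
#define SKETCH_NR	 (1u << SKETCH_PMD_ORDER)
/* DFS on a binary tree needs height + 1 stack slots. */
#define SKETCH_STACK	 (SKETCH_PMD_ORDER - SKETCH_MIN_ORDER + 1)

struct sketch_range {
	unsigned int offset;
	unsigned int order;
};

/* Count occupied entries in [offset, offset + nr). */
static unsigned int sketch_count_present(const bool *occupied,
					 unsigned int offset, unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr; i++)
		n += occupied[offset + i];
	return n;
}

/*
 * Iterative DFS with max_ptes_none == 0: a range is eligible only when
 * every entry in it is occupied. Returns the number of entries covered
 * by eligible ranges. Assumption: every "collapse" succeeds.
 */
static unsigned int sketch_collapse(const bool *occupied)
{
	struct sketch_range stack[SKETCH_STACK];
	unsigned int covered = 0;
	int top = 0;

	stack[top++] = (struct sketch_range){ 0, SKETCH_PMD_ORDER };

	while (top) {
		struct sketch_range r = stack[--top];
		unsigned int nr = 1u << r.order;

		if (sketch_count_present(occupied, r.offset, nr) == nr) {
			printf("collapse offset=%u order=%u\n",
			       r.offset, r.order);
			covered += nr;
			continue;
		}

		if (r.order > SKETCH_MIN_ORDER) {
			/* Push right half first so the left half pops first. */
			assert(top + 2 <= SKETCH_STACK);
			stack[top++] = (struct sketch_range){
				r.offset + nr / 2, r.order - 1 };
			stack[top++] = (struct sketch_range){
				r.offset, r.order - 1 };
		}
	}
	return covered;
}
```

With entries 0-7 and 12-15 occupied, the full range is rejected, the left
half is eligible at order 3, and only the last quarter of the right half is
eligible at order 2. None of that would be visible if the scan had bailed
out at the first none/zero PTE.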

>
> I assume we handle the 0/511 thing elsewhere?

Yes, in the bitmap weight check and in collapse_huge_page_isolate().
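
As a rough standalone sketch of the bitmap weight check, assuming the
per-order max_ptes_none has already been resolved by the caller (how
collapse_max_ptes_none() derives it is not reproduced here), the test from
the loop can be written as below. sketch_range_eligible() is a hypothetical
name, and the comparison is rearranged as an addition so it cannot
underflow.

```c
#include <stdbool.h>

/*
 * Equivalent to: nr_occupied_ptes >= nr_ptes - max_ptes_none
 * with nr_ptes = 1 << order. Written as a sum to avoid unsigned
 * underflow if max_ptes_none were ever larger than nr_ptes
 * (a defensive choice made for this sketch only).
 */
static bool sketch_range_eligible(unsigned int nr_occupied_ptes,
				  unsigned int order,
				  unsigned int max_ptes_none)
{
	const unsigned int nr_ptes = 1u << order;

	return nr_occupied_ptes + max_ptes_none >= nr_ptes;
}
```

For an order-4 range, max_ptes_none=0 requires all 16 entries to be
occupied, while a limit of 15 accepts the range as soon as a single entry is
occupied.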

>
> > +
> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >       if (!pte) {
> >               cc->progress++;
> > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, addr += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             _pte = pte + i;
> > +             addr = start_addr + i * PAGE_SIZE;
> > +             pteval = ptep_get(_pte);
> > +
> >               cc->progress++;
> >
> > -             pte_t pteval = ptep_get(_pte);
> >               if (pte_none_or_zero(pteval)) {
> >                       if (++none_or_zero > max_ptes_none) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >
> > +             /* Set bit for occupied pages */
> > +             __set_bit(i, cc->mthp_bitmap);
> >               /*
> >                * Record which node the original page is from and save this
> >                * information to cc->node_load[].
> > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> > -             result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc, HPAGE_PMD_ORDER);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > +             nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > +                                          unmapped, cc, enabled_orders);
>
> I guess mthp_collapse() also does PMD collapse if only PMD is enabled?
>
> It feels like this name is a bit confusing then :)
>
> But I guess we can do a follow up to think of a better name possibly.
>
> > +             /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> > +             result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > --
> > 2.54.0
> >
>
> Thanks, Lorenzo
>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
  2026-06-05 11:07     ` Nico Pache
@ 2026-06-05 11:08       ` Nico Pache
  0 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-05 11:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Fri, Jun 5, 2026 at 5:07 AM Nico Pache <npache@redhat.com> wrote:
>
> On Thu, Jun 4, 2026 at 8:45 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> > > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > > main scanning logic using a bitmap to track occupied pages and a stack
> > > structure that allows us to find optimal collapse sizes.
> > >
> > > Prior to this patch, PMD collapse had three main phases: a lightweight
> > > scanning phase (mmap_read_lock) that identifies a potential PMD
> > > collapse, an allocation phase (mmap unlocked), and finally a heavier
> > > collapse phase (mmap_write_lock).
> > >
> > > To enable mTHP collapse we make the following changes:
> > >
> > > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > > orders are enabled, we remove the restriction of max_ptes_none during the
> > > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > > have scanned the full PMD range and updated the bitmap to track occupied
> > > pages, we use the bitmap to find the optimal mTHP size.
> > >
> > > Implement mthp_collapse() to perform a binary subdivision of the bitmap
> > > and determine the best eligible order for the collapse. An explicit stack
> > > is used instead of recursion to manage the search, which avoids deep
> > > call chains on the limited kernel stack. The algorithm repeatedly splits
> > > the bitmap into smaller chunks to find the highest order mTHPs that
> > > satisfy the collapse criteria. We start by attempting the PMD order, then
> > > move on to consecutively lower orders (mTHP collapse). The stack
> > > maintains a pair of variables (offset, order), indicating the number of
> > > PTEs from the start of the PMD, and the order of the potential collapse
> > > candidate.
> > >
> > > The algorithm for consuming the bitmap works as such:
> > >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> > >     2) pop the stack
> > >     3) check if the number of set bits in that (offset,order) pair
> > >        satisfies the max_ptes_none threshold for that order
> > >     4) if yes, attempt collapse
> > >     5) if no (or collapse fails), push two new stack items representing
> > >        the left and right halves of the current bitmap range, at the
> > >        next lower order
> > >     6) repeat at step (2) until stack is empty.
> > >
> > > Below is a diagram representing the algorithm and stack items:
> > >
> > >                             offset   mid_offset
> > >                             |        |
> > >                             |        |
> > >                             v        v
> > >           ____________________________________
> > >          |          PTE Page Table            |
> > >          --------------------------------------
> > >                           <-------><------->
> > >                              order-1  order-1
> > >
> > > mTHP collapses reject regions containing swapped out or shared pages.
> > > This is because adding new entries can lead to new none pages, and these
> > > may lead to constant promotion into a higher order mTHP. A similar
> > > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > > introducing at least 2x the number of pages, and on a future scan will
> > > satisfy the promotion condition once again. This issue is prevented via
> > > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > > restrictions above.
> > >
> > > We currently only support mTHP collapse for max_ptes_none values of 0
> > > and HPAGE_PMD_NR - 1, resulting in the following behavior:
> > >
> > >     - max_ptes_none=0: Never introduce new empty pages during collapse
> > >     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> > >       available mTHP order
> > >
> > > Any other max_ptes_none value will emit a warning and default mTHP
> > > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > > collapse.
> > >
> > > Once we determine which mTHP sizes fit best in that PMD range, a
> > > collapse is attempted. A minimum collapse order of 2 is used as this is
> > > the lowest order supported by anon memory as defined by
> > > THP_ORDERS_ALL_ANON.
> > >
> > > Currently madv_collapse is not supported and will only attempt PMD
> > > collapse.
> > >
> > > We can also remove the check for is_khugepaged inside the PMD scan as
> > > the collapse_max_ptes_none() function handles this logic now.
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 172 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 64ceebc9d8a7..d3d7db8be26c 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > >
> > >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> > >
> > > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > > +/*
> > > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > > + * size needed for a DFS on a binary tree is height + 1, where
> > > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > > + *
> > > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > > + */
> > > +#define MTHP_STACK_SIZE      (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > > +
> > > +/*
> > > + * Defines a range of PTE entries in a PTE page table which are being
> > > + * considered for mTHP collapse.
> > > + *
> > > + * @offset: the offset of the first PTE entry in a PMD range.
> > > + * @order: the order of the PTE entries being considered for collapse.
> > > + */
> > > +struct mthp_range {
> > > +     u16 offset;
> > > +     u8 order;
> > > +};
> > > +
> > >  struct collapse_control {
> > >       bool is_khugepaged;
> > >
> > > @@ -110,6 +134,12 @@ struct collapse_control {
> > >
> > >       /* nodemask for allocation fallback */
> > >       nodemask_t alloc_nmask;
> > > +
> > > +     /* Each bit represents a single occupied (!none/zero) page. */
> > > +     DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> > > +     /* A mask of the current range being considered for mTHP collapse. */
> > > +     DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > > +     struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> > >  };
> > >
> > >  /**
> > > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> > >       return result;
> > >  }
> > >
> > > +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > > +                                  u16 offset, u8 order)
> > > +{
> > > +     const int size = *stack_size;
> > > +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > > +
> > > +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > > +     stack->order = order;
> > > +     stack->offset = offset;
> > > +     (*stack_size)++;
> > > +}
> > > +
> > > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > > +                                              int *stack_size)
> > > +{
> > > +     const int size = *stack_size;
> > > +
> > > +     VM_WARN_ON_ONCE(size <= 0);
> > > +     (*stack_size)--;
> > > +     return cc->mthp_bitmap_stack[size - 1];
> > > +}
> > > +
> > > +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> > > +                                             u16 offset, unsigned int nr_ptes)
> > > +{
> > > +     bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > > +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > > +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > > +}
> > > +
> > > +/*
> > > + * mthp_collapse() consumes the bitmap that is generated during
> > > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > > + *
> > > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> > > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > > + * variables (offset, order), indicating the number of PTEs from the start of
> > > + * the PMD, and the order of the potential collapse candidate respectively. We
> > > + * start at the PMD order and check if it is eligible for collapse; if not, we
> > > + * add two entries to the stack at a lower order to represent the left and right
> > > + * halves of the PTE page table we are examining.
> > > + *
> > > + *                         offset       mid_offset
> > > + *                         |         |
> > > + *                         |         |
> > > + *                         v         v
> > > + *      --------------------------------------
> > > + *      |          cc->mthp_bitmap            |
> > > + *      --------------------------------------
> > > + *                         <-------><------->
> > > + *                          order-1  order-1
> > > + *
> > > + * For each of these, we determine how many PTE entries are occupied in the
> > > + * range of PTE entries we propose to collapse, then we compare this to a
> > > + * threshold number of PTE entries which would need to be occupied for a
> > > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > > + *
> > > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > > + * mTHP.
> > > + */
> > > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +             unsigned long address, int referenced, int unmapped,
> > > +             struct collapse_control *cc, unsigned long enabled_orders)
> > > +{
> > > +     unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > > +     int collapsed = 0, stack_size = 0;
> > > +     unsigned long collapse_address;
> > > +     struct mthp_range range;
> > > +     u16 offset;
> > > +     u8 order;
> > > +
> > > +     collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > > +
> > > +     while (stack_size) {
> > > +             range = collapse_mthp_stack_pop(cc, &stack_size);
> > > +             order = range.order;
> > > +             offset = range.offset;
> > > +             nr_ptes = 1UL << order;
> > > +
> > > +             if (!test_bit(order, &enabled_orders))
> > > +                     goto next_order;
> > > +
> > > +             max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > > +
> > > +             nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > > +                                                            nr_ptes);
> > > +
> > > +             if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > > +                     int ret;
> > > +
> > > +                     collapse_address = address + offset * PAGE_SIZE;
> > > +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> > > +                                              unmapped, cc, order);
> > > +                     if (ret == SCAN_SUCCEED) {
> > > +                             collapsed += nr_ptes;
> > > +                             continue;
> > > +                     }
> > > +             }
> > > +
> > > +next_order:
> > > +             if ((BIT(order) - 1) & enabled_orders) {
> > > +                     const u8 next_order = order - 1;
> > > +                     const u16 mid_offset = offset + (nr_ptes / 2);
> > > +
> > > +                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > > +                                              next_order);
> > > +                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > > +                                              next_order);
> > > +             }
> > > +     }
> > > +     return collapsed;
> > > +}
> > > +
> > >  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >               struct vm_area_struct *vma, unsigned long start_addr,
> > >               bool *lock_dropped, struct collapse_control *cc)
> > >  {
> > > -     const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> > >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> > > +     unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > > +     enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> > >       pmd_t *pmd;
> > > -     pte_t *pte, *_pte;
> > > -     int none_or_zero = 0, shared = 0, referenced = 0;
> > > +     pte_t *pte, *_pte, pteval;
> > > +     int i;
> > > +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> > >       enum scan_result result = SCAN_FAIL;
> > >       struct page *page = NULL;
> > >       struct folio *folio = NULL;
> > >       unsigned long addr;
> > > +     unsigned long enabled_orders;
> > >       spinlock_t *ptl;
> > >       int node = NUMA_NO_NODE, unmapped = 0;
> > >
> > > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >               goto out;
> > >       }
> > >
> > > +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > >       memset(cc->node_load, 0, sizeof(cc->node_load));
> > >       nodes_clear(cc->alloc_nmask);
> > > +
> > > +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > > +
> > > +     /*
> > > +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > > +      * scan all pages to populate the bitmap for mTHP collapse.
> > > +      */
> > > +     if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > > +             max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >
> > Hmm, this is a bit odd, what if the user set max_ptes_none = 0?
>
> We'd still want to scan the full PMD to populate the bitmap. That way
> we can find the smaller orders that contain 0 none/zero PTEs.
>
> >
> > I assume we handle the 0/511 thing elsewhere?
>
> Yes in the bitmap weight check and in collapse_huge_page_isolate()
>
> >
> > > +
> > >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> > >       if (!pte) {
> > >               cc->progress++;
> > > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >               goto out;
> > >       }
> > >
> > > -     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > -          _pte++, addr += PAGE_SIZE) {
> > > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > > +             _pte = pte + i;
> > > +             addr = start_addr + i * PAGE_SIZE;
> > > +             pteval = ptep_get(_pte);
> > > +
> > >               cc->progress++;
> > >
> > > -             pte_t pteval = ptep_get(_pte);
> > >               if (pte_none_or_zero(pteval)) {
> > >                       if (++none_or_zero > max_ptes_none) {
> > >                               result = SCAN_EXCEED_NONE_PTE;
> > > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >                       }
> > >               }
> > >
> > > +             /* Set bit for occupied pages */
> > > +             __set_bit(i, cc->mthp_bitmap);
> > >               /*
> > >                * Record which node the original page is from and save this
> > >                * information to cc->node_load[].
> > > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >       if (result == SCAN_SUCCEED) {
> > >               /* collapse_huge_page expects the lock to be dropped before calling */
> > >               mmap_read_unlock(mm);
> > > -             result = collapse_huge_page(mm, start_addr, referenced,
> > > -                                         unmapped, cc, HPAGE_PMD_ORDER);
> > > -             /* collapse_huge_page will return with the mmap_lock released */
> > > +             nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > > +                                          unmapped, cc, enabled_orders);
> >
> > I guess mthp_collapse() also does PMD collapse if only PMD is enabled?
> >
> > It feels like this name is a bit confusing then :)
> >
> > But I guess we can do a follow up to think of a better name possibly.

Yeah, ideally we can clean that up later!

Thank you so much for reviewing and verifying the new algorithm. The
diff I sent was just a draft; I have already added comments and
cleaned up the code further.

Cheers :)
-- Nico

> >
> > > +             /* mmap_lock was released above, set lock_dropped */
> > >               *lock_dropped = true;
> > > +             result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> > >       }
> > >  out:
> > >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > > --
> > > 2.54.0
> > >
> >
> > Thanks, Lorenzo
> >
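
For anyone following the bitmap discussion in this sub-thread, here is a
minimal user-space sketch of the idea, not the code from the series. The
scan records one bit per occupied PTE across the whole PMD range, and a
later pass checks an aligned region of a candidate order against a
none/zero budget. All names (SKETCH_PMD_NR, region_weight(), region_fits())
are hypothetical, a plain bool array stands in for the kernel bitmap, and
4K pages with a 512-entry PMD are assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4K base pages and a 2M PMD, i.e. 512 PTEs per PMD. */
#define SKETCH_PMD_ORDER 9
#define SKETCH_PMD_NR    (1u << SKETCH_PMD_ORDER)

/*
 * Count the occupied PTEs in the naturally aligned region of 2^order
 * pages that starts at 'offset' within the PMD range.
 */
unsigned int region_weight(const bool *occupied, unsigned int offset,
			   unsigned int order)
{
	unsigned int nr = 1u << order;
	unsigned int weight = 0;

	assert(order <= SKETCH_PMD_ORDER);
	assert(offset % nr == 0 && offset + nr <= SKETCH_PMD_NR);

	for (unsigned int i = 0; i < nr; i++)
		weight += occupied[offset + i];

	return weight;
}

/*
 * A region is a collapse candidate when its number of none/zero PTEs
 * does not exceed the budget 'max_none'.
 */
bool region_fits(const bool *occupied, unsigned int offset,
		 unsigned int order, unsigned int max_none)
{
	unsigned int nr = 1u << order;

	return nr - region_weight(occupied, offset, order) <= max_none;
}
```

With only the first 16 PTEs occupied and a budget of 0, the order-4 region
at offset 0 qualifies while the full PMD does not, which is why the scan
has to cover the whole range even when max_ptes_none is 0.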



* [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (10 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-31  7:31   ` Lance Yang
  2026-05-22 15:00 ` [PATCH mm-unstable v18 13/14] mm/khugepaged: run khugepaged for all orders Nico Pache
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
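Not part of the patch: a minimal standalone sketch of the three-way
classification that the switch in the diff below performs. The enum is a
trimmed, hypothetical stand-in for the kernel's scan results (only a few
values are modelled), and classify() and enum next_step are illustrative
names that do not exist in mm/khugepaged.c.

```c
#include <assert.h>

/* Hypothetical, trimmed stand-in for the kernel's scan result codes. */
enum sketch_scan_result {
	SK_SUCCEED,
	SK_PTE_MAPPED_HUGEPAGE,
	SK_EXCEED_NONE_PTE,
	SK_ALLOC_HUGE_PAGE_FAIL,
	SK_CGROUP_CHARGE_FAIL,
	SK_RESULT_MAX,
};

/* What the collapse loop does after one attempt at a given order. */
enum next_step {
	NEXT_CANDIDATE,	/* move on to the next region at this order */
	NEXT_ORDER,	/* give up on this order, try a lower one */
	STOP,		/* no further collapse is possible */
};

enum next_step classify(enum sketch_scan_result ret)
{
	assert(ret < SK_RESULT_MAX);

	switch (ret) {
	case SK_SUCCEED:
	case SK_PTE_MAPPED_HUGEPAGE:
		return NEXT_CANDIDATE;
	case SK_EXCEED_NONE_PTE:
	case SK_ALLOC_HUGE_PAGE_FAIL:
		return NEXT_ORDER;
	default:
		/* Anything not listed above ends the whole attempt. */
		return STOP;
	}
}
```

In this sketch, as in the patch as posted, a cgroup charge failure is not
in the retry list and therefore takes the default path.
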
 mm/khugepaged.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3d7db8be26c..15b7298bc225 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
 			collapse_address = address + offset * PAGE_SIZE;
 			ret = collapse_huge_page(mm, collapse_address, referenced,
 						 unmapped, cc, order);
-			if (ret == SCAN_SUCCEED) {
+
+			switch (ret) {
+			/* Cases where we continue to next collapse candidate */
+			case SCAN_SUCCEED:
 				collapsed += nr_ptes;
+				fallthrough;
+			case SCAN_PTE_MAPPED_HUGEPAGE:
 				continue;
+			/* Cases where lower orders might still succeed */
+			case SCAN_LACK_REFERENCED_PAGE:
+			case SCAN_EXCEED_NONE_PTE:
+			case SCAN_EXCEED_SWAP_PTE:
+			case SCAN_EXCEED_SHARED_PTE:
+			case SCAN_PAGE_LOCK:
+			case SCAN_PAGE_COUNT:
+			case SCAN_PAGE_NULL:
+			case SCAN_DEL_PAGE_LRU:
+			case SCAN_PTE_NON_PRESENT:
+			case SCAN_PTE_UFFD_WP:
+			case SCAN_ALLOC_HUGE_PAGE_FAIL:
+			case SCAN_PAGE_LAZYFREE:
+				goto next_order;
+			/* Cases where no further collapse is possible */
+			default:
+				return collapsed;
 			}
 		}
 
-- 
2.54.0



* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-05-22 15:00 ` [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2026-05-31  7:31   ` Lance Yang
  2026-05-31 20:02     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Lance Yang @ 2026-05-31  7:31 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif


On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>There are cases where, if an attempted collapse fails, all subsequent
>orders are guaranteed to also fail. Avoid these collapse attempts by
>bailing out early.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index d3d7db8be26c..15b7298bc225 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> 			collapse_address = address + offset * PAGE_SIZE;
> 			ret = collapse_huge_page(mm, collapse_address, referenced,
> 						 unmapped, cc, order);
>-			if (ret == SCAN_SUCCEED) {
>+
>+			switch (ret) {
>+			/* Cases where we continue to next collapse candidate */
>+			case SCAN_SUCCEED:
> 				collapsed += nr_ptes;
>+				fallthrough;
>+			case SCAN_PTE_MAPPED_HUGEPAGE:
> 				continue;
>+			/* Cases where lower orders might still succeed */
>+			case SCAN_LACK_REFERENCED_PAGE:
>+			case SCAN_EXCEED_NONE_PTE:
>+			case SCAN_EXCEED_SWAP_PTE:
>+			case SCAN_EXCEED_SHARED_PTE:
>+			case SCAN_PAGE_LOCK:
>+			case SCAN_PAGE_COUNT:
>+			case SCAN_PAGE_NULL:
>+			case SCAN_DEL_PAGE_LRU:
>+			case SCAN_PTE_NON_PRESENT:
>+			case SCAN_PTE_UFFD_WP:
>+			case SCAN_ALLOC_HUGE_PAGE_FAIL:

Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
here?

If charging the current order fails, a smaller order might still fit :)

Cheers, Lance

>+			case SCAN_PAGE_LAZYFREE:
>+				goto next_order;
>+			/* Cases where no further collapse is possible */
>+			default:
>+				return collapsed;
> 			}
> 		}
> 
>-- 
>2.54.0
>
>


* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-05-31  7:31   ` Lance Yang
@ 2026-05-31 20:02     ` David Hildenbrand (Arm)
  2026-06-01  1:53       ` Lance Yang
  0 siblings, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-31 20:02 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif

On 5/31/26 09:31, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>> There are cases where, if an attempted collapse fails, all subsequent
>> orders are guaranteed to also fail. Avoid these collapse attempts by
>> bailing out early.
>>
>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 24 +++++++++++++++++++++++-
>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d3d7db8be26c..15b7298bc225 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>> 			collapse_address = address + offset * PAGE_SIZE;
>> 			ret = collapse_huge_page(mm, collapse_address, referenced,
>> 						 unmapped, cc, order);
>> -			if (ret == SCAN_SUCCEED) {
>> +
>> +			switch (ret) {
>> +			/* Cases where we continue to next collapse candidate */
>> +			case SCAN_SUCCEED:
>> 				collapsed += nr_ptes;
>> +				fallthrough;
>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>> 				continue;
>> +			/* Cases where lower orders might still succeed */
>> +			case SCAN_LACK_REFERENCED_PAGE:
>> +			case SCAN_EXCEED_NONE_PTE:
>> +			case SCAN_EXCEED_SWAP_PTE:
>> +			case SCAN_EXCEED_SHARED_PTE:
>> +			case SCAN_PAGE_LOCK:
>> +			case SCAN_PAGE_COUNT:
>> +			case SCAN_PAGE_NULL:
>> +			case SCAN_DEL_PAGE_LRU:
>> +			case SCAN_PTE_NON_PRESENT:
>> +			case SCAN_PTE_UFFD_WP:
>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> 
> Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
> here?
> 
> If charging the current order fails, a smaller order might still fit :)

I think the reasoning here was that if we are already that close to our
memory limit, we should just give up instead of trying to squeeze it in :)

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
  2026-05-31 20:02     ` David Hildenbrand (Arm)
@ 2026-06-01  1:53       ` Lance Yang
  0 siblings, 0 replies; 114+ messages in thread
From: Lance Yang @ 2026-06-01  1:53 UTC (permalink / raw)
  To: David Hildenbrand (Arm), npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif



On 2026/6/1 04:02, David Hildenbrand (Arm) wrote:
> On 5/31/26 09:31, Lance Yang wrote:
>>
>> On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>>> There are cases where, if an attempted collapse fails, all subsequent
>>> orders are guaranteed to also fail. Avoid these collapse attempts by
>>> bailing out early.
>>>
>>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>>> Acked-by: Usama Arif <usama.arif@linux.dev>
>>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 24 +++++++++++++++++++++++-
>>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d3d7db8be26c..15b7298bc225 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>>> 			collapse_address = address + offset * PAGE_SIZE;
>>> 			ret = collapse_huge_page(mm, collapse_address, referenced,
>>> 						 unmapped, cc, order);
>>> -			if (ret == SCAN_SUCCEED) {
>>> +
>>> +			switch (ret) {
>>> +			/* Cases where we continue to next collapse candidate */
>>> +			case SCAN_SUCCEED:
>>> 				collapsed += nr_ptes;
>>> +				fallthrough;
>>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>>> 				continue;
>>> +			/* Cases where lower orders might still succeed */
>>> +			case SCAN_LACK_REFERENCED_PAGE:
>>> +			case SCAN_EXCEED_NONE_PTE:
>>> +			case SCAN_EXCEED_SWAP_PTE:
>>> +			case SCAN_EXCEED_SHARED_PTE:
>>> +			case SCAN_PAGE_LOCK:
>>> +			case SCAN_PAGE_COUNT:
>>> +			case SCAN_PAGE_NULL:
>>> +			case SCAN_DEL_PAGE_LRU:
>>> +			case SCAN_PTE_NON_PRESENT:
>>> +			case SCAN_PTE_UFFD_WP:
>>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
>>
>> Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
>> here?
>>
>> If charging the current order fails, a smaller order might still fit :)
> 
> I think the reasoning was here, that if we are already that close to our mem
> limit, we should just give up instead of trying to squeeze it in .. :)

Fair point. Just a nit, never mind :)


* [PATCH mm-unstable v18 13/14] mm/khugepaged: run khugepaged for all orders
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (11 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-22 15:00 ` [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif

From: Baolin Wang <baolin.wang@linux.alibaba.com>

If any order (m)THP is enabled we should allow running khugepaged to
attempt scanning and collapsing mTHPs. In order for khugepaged to operate
when only mTHP sizes are specified in sysfs, we must modify the predicate
function that determines whether it ought to run.

This function is currently called hugepage_pmd_enabled(). This patch
renames it to hugepage_enabled() and updates the logic to determine
whether any valid orders exist that would justify running khugepaged.

We must also update collapse_allowable_orders() to check all orders if
the vma is anonymous and the collapse is performed by khugepaged.

After this patch khugepaged mTHP collapse is fully enabled.

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
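Not part of the patch: a minimal user-space sketch of how the anon part of
the predicate changes, from testing the PMD-order bit to testing whether
any order bit is set. struct sketch_thp_cfg and both function names are
hypothetical, the file-backed and shmem checks are left out, and a PMD
order of 9 is assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: PMD order 9 (4K pages, 2M PMD). */
#define SKETCH_PMD_ORDER 9

/* Hypothetical snapshot of the per-size anon THP sysfs controls. */
struct sketch_thp_cfg {
	unsigned long always;	/* orders set to "always" */
	unsigned long madvise;	/* orders set to "madvise" */
	unsigned long inherit;	/* orders set to "inherit" */
	bool global_enabled;	/* top-level control is not "never" */
};

/* Old behaviour: only the PMD-order bit justifies running khugepaged. */
bool pmd_only_enabled(const struct sketch_thp_cfg *c)
{
	unsigned long pmd_bit = 1UL << SKETCH_PMD_ORDER;

	assert(c);
	return (c->always & pmd_bit) || (c->madvise & pmd_bit) ||
	       ((c->inherit & pmd_bit) && c->global_enabled);
}

/* New behaviour: any enabled order justifies running khugepaged. */
bool any_order_enabled(const struct sketch_thp_cfg *c)
{
	assert(c);
	return c->always || c->madvise ||
	       (c->inherit && c->global_enabled);
}
```

With only a 64K (order-4) size set to "always", the old check reports
false and the new one reports true, which is the case this patch enables.
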
 mm/khugepaged.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15b7298bc225..6d3f4ff6956a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -529,23 +529,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
 	 * hugepages, when configured in, are determined by the global control.
-	 * Anon pmd-sized hugepages are determined by the pmd-size control.
+	 * Anon hugepages are determined by its per-size mTHP control.
 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
 	 */
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 	    hugepage_global_enabled())
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+	if (READ_ONCE(huge_anon_orders_always))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+	if (READ_ONCE(huge_anon_orders_madvise))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+	if (READ_ONCE(huge_anon_orders_inherit) &&
 	    hugepage_global_enabled())
 		return true;
 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -586,7 +586,13 @@ void __khugepaged_enter(struct mm_struct *mm)
 static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
 		vm_flags_t vm_flags, enum tva_type tva_flags)
 {
-	unsigned long orders = BIT(HPAGE_PMD_ORDER);
+	unsigned long orders;
+
+	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
+	if ((tva_flags == TVA_KHUGEPAGED) && vma_is_anonymous(vma))
+		orders = THP_ORDERS_ALL_ANON;
+	else
+		orders = BIT(HPAGE_PMD_ORDER);
 
 	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
@@ -594,11 +600,9 @@ static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
-	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    hugepage_pmd_enabled()) {
-		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
-			__khugepaged_enter(vma->vm_mm);
-	}
+	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && hugepage_enabled()
+	    && collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
+		__khugepaged_enter(vma->vm_mm);
 }
 
 void __khugepaged_exit(struct mm_struct *mm)
@@ -2949,7 +2953,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
 
 static int khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
 }
 
 static int khugepaged_wait_event(void)
@@ -3022,7 +3026,7 @@ static void khugepaged_wait_work(void)
 		return;
 	}
 
-	if (hugepage_pmd_enabled())
+	if (hugepage_enabled())
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
 }
 
@@ -3053,7 +3057,7 @@ void set_recommended_min_free_kbytes(void)
 	int nr_zones = 0;
 	unsigned long recommended_min;
 
-	if (!hugepage_pmd_enabled()) {
+	if (!hugepage_enabled()) {
 		calculate_min_free_kbytes();
 		goto update_wmarks;
 	}
@@ -3103,7 +3107,7 @@ int start_stop_khugepaged(void)
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled()) {
+	if (hugepage_enabled()) {
 		if (!khugepaged_thread)
 			khugepaged_thread = kthread_run(khugepaged, NULL,
 							"khugepaged");
@@ -3129,7 +3133,7 @@ int start_stop_khugepaged(void)
 void khugepaged_min_free_kbytes_update(void)
 {
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled() && khugepaged_thread)
+	if (hugepage_enabled() && khugepaged_thread)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
-- 
2.54.0



* [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (12 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 13/14] mm/khugepaged: run khugepaged for all orders Nico Pache
@ 2026-05-22 15:00 ` Nico Pache
  2026-05-22 21:58   ` David Hildenbrand (Arm)
  2026-05-26 14:45   ` Nico Pache
  2026-05-22 15:07 ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:00 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
	rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
	rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Bagas Sanjaya

Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to use the
feature.

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
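Not part of the patch: a minimal sketch of the max_ptes_none rule for mTHP
collapse as the documentation below describes it (only 0 or
HPAGE_PMD_NR - 1 are honoured, anything else warns and falls back to 0).
The function name and SKETCH_PMD_NR are hypothetical, 4K pages with a
512-entry PMD are assumed, and how the value is applied per order is
deliberately not modelled.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 4K base pages and a 2M PMD, so HPAGE_PMD_NR would be 512. */
#define SKETCH_PMD_NR 512u

/*
 * Return the max_ptes_none value that mTHP collapse would effectively
 * use for a given sysfs setting, per the rule described in the doc.
 */
unsigned int mthp_effective_max_ptes_none(unsigned int max_ptes_none)
{
	/* Sketch-only input check: valid settings are 0..SKETCH_PMD_NR - 1. */
	assert(max_ptes_none < SKETCH_PMD_NR);

	if (max_ptes_none == 0 || max_ptes_none == SKETCH_PMD_NR - 1)
		return max_ptes_none;

	/* Intermediate values are unsupported for mTHP: warn, fall back. */
	fprintf(stderr, "max_ptes_none=%u unsupported for mTHP, using 0\n",
		max_ptes_none);
	return 0;
}
```
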
 Documentation/admin-guide/mm/transhuge.rst | 50 +++++++++++++---------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 80a4d0bed70b..644869d3adfd 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,8 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages of either PMD size
+or mTHP sizes, if the system is configured to do so.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
 	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when any THP size is enabled
 (either of the per-size anon control or the top-level control are set
 to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
+all THP sizes are disabled (when both the per-size anon control and the
 top-level control are "never")
 
 process THP controls
@@ -264,11 +265,6 @@ support the following arguments::
 Khugepaged controls
 -------------------
 
-.. note::
-   khugepaged currently only searches for opportunities to collapse to
-   PMD-sized THP and no attempt is made to collapse to other THP
-   sizes.
-
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
 The khugepaged progress can be seen in the number of pages collapsed (note
 that this counter may not be an exact count of the number of pages
 collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
-being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
-be interpreted roughly as a sign of progress, and counters in /proc/vmstat
-consulted for more accurate accounting)::
+being replaced by a PMD mapping, or (2) physical pages replaced by one
+hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
+or together, depending on the type of memory and the failures that occur.
+As such, this value should be interpreted roughly as a sign of progress,
+and counters in /proc/vmstat consulted for more accurate accounting)::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
 
@@ -308,16 +304,21 @@ for each pass::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
 
-``max_ptes_none`` specifies how many extra small pages (that are
-not already mapped) can be allocated when collapsing a group
-of small pages into one large page::
+``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
+when collapsing a group of small pages into one large page::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
 
-A higher value leads to use additional memory for programs.
-A lower value leads to gain less thp performance. Value of
-max_ptes_none can waste cpu time very little, you can
-ignore it.
+For PMD-sized THP collapse, this directly limits the number of empty pages
+allowed in the 2MB region.
+
+For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At
+HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermediate
+value will emit a warning and mTHP collapse will default to max_ptes_none=0.
+
+A higher value allows more empty pages, potentially leading to more memory
+usage but better THP performance. A lower value is more conservative and
+may result in fewer THP collapses.
 
 ``max_ptes_swap`` specifies how many pages can be brought in from
 swap when collapsing a group of pages into a transparent huge page::
@@ -337,6 +338,15 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+.. note::
+   For mTHP collapse, khugepaged does not support collapsing regions that
+   contain shared or swapped out pages, as this could lead to continuous
+   promotion to higher orders. The collapse will fail if any shared or
+   swapped PTEs are encountered during the scan.
+
+   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
+   and does not attempt mTHP collapses.
+
 Boot parameters
 ===============
 
-- 
2.54.0



* Re: [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2026-05-22 21:58   ` David Hildenbrand (Arm)
  2026-05-26 12:00     ` Nico Pache
  2026-05-26 14:45   ` Nico Pache
  1 sibling, 1 reply; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 21:58 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya


>  
>  process THP controls
> @@ -264,11 +265,6 @@ support the following arguments::
>  Khugepaged controls
>  -------------------
>  
> -.. note::
> -   khugepaged currently only searches for opportunities to collapse to
> -   PMD-sized THP and no attempt is made to collapse to other THP
> -   sizes.

Should we maybe leave this here and clarify that for file/shmem, it will still
only collapse to PMD-sized THPs?

-- 
Cheers,

David


* Re: [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
  2026-05-22 21:58   ` David Hildenbrand (Arm)
@ 2026-05-26 12:00     ` Nico Pache
  0 siblings, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-26 12:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Bagas Sanjaya

On Fri, May 22, 2026 at 3:59 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
>
> >
> >  process THP controls
> > @@ -264,11 +265,6 @@ support the following arguments::
> >  Khugepaged controls
> >  -------------------
> >
> > -.. note::
> > -   khugepaged currently only searches for opportunities to collapse to
> > -   PMD-sized THP and no attempt is made to collapse to other THP
> > -   sizes.
>
> Should we maybe leave this here and clarify that for file/shmem, it will still
> only collapse to PMD-sized THPs?

Ah yes, that would be a good idea. I'll send a fixup!

Thank you :)

>
> --
> Cheers,
>
> David
>



* Re: [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
  2026-05-22 15:00 ` [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  2026-05-22 21:58   ` David Hildenbrand (Arm)
@ 2026-05-26 14:45   ` Nico Pache
  1 sibling, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-26 14:45 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya



On 5/22/26 9:00 AM, Nico Pache wrote:
> Now that we can collapse to mTHPs lets update the admin guide to
> reflect these changes and provide proper guidance on how to utilize it.
> 
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

Hi Andrew,

Can you please append the following fixup to this commit?

The change simply restores a note I had deleted and reworks it slightly
to reflect the new khugepaged behavior.

Cheers!
--Nico

commit d81806992231ef920c731e62468a3a1b2ef6b869
Author: Nico Pache <npache@redhat.com>
Date:   Tue May 26 07:47:42 2026 -0600

    fixup: add back note and edit doc about khugepaged limits
    
    Signed-off-by: Nico Pache <npache@redhat.com>

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 644869d3adfd..ebec1e6b0e6b 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -265,6 +265,11 @@ support the following arguments::
 Khugepaged controls
 -------------------
 
+.. note::
+   khugepaged currently only searches for opportunities to collapse file/shmem
+   to PMD-sized THP. Only anonymous memory will attempt to collapse to other THP
+   sizes.
+
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's


>  Documentation/admin-guide/mm/transhuge.rst | 50 +++++++++++++---------
>  1 file changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 80a4d0bed70b..644869d3adfd 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,8 @@ often.
>  THP can be enabled system wide or restricted to certain tasks or even
>  memory ranges inside task's address space. Unless THP is completely
>  disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages of either PMD size
> +or mTHP sizes, if the system is configured to do so.
>  
>  The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
>  interface and using madvise(2) and prctl(2) system calls.
> @@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
>  	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>  	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>  
> -khugepaged will be automatically started when PMD-sized THP is enabled
> +khugepaged will be automatically started when any THP size is enabled
>  (either of the per-size anon control or the top-level control are set
>  to "always" or "madvise"), and it'll be automatically shutdown when
> -PMD-sized THP is disabled (when both the per-size anon control and the
> +all THP sizes are disabled (when both the per-size anon control and the
>  top-level control are "never")
>  
>  process THP controls
> @@ -264,11 +265,6 @@ support the following arguments::
>  Khugepaged controls
>  -------------------
>  
> -.. note::
> -   khugepaged currently only searches for opportunities to collapse to
> -   PMD-sized THP and no attempt is made to collapse to other THP
> -   sizes.
> -
>  khugepaged runs usually at low frequency so while one may not want to
>  invoke defrag algorithms synchronously during the page faults, it
>  should be worth invoking defrag at least in khugepaged. However it's
> @@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
>  The khugepaged progress can be seen in the number of pages collapsed (note
>  that this counter may not be an exact count of the number of pages
>  collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
> -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
> -one 2M hugepage. Each may happen independently, or together, depending on
> -the type of memory and the failures that occur. As such, this value should
> -be interpreted roughly as a sign of progress, and counters in /proc/vmstat
> -consulted for more accurate accounting)::
> +being replaced by a PMD mapping, or (2) physical pages replaced by one
> +hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
> +or together, depending on the type of memory and the failures that occur.
> +As such, this value should be interpreted roughly as a sign of progress,
> +and counters in /proc/vmstat consulted for more accurate accounting)::
>  
>  	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
>  
> @@ -308,16 +304,21 @@ for each pass::
>  
>  	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
>  
> -``max_ptes_none`` specifies how many extra small pages (that are
> -not already mapped) can be allocated when collapsing a group
> -of small pages into one large page::
> +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
> +when collapsing a group of small pages into one large page::
>  
>  	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>  
> -A higher value leads to use additional memory for programs.
> -A lower value leads to gain less thp performance. Value of
> -max_ptes_none can waste cpu time very little, you can
> -ignore it.
> +For PMD-sized THP collapse, this directly limits the number of empty pages
> +allowed in the 2MB region.
> +
> +For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At
> +HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermediate
> +value will emit a warning and mTHP collapse will default to max_ptes_none=0.
> +
> +A higher value allows more empty pages, potentially leading to more memory
> +usage but better THP performance. A lower value is more conservative and
> +may result in fewer THP collapses.
>  
>  ``max_ptes_swap`` specifies how many pages can be brought in from
>  swap when collapsing a group of pages into a transparent huge page::
> @@ -337,6 +338,15 @@ that THP is shared. Exceeding the number would block the collapse::
>  
>  A higher value may increase memory footprint for some workloads.
>  
> +.. note::
> +   For mTHP collapse, khugepaged does not support collapsing regions that
> +   contain shared or swapped out pages, as this could lead to continuous
> +   promotion to higher orders. The collapse will fail if any shared or
> +   swapped PTEs are encountered during the scan.
> +
> +   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> +   and does not attempt mTHP collapses.
> +
>  Boot parameters
>  ===============
>  
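
The max_ptes_none rules described in the documentation hunk above can be
sketched in userspace C. This is only an illustration under stated
assumptions, not the helper from the series: the names
sketch_max_ptes_none(), SKETCH_PMD_ORDER and SKETCH_PMD_NR are
hypothetical, a 4K base page with a 2M PMD is assumed, and the per-order
limit of (1 << order) - 1 for the "511" setting is an assumption about how
"all but one PTE may be empty" would apply to a smaller order.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_PMD_ORDER 9u                        /* assumption: 4K pages, 2M PMD */
#define SKETCH_PMD_NR    (1u << SKETCH_PMD_ORDER)  /* 512 PTEs per PMD */

/*
 * Hypothetical helper: effective number of empty (none/zero) PTEs tolerated
 * when collapsing a region of the given order, for a configured sysfs value.
 */
static unsigned int sketch_max_ptes_none(unsigned int configured,
					 unsigned int order)
{
	assert(order <= SKETCH_PMD_ORDER);

	/* PMD-sized collapse: the configured value is used directly. */
	if (order == SKETCH_PMD_ORDER)
		return configured;

	/* mTHP collapse with 0: no empty PTEs are tolerated. */
	if (configured == 0)
		return 0;

	/* Assumption: 511 means all but one PTE of this order may be empty. */
	if (configured == SKETCH_PMD_NR - 1)
		return (1u << order) - 1;

	/* Any intermediate value: warn and fall back to 0, as documented. */
	fprintf(stderr,
		"max_ptes_none=%u is unsupported for mTHP collapse, using 0\n",
		configured);
	return 0;
}
```

In this sketch an intermediate value such as 64 is honoured for a
PMD-sized collapse but falls back to 0 for any smaller order, which
matches the fallback-to-0 wording in the documentation change.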



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (13 preceding siblings ...)
  2026-05-22 15:00 ` [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2026-05-22 15:07 ` Nico Pache
  2026-05-22 15:13   ` Vlastimil Babka (SUSE)
  2026-05-22 15:16   ` Lorenzo Stoakes
  2026-05-22 15:13 ` Lorenzo Stoakes
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 114+ messages in thread
From: Nico Pache @ 2026-05-22 15:07 UTC (permalink / raw)
  To: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 8:59 AM Nico Pache <npache@redhat.com> wrote:
>
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
>
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> pages that are occupied (!none/zero). After the PMD scan is done, we use
> the bitmap to find the optimal mTHP sizes for the PMD range. The
> restriction on max_ptes_none is removed during the scan, to make sure we
> account for the whole PMD range in the bitmap. When no mTHP size is
> enabled, the legacy behavior of khugepaged is maintained.
>
> We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
> (ie 511). If any other value is specified, the kernel will emit a warning
> and mTHP collapse will default to max_ptes_none=0. If a mTHP collapse is
> attempted, but contains swapped out, or shared pages, we don't perform
> the collapse.
> It is now also possible to collapse to mTHPs without requiring the PMD THP
> size to be enabled. These limitations are to prevent collapse "creep"
> behavior. This prevents constantly promoting mTHPs to the next available
> size, which would occur because a collapse introduces more non-zero pages
> that would satisfy the promotion condition on subsequent scans.
>
> Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
>              for arbitrary orders.
> Patch 3:     Rework max_ptes_* handling into helper functions
> Patch 4:     Generalize __collapse_huge_page_* for mTHP support
> Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
> Patch 6:     Generalize collapse_huge_page for mTHP collapse
> Patch 7:     Skip collapsing mTHP to smaller orders
> Patch 8-9:   Add per-order mTHP statistics and tracepoints
> Patch 10:    Introduce collapse_allowable_orders helper function
> Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
> Patch 14:    Documentation
>
> Testing:
> - Built for x86_64, aarch64, ppc64le, and s390x
> - ran all arches on test suites provided by the kernel-tests project
> - internal testing suites: functional testing and performance testing
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits
>    while monitoring a number of stats and tracepoints. The code is
>    available here[1] (Run in legacy mode for these changes and set mthp
>    sizes to inherit)
>    The summary from my testings was that there was no significant
>    regression noticed through this test. In some cases my changes had
>    better collapse latencies, and was able to scan more pages in the same
>    amount of time/work, but for the most part the results were consistent.
> - redis testing. I did some testing with these changes along with my defer
>   changes (see followup [2] post for more details). We've decided to get
>   the mTHP changes merged first before attempting the defer series.
> - some basic testing on 64k page size.
> - lots of general use.
>
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
>
> V18 Changes:
> - Added RBs/Acks
> - [patch 02] Guard count_memcg_folio_events with is_pmd_order() to keep
>   THP_COLLAPSE_ALLOC PMD-only (Usama, Lance)
> - [patch 03] Convert C++ comments to C-style; fix "none-page" terminology
>   to "empty PTEs or PTEs mapping the shared zeropage"; drop unnecessary
>   userfaultfd comment; add const to local max_ptes_* variables; fix
>   "repect" typo (Lance, David)
> - [patch 04] collapse_max_ptes_none() now returns 0 instead of -EINVAL for
>   unsupported values; remove SCAN_INVALID_PTES_NONE; change return type
>   from int to unsigned int and propagate to all callers; add comment above
>   __collapse_huge_page_swapin explaining mTHP swap bail-out (David,
>   Lorenzo, Lance, Wei Yang, Usama)
> - [patch 05] Rewrite collapse_huge_page lock comment to David's suggested
>   wording (David)
> - [patch 11] Propagate unsigned int return type for max_ptes_none; remove
>   the now-unnecessary negative return check (consequence of patch 04);
>   Add optimization to the next_order goto that will prevent unnecessary
>   iterations if there are no lower orders enabled (Vernon); update locking
>   comment; pass VMA to mthp_collapse to improve uffd-armed detection, and
>   prevent unnecessary work. (Wei)
> - [patch 14] Update documentation to reflect fallback-to-0 behavior
>
> V17: https://lore.kernel.org/all/20260511185817.686831-1-npache@redhat.com
> V16: https://lore.kernel.org/all/20260419185750.260784-1-npache@redhat.com
> V15: https://lore.kernel.org/all/20260226031741.230674-1-npache@redhat.com
> V14: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com
> V13: https://lore.kernel.org/all/20251201174627.23295-1-npache@redhat.com
> V12: https://lore.kernel.org/all/20251022183717.70829-1-npache@redhat.com
> V11: https://lore.kernel.org/all/20250912032810.197475-1-npache@redhat.com
> V10: https://lore.kernel.org/all/20250819134205.622806-1-npache@redhat.com
> V9 : https://lore.kernel.org/all/20250714003207.113275-1-npache@redhat.com
> V8 : https://lore.kernel.org/all/20250702055742.102808-1-npache@redhat.com
> V7 : https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com
> V6 : https://lore.kernel.org/all/20250515030312.125567-1-npache@redhat.com
> V5 : https://lore.kernel.org/all/20250428181218.85925-1-npache@redhat.com
> V4 : https://lore.kernel.org/all/20250417000238.74567-1-npache@redhat.com
> V3 : https://lore.kernel.org/all/20250414220557.35388-1-npache@redhat.com
> V2 : https://lore.kernel.org/all/20250211003028.213461-1-npache@redhat.com
> V1 : https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com
>
> Baolin Wang (1):
>   mm/khugepaged: run khugepaged for all orders
>
> Dev Jain (1):
>   mm/khugepaged: generalize alloc_charge_folio()
>
> Nico Pache (12):
>   mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   mm/khugepaged: rework max_ptes_* handling with helper functions
>   mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
>   mm/khugepaged: require collapse_huge_page to enter/exit with the lock
>     dropped
>   mm/khugepaged: generalize collapse_huge_page for mTHP collapse
>   mm/khugepaged: skip collapsing mTHP to smaller orders
>   mm/khugepaged: add per-order mTHP collapse failure statistics
>   mm/khugepaged: improve tracepoints for mTHP orders
>   mm/khugepaged: introduce collapse_allowable_orders helper function
>   mm/khugepaged: Introduce mTHP collapse support
>   mm/khugepaged: avoid unnecessary mTHP collapse attempts
>   Documentation: mm: update the admin guide for mTHP collapse
>
>  Documentation/admin-guide/mm/transhuge.rst |  72 ++-
>  include/linux/huge_mm.h                    |   5 +
>  include/trace/events/huge_memory.h         |  34 +-
>  mm/huge_memory.c                           |  11 +
>  mm/khugepaged.c                            | 634 ++++++++++++++++-----
>  5 files changed, 584 insertions(+), 172 deletions(-)
>
>
> base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346

Whoops, I manually changed the cover letter subject to reflect that this
is on mm-hotfixes-unstable but never updated the others...

Hopefully that is ok. Just a small mistake. Base commit is referenced here.

-- Nico


> --
> 2.54.0
>
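
The bitmap idea from the quoted cover letter (record which PTEs in the PMD
range are occupied, then pick a collapse order from the bitmap) can be
sketched roughly as follows. This is a simplified userspace illustration,
not the algorithm in mm/khugepaged.c: sketch_pick_order(),
sketch_count_occupied() and the SKETCH_* constants are hypothetical names,
a plain bool array stands in for the kernel bitmap, and only the two
supported max_ptes_none settings (0 and HPAGE_PMD_NR - 1) are modelled via
the allow_empty flag.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PMD_ORDER 9                         /* assumption: 4K pages, 2M PMD */
#define SKETCH_PMD_NR    (1u << SKETCH_PMD_ORDER)  /* 512 PTEs per PMD */

/* Count occupied (!none/zero) PTEs in [start, start + nr). */
static unsigned int sketch_count_occupied(const bool *occupied,
					  unsigned int start, unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = start; i < start + nr; i++)
		n += occupied[i] ? 1 : 0;
	return n;
}

/*
 * Return the highest enabled order whose naturally aligned region starting
 * at 'start' satisfies the empty-PTE limit, or -1 if none does.
 * allow_empty == false models max_ptes_none=0; allow_empty == true models
 * max_ptes_none=511 (at least one PTE in the region must be occupied).
 */
static int sketch_pick_order(const bool *occupied, unsigned int start,
			     unsigned long enabled_orders, bool allow_empty)
{
	int order;

	assert(start < SKETCH_PMD_NR);

	for (order = SKETCH_PMD_ORDER; order > 0; order--) {
		unsigned int nr = 1u << order;
		unsigned int limit = allow_empty ? nr - 1 : 0;
		unsigned int empty;

		if (!(enabled_orders & (1ul << order)))
			continue;	/* this size is not enabled */
		if (start % nr)
			continue;	/* region must be naturally aligned */

		empty = nr - sketch_count_occupied(occupied, start, nr);
		if (empty <= limit)
			return order;
	}
	return -1;
}
```

With only the first 16 PTEs occupied and orders 4 and 6 enabled, the
sketch picks order 4 at offset 0 when no empty PTEs are allowed, and
order 6 when all but one may be empty. The second case shows why the
permissive setting tends toward the highest possible order.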



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 15:07 ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
@ 2026-05-22 15:13   ` Vlastimil Babka (SUSE)
  2026-05-22 16:11     ` Nico Pache
  2026-05-22 15:16   ` Lorenzo Stoakes
  1 sibling, 1 reply; 114+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-22 15:13 UTC (permalink / raw)
  To: Nico Pache, linux-doc, akpm, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 5/22/26 17:07, Nico Pache wrote:
> On Fri, May 22, 2026 at 8:59 AM Nico Pache <npache@redhat.com> wrote:
>>  include/trace/events/huge_memory.h         |  34 +-
>>  mm/huge_memory.c                           |  11 +
>>  mm/khugepaged.c                            | 634 ++++++++++++++++-----
>>  5 files changed, 584 insertions(+), 172 deletions(-)
>>
>>
>> base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346
> 
> Whoops I manually changed the coverletter subject to reflect that this
> in on mm-hotfixes-unstable but never updated the others...

But why? That branch is for hotfixes that would go to the current 7.1-rcX
series. mm-unstable would be the correct one for this, AFAICT.

> Hopefully that is ok. Just a small mistake. Base commit is referenced here.
> 
> -- Nico
> 
> 
>> --
>> 2.54.0
>>
> 



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 15:13   ` Vlastimil Babka (SUSE)
@ 2026-05-22 16:11     ` Nico Pache
  2026-05-22 21:13       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 16:11 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 9:13 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 5/22/26 17:07, Nico Pache wrote:
> > On Fri, May 22, 2026 at 8:59 AM Nico Pache <npache@redhat.com> wrote:
> >>  include/trace/events/huge_memory.h         |  34 +-
> >>  mm/huge_memory.c                           |  11 +
> >>  mm/khugepaged.c                            | 634 ++++++++++++++++-----
> >>  5 files changed, 584 insertions(+), 172 deletions(-)
> >>
> >>
> >> base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346
> >
> > Whoops I manually changed the coverletter subject to reflect that this
> > in on mm-hotfixes-unstable but never updated the others...
>
> But why? That branch is for hotfixes that would go to the current 7.1-rcX
> series. mm-unstable would be the correct one for this, AFAICT.

Sorry this was a misunderstanding. The goal here was to base this off
the closest base commit behind where my v17 already lies in the tree.

That just happened to be the hotfixes tree (previously it was
mm-unstable, but that seems to have moved).

Sorry...
-- Nico

>
> > Hopefully that is ok. Just a small mistake. Base commit is referenced here.
> >
> > -- Nico
> >
> >
> >> --
> >> 2.54.0
> >>
> >
>



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 16:11     ` Nico Pache
@ 2026-05-22 21:13       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 114+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 21:13 UTC (permalink / raw)
  To: Nico Pache, Vlastimil Babka (SUSE)
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On 5/22/26 18:11, Nico Pache wrote:
> On Fri, May 22, 2026 at 9:13 AM Vlastimil Babka (SUSE)
> <vbabka@kernel.org> wrote:
>>
>> On 5/22/26 17:07, Nico Pache wrote:
>>>
>>> Whoops I manually changed the coverletter subject to reflect that this
>>> in on mm-hotfixes-unstable but never updated the others...
>>
>> But why? That branch is for hotfixes that would go to the current 7.1-rcX
>> series. mm-unstable would be the correct one for this, AFAICT.
> 
> Sorry this was a misunderstanding. The goal here was to base this off
> the closest base commit behind where my v17 already lies in the tree.

Ah, I guess this is a problem of "v17 is already in mm-unstable, so against what
to base v18".

Yeah, we touched on that problem in the LSF/MM process discussion ...

-- 
Cheers,

David


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 15:07 ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
  2026-05-22 15:13   ` Vlastimil Babka (SUSE)
@ 2026-05-22 15:16   ` Lorenzo Stoakes
  2026-05-22 16:08     ` Nico Pache
  1 sibling, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 15:16 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 09:07:29AM -0600, Nico Pache wrote:
> Whoops I manually changed the coverletter subject to reflect that this
> in on mm-hotfixes-unstable but never updated the others...
>
> Hopefully that is ok. Just a small mistake. Base commit is referenced here.

It's not ok, this isn't suitable for a hotfix in any way shape or form?

As you know, because we told you :) May has been difficult because of
conferences, holidays (and in my case burnout recovery).

And unfortunately the series seems to have needed quite a bit of review again
(my suggestion to you would be to ensure you don't make major changes, only
small incremental ones on the basis of review feedback).

So this isn't viable for 7.2, and we'll have to target 7.3. Therefore there
was no rush.

Also please don't spring a respin of this series on us without discussion
first, with people away and (frankly) the amount of work involved here,
you're going to have to accept the pace that workload/availability permits.

Adding spurious hotfixes tags doesn't help anything :) please don't do that
again.

Thanks, Lorenzo


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 15:16   ` Lorenzo Stoakes
@ 2026-05-22 16:08     ` Nico Pache
  2026-05-22 16:19       ` Lorenzo Stoakes
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 16:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 9:17 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:07:29AM -0600, Nico Pache wrote:
> > Whoops I manually changed the coverletter subject to reflect that this
> > in on mm-hotfixes-unstable but never updated the others...
> >
> > Hopefully that is ok. Just a small mistake. Base commit is referenced here.
>
> It's not ok, this isn't suitable for a hotfix in any way shape or form?
>
> As you know, because we told you :) May has been difficult because of
> conferences, holidays (and in my case burnout recovery).
>
> And unfortunately the series seems to have needed quite a bit of review again
> (my suggestion to you would be to ensure you don't make major changes, only
> small incremental ones on the basis of review feedback).
>
> So this isn't viable for 7.2, and we'll have to target 7.3. Therefore there
> was no rush.
>
> Also please don't spring a respin on this series on us without discussion
> first, with people away and (frankly) the amount of work involved here,
> you're going to have to accept the pace that workload/availability permits.
>
> Adding spurious hotfixes tags doesn't help anything :) please don't do that
> again.

Hi,

Sorry for the confusion but Andrew and I spoke about this before I
sent it, and he confirmed that I should send it against this tree to
prevent merge conflicts.

Because Zi's series depends on this, and this is already in the mm
tree, choosing a candidate before my commits was best to prevent merge
conflicts.

The intent wasn't that this is a hotfix, just that this was the
closest base before the v17 that is already in the tree.

Sorry for the confusion, hopefully Andrew can still apply it to the
correct tree.

-- Nico

>
> Thanks, Lorenzo
>



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 16:08     ` Nico Pache
@ 2026-05-22 16:19       ` Lorenzo Stoakes
  2026-05-22 16:31         ` Nico Pache
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 16:19 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 10:08:19AM -0600, Nico Pache wrote:
> On Fri, May 22, 2026 at 9:17 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Fri, May 22, 2026 at 09:07:29AM -0600, Nico Pache wrote:
> > > Whoops I manually changed the coverletter subject to reflect that this
> > > in on mm-hotfixes-unstable but never updated the others...
> > >
> > > Hopefully that is ok. Just a small mistake. Base commit is referenced here.
> >
> > It's not ok, this isn't suitable for a hotfix in any way shape or form?
> >
> > As you know, because we told you :) May has been difficult because of
> > conferences, holidays (and in my case burnout recovery).
> >
> > And unfortunately the series seems to have needed quite a bit of review again
> > (my suggestion to you would be to ensure you don't make major changes, only
> > small incremental ones on the basis of review feedback).
> >
> > So this isn't viable for 7.2, and we'll have to target 7.3. Therefore there
> > was no rush.
> >
> > Also please don't spring a respin on this series on us without discussion
> > first, with people away and (frankly) the amount of work involved here,
> > you're going to have to accept the pace that workload/availability permits.
> >
> > Adding spurious hotfixes tags doesn't help anything :) please don't do that
> > again.
>
> Hi,
>
> Sorry for the confusion but Andrew and I spoke about this before I
> sent it, and he confirmed that I should send it against this tree to
> prevent merge conflicts.
>
> Because Zi's series depends on this, and this is already in the mm
> tree, choosing a candidate before my commits was best to prevent merge
> conflicts.

There's some kind of confusion here.

This series isn't suited for 7.2.

Sorry but Zi's series, unless it depends on functionality here, will have
to be rebased.

People have been at conferences, people have been on leave, I've had to
pace myself for health reasons and it seems there's been more than simply
review comment-based changes happening here.

(Again, I strongly encourage you, at this stage, to ONLY make changes
based on review, not adding ANYTHING else or changing ANYTHING else, to
avoid delays :)

Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?

I think in mm-next we will have a stable branch, that everything is
based on, where things go once review is complete and things are mergeable.

And a separate hotfixes branch based on Linus's tree.

That would avoid issues like this :)

>
> The intent wasn't that this is a hotfix, just that this was the
> closest base before the v17 that is already in the tree.

The convention is that [PATCH ... <branch>] indicates the target of the
changes. Putting the hotfixes branch there implies it's a hotfix.

So please be careful with that in future :)

>
> Sorry for the confusion, hopefully Andrew can still apply it to the
> correct tree.

I'm not even sure what's best for that at this stage given we have
conflicts and this has to be delayed until 7.3.

I wonder if given that we should not have this in mm-unstable at all and
just wait it out until the next cycle begins? Review can happen
concurrently.

>
> -- Nico
>
> >
> > Thanks, Lorenzo
> >
>

Thanks, Lorenzo


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 16:19       ` Lorenzo Stoakes
@ 2026-05-22 16:31         ` Nico Pache
  2026-05-22 17:12           ` Lorenzo Stoakes
  0 siblings, 1 reply; 114+ messages in thread
From: Nico Pache @ 2026-05-22 16:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 10:20 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 10:08:19AM -0600, Nico Pache wrote:
> > On Fri, May 22, 2026 at 9:17 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Fri, May 22, 2026 at 09:07:29AM -0600, Nico Pache wrote:
> > > > Whoops I manually changed the coverletter subject to reflect that this
> > > > in on mm-hotfixes-unstable but never updated the others...
> > > >
> > > > Hopefully that is ok. Just a small mistake. Base commit is referenced here.
> > >
> > > It's not ok, this isn't suitable for a hotfix in any way shape or form?
> > >
> > > As you know, because we told you :) May has been difficult because of
> > > conferences, holidays (and in my case burnout recovery).
> > >
> > > And unfortunately the series seems to have needed quite a bit of review again
> > > (my suggestion to you would be to ensure you don't make major changes, only
> > > small incremental ones on the basis of review feedback).
> > >
> > > So this isn't viable for 7.2, and we'll have to target 7.3. Therefore there
> > > was no rush.
> > >
> > > Also please don't spring a respin on this series on us without discussion
> > > first, with people away and (frankly) the amount of work involved here,
> > > you're going to have to accept the pace that workload/availability permits.
> > >
> > > Adding spurious hotfixes tags doesn't help anything :) please don't do that
> > > again.
> >
> > Hi,
> >
> > Sorry for the confusion but Andrew and I spoke about this before I
> > sent it, and he confirmed that I should send it against this tree to
> > prevent merge conflicts.
> >
> > Because Zi's series depends on this, and this is already in the mm
> > tree, choosing a candidate before my commits was best to prevent merge
> > conflicts.
>
> There's some kind of confusion here.
>
> This series isn't suited for 7.2.
>
> Sorry but Zi's series, unless it depends on functionality here, will have
> to be rebased.
>
> People have been at conferences, people have been on leave, I've had to
> pace myself for health reasons and it seems there's been more than simply
> review comment-based changes happening here.
>
> (Again I strongly encourage, at this stage, to ONLY be making changes based
> on review, not adding ANYTHING else or changing ANYTHING else to avoid
> delays :)

All the changes are based on review points. Very small changes in this
version; the largest being the one that you specifically agreed to.

>
> Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?
>
> I think in mm-next we will have a stable branch, that everything is
> based on, where things go once review is complete and things are mergeable.
>
> And a separate hotfixes branch based on Linus's tree.
>
> That would avoid issues like this :)

I'm sorry, I'm new to this, but I really don't think this tiny error, and
something that I'd confirmed with Andrew beforehand, deserves NAKing
and deferring it. I've worked through my PTO to clean up some of these
review nits just to get it in 7.2. I even ran this through my
rounds of testing today before resending.

>
> >
> > The intent wasn't that this is a hotfix, just that this was the
> > closest base before the v17 that is already in the tree.
>
> The convention is that [PATCH ... <branch>] indicates the target of the
> changes. Putting the hotfixes branch there implies it's a hotfix.

Sorry I thought the <branch> was what base you used.

>
> So please be careful with that in future :)

Yes will do for sure.

>
> >
> > Sorry for the confusion, hopefully Andrew can still apply it to the
> > correct tree.
>
> I'm not even sure what's best for that at this stage given we have
> conflicts and this has to be delayed until 7.3.
>
> I wonder if given that we should not have this in mm-unstable at all and
> just wait it out until the next cycle begins? Review can happen
> concurrently.

I still don't see why this has to be deferred, I was working with
Andrew to prevent merge headaches.

-- Nico

>
> >
> > -- Nico
> >
> > >
> > > Thanks, Lorenzo
> > >
> >
>
> Thanks, Lorenzo
>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 16:31         ` Nico Pache
@ 2026-05-22 17:12           ` Lorenzo Stoakes
  2026-05-26  8:14             ` Lorenzo Stoakes
  0 siblings, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 17:12 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

On Fri, May 22, 2026 at 10:31:41AM -0600, Nico Pache wrote:
> On Fri, May 22, 2026 at 10:20 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > There's some kind of confusion here.
> >
> > This series isn't suited for 7.2.
> >
> > Sorry but Zi's series, unless it depends on functionality here, will have
> > to be rebased.
> >
> > People have been at conferences, people have been on leave, I've had to
> > pace myself for health reasons and it seems there's been more than simply
> > review comment-based changes happening here.
> >
> > (Again I strongly encourage, at this stage, to ONLY be making changes based
> > on review, not adding ANYTHING else or changing ANYTHING else to avoid
> > delays :)
>
> All the changes are based on review points. Very small changes in this
> version; the largest being the one that you specifically agreed to.

16->17

 Documentation/admin-guide/mm/transhuge.rst |  24 +++++-------------
 include/linux/khugepaged.h                 |   7 ++---
 include/trace/events/huge_memory.h         |   3 ++-
 mm/huge_memory.c                           |   2 +-
 mm/khugepaged.c                            | 168 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------------------
 mm/vma.c                                   |   6 ++---
 tools/testing/vma/include/stubs.h          |   3 ++-
 7 files changed, 103 insertions(+), 110 deletions(-)

17->18

 Documentation/admin-guide/mm/transhuge.rst |   5 +++--
 include/trace/events/huge_memory.h         |   3 +--
 mm/khugepaged.c                            | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
 3 files changed, 66 insertions(+), 63 deletions(-)

These are not 'very small changes'.

We're nearly at rc-5, and this is a major, invasive, dangerous change that
we have to get right.

You've also made changes unrelated to review, repeatedly, throughout this
process, which as I've told you, is causing delays.

You've also throughout the review of this series done stuff like make MAJOR
changes to things and _kept review tags_.

You're forcing us to use git range-diff etc. to forensically check that the
series is what is claimed.

Dude I mean you switched to using // comment style which is not used in mm
anywhere for instance? Don't do things like that and complain about
delays. Honestly.

Also, again, LSF happened. Other conferences happened. Bandwidth is
reduced.

So again, I'm sorry, but you've been hit with some bad luck here.

I really wanted this in for 7.2, and I feel bad that we couldn't make it,
but you're also doing things that are making it difficult for us.

I've spent double-digit hours on your series, and I've also had work
pushed out because of that, leading me to work evenings and weekends as a
result.

And I'm not even going to get any credit for it :))

So while I sympathise, really, please have empathy and realise it goes both
ways, please.

I'm not being mean for the sake of it, I'm pushing back because I feel this
is not at a stage where I'd feel confident in this being merged at this
time.

And it's very much a regret, as I _really_ wanted us to have it in this
time. But life and circumstances and the issues mentioned above have
intervened, sadly.

>
> >
> > Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?
> >
> > I think in mm-next we will have a stable branch, that everything is
> > based on, where things go once review is complete and things are mergeable.
> >
> > And a separate hotfixes branch based on Linus's tree.
> >
> > That would avoid issues like this :)
>
> I'm sorry, I'm new to this, but I really don't think this tiny error, and
> something that I'd confirmed with Andrew beforehand, deserves NAKing
> and deferring it. I've worked through my PTO to clean up some of these
> review nits just to get it in 7.2. I even ran this through my
> rounds of testing today before resending.

The issue wasn't the error (though it wasn't tiny...!), it's the state of
review. There were fresh review comments from a few days ago, and there are
big diffs between revisions.

You've also made unrelated changes as you have done throughout the series.

As I said above, I'm sorry that you spent time in your PTO on this, but we
cannot rush this in when things are not clearly ready yet, and I am not
confident in this being ready at this stage.

>
> >
> > >
> > > The intent wasn't that this is a hotfix, just that this was the
> > > closest base before the v17 that is already in the tree.
> >
> > The convention is that [PATCH ... <branch>] indicates the target of the
> > changes. Putting the hotfixes branch there implies it's a hotfix.
>
> Sorry I thought the <branch> was what base you used.

I mean, sure there's clearly confusion here as you sent [PATCH 7.2 v16 ...]
(against an unreleased kernel version) then a branch specifier then the
hotfixes one...

Anyway sure, it's fine, I've made vastly more dumb mistakes than that
myself, nobody minds, but it's concerning as by convention [PATCH
... <mm->hotfixes<whatever>] generally is taken to mean 'please rush this
to hotfixes!' :)

So be careful with that please!

>
> >
> > So please be careful with that in future :)
>
> Yes will do for sure.

Thanks!

>
> >
> > >
> > > Sorry for the confusion, hopefully Andrew can still apply it to the
> > > correct tree.
> >
> > I'm not even sure what's best for that at this stage given we have
> > conflicts and this has to be delayed until 7.3.
> >
> > I wonder if given that we should not have this in mm-unstable at all and
> > just wait it out until the next cycle begins? Review can happen
> > concurrently.
>
> I still don't see why this has to be deferred, I was working with
> Andrew to prevent merge headaches.

I've explained the why above, and David and I co-maintain THP so I feel
that ultimately given the blood, sweat and tears we've put into THP review
we ought to have some input on this :)

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 17:12           ` Lorenzo Stoakes
@ 2026-05-26  8:14             ` Lorenzo Stoakes
  0 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-05-26  8:14 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe

Nico,

While I stand by the below, and we very well might wish to delay this until
the next cycle, I will try to take some time to go through this myself as
soon as I am able.

If David's happy with it for this cycle, and I don't find anything too
crazy, then it's not impossible we could still move forward with it now.

My only aim here is to avoid rushing something in that might have
unexpected changes or issues in it, given how late in the cycle we are :)

Cheers, Lorenzo

On Fri, May 22, 2026 at 06:12:59PM +0100, Lorenzo Stoakes wrote:
> On Fri, May 22, 2026 at 10:31:41AM -0600, Nico Pache wrote:
> > On Fri, May 22, 2026 at 10:20 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > There's some kind of confusion here.
> > >
> > > This series isn't suited for 7.2.
> > >
> > > Sorry but Zi's series, unless it depends on functionality here, will have
> > > to be rebased.
> > >
> > > People have been at conferences, people have been on leave, I've had to
> > > pace myself for health reasons and it seems there's been more than simply
> > > review comment-based changes happening here.
> > >
> > > (Again I strongly encourage, at this stage, to ONLY be making changes based
> > > on review, not adding ANYTHING else or changing ANYTHING else to avoid
> > > delays :)
> >
> > All the changes are based on review points. Very small changes in this
> > version; the largest being the one that you specifically agreed to.
>
> 16->17
>
>  Documentation/admin-guide/mm/transhuge.rst |  24 +++++-------------
>  include/linux/khugepaged.h                 |   7 ++---
>  include/trace/events/huge_memory.h         |   3 ++-
>  mm/huge_memory.c                           |   2 +-
>  mm/khugepaged.c                            | 168 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------------------
>  mm/vma.c                                   |   6 ++---
>  tools/testing/vma/include/stubs.h          |   3 ++-
>  7 files changed, 103 insertions(+), 110 deletions(-)
>
> 17->18
>
>  Documentation/admin-guide/mm/transhuge.rst |   5 +++--
>  include/trace/events/huge_memory.h         |   3 +--
>  mm/khugepaged.c                            | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
>  3 files changed, 66 insertions(+), 63 deletions(-)
>
> These are not 'very small changes'.
>
> We're nearly at rc-5, and this is a major, invasive, dangerous change that
> we have to get right.
>
> You've also made changes unrelated to review, repeatedly, throughout this
> process, which as I've told you, is causing delays.
>
> You've also throughout the review of this series done stuff like make MAJOR
> changes to things and _kept review tags_.
>
> You're forcing us to use git range-diff etc. to forensically check that the
> series is what is claimed.
>
> Dude I mean you switched to using // comment style which is not used in mm
> anywhere for instance? Don't do things like that and complain about
> delays. Honestly.
>
> Also, again, LSF happened. Other conferences happened. Bandwidth is
> reduced.
>
> So again, I'm sorry, but you've been hit with some bad luck here.
>
> I really wanted this in for 7.2, and I feel bad that we couldn't make it,
> but you're also doing things that are making it difficult for us.
>
> I've spent double-digit hours on your series, and I've also had work
> pushed out because of that, leading me to work evenings and weekends as a
> result.
>
> And I'm not even going to get any credit for it :))
>
> So while I sympathise, really, please have empathy and realise it goes both
> ways, please.
>
> I'm not being mean for the sake of it, I'm pushing back because I feel this
> is not at a stage where I'd feel confident in this being merged at this
> time.
>
> And it's very much a regret, as I _really_ wanted us to have it in this
> time. But life and circumstances and the issues mentioned above have
> intervened, sadly.
>
> >
> > >
> > > Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?
> > >
> > > I think in mm-next we will have a stable branch, that everything is
> > > based on, where things go once review is complete and things are mergeable.
> > >
> > > And a separate hotfixes branch based on Linus's tree.
> > >
> > > That would avoid issues like this :)
> >
> > I'm sorry, I'm new to this, but I really don't think this tiny error, and
> > something that I'd confirmed with Andrew beforehand, deserves NAKing
> > and deferring it. I've worked through my PTO to clean up some of these
> > review nits just to get it in 7.2. I even ran this through my
> > rounds of testing today before resending.
>
> The issue wasn't the error (though it wasn't tiny...!), it's the state of
> review. There were fresh review comments from a few days ago, and there are
> big diffs between revisions.
>
> You've also made unrelated changes as you have done throughout the series.
>
> As I said above, I'm sorry that you spent time in your PTO on this, but we
> cannot rush this in when things are not clearly ready yet, and I am not
> confident in this being ready at this stage.
>
> >
> > >
> > > >
> > > > The intent wasn't that this is a hotfix, just that this was the
> > > > closest base before the v17 that is already in the tree.
> > >
> > > The convention is that [PATCH ... <branch>] indicates the target of the
> > > changes. Putting the hotfixes branch there implies it's a hotfix.
> >
> > Sorry I thought the <branch> was what base you used.
>
> I mean, sure there's clearly confusion here as you sent [PATCH 7.2 v16 ...]
> (against an unreleased kernel version) then a branch specifier then the
> hotfixes one...
>
> Anyway sure, it's fine, I've made vastly more dumb mistakes than that
> myself, nobody minds, but it's concerning as by convention [PATCH
> ... <mm->hotfixes<whatever>] generally is taken to mean 'please rush this
> to hotfixes!' :)
>
> So be careful with that please!
>
> >
> > >
> > > So please be careful with that in future :)
> >
> > Yes will do for sure.
>
> Thanks!
>
> >
> > >
> > > >
> > > > Sorry for the confusion, hopefully Andrew can still apply it to the
> > > > correct tree.
> > >
> > > I'm not even sure what's best for that at this stage given we have
> > > conflicts and this has to be delayed until 7.3.
> > >
> > > I wonder if given that we should not have this in mm-unstable at all and
> > > just wait it out until the next cycle begins? Review can happen
> > > concurrently.
> >
> > I still don't see why this has to be deferred, I was working with
> > Andrew to prevent merge headaches.
>
> I've explained the why above, and David and I co-maintain THP so I feel
> that ultimately given the blood, sweat and tears we've put into THP review
> we ought to have some input on this :)
>
> Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (14 preceding siblings ...)
  2026-05-22 15:07 ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
@ 2026-05-22 15:13 ` Lorenzo Stoakes
  2026-05-22 20:47 ` Andrew Morton
  2026-06-04 10:10 ` Lorenzo Stoakes
  17 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 15:13 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

NAK.

This is not a hotfixes candidate Nico. Don't send a massive series like this
with that tag please.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (15 preceding siblings ...)
  2026-05-22 15:13 ` Lorenzo Stoakes
@ 2026-05-22 20:47 ` Andrew Morton
  2026-06-01 15:58   ` Alexander Gordeev
  2026-06-04 10:10 ` Lorenzo Stoakes
  17 siblings, 1 reply; 114+ messages in thread
From: Andrew Morton @ 2026-05-22 20:47 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:

> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.

Thanks, I've updated mm.git's mm-unstable branch to this version.

It sounds like I might be dropping it soon, haven't started looking at
that yet.  But let's at least eyeball the latest version at this time.

Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
well, thanks.  The AI checking made a few allegations:

	https://sashiko.dev/#/patchset/20260522150009.121603-1-npache@redhat.com

> V18 Changes:
> - Added RBs/Acks
> - [patch 02] Guard count_memcg_folio_events with is_pmd_order() to keep
>   THP_COLLAPSE_ALLOC PMD-only (Usama, Lance)
> - [patch 03] Convert C++ comments to C-style; fix "none-page" terminology
>   to "empty PTEs or PTEs mapping the shared zeropage"; drop unnecessary
>   userfaultfd comment; add const to local max_ptes_* variables; fix
>   "repect" typo (Lance, David)
> - [patch 04] collapse_max_ptes_none() now returns 0 instead of -EINVAL for
>   unsupported values; remove SCAN_INVALID_PTES_NONE; change return type
>   from int to unsigned int and propagate to all callers; add comment above
>   __collapse_huge_page_swapin explaining mTHP swap bail-out (David,
>   Lorenzo, Lance, Wei Yang, Usama)
> - [patch 05] Rewrite collapse_huge_page lock comment to David's suggested
>   wording (David)
> - [patch 11] Propagate unsigned int return type for max_ptes_none; remove
>   the now-unnecessary negative return check (consequence of patch 04);
>   Add optimization to the next_order goto that will prevent unnecessary
>   iterations if there are no lower orders enabled (Vernon); update locking
>   comment; pass VMA to mthp_collapse to improve uffd-armed detection, and
>   prevent unnecessary work. (Wei)
> - [patch 14] Update documentation to reflect fallback-to-0 behavior
> 
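
For anyone reading along, here is a rough user-space sketch of two of the
v18 items quoted above: the khugepaged-path limit that
collapse_max_ptes_none() computes after the patch 04 change, and the patch 11
check for remaining lower enabled orders. It is only a model under stated
assumptions, not the kernel code. The model_max_ptes_none() and
has_lower_enabled_order() names are hypothetical, HPAGE_PMD_ORDER == 9 and
KHUGEPAGED_MAX_PTES_LIMIT == HPAGE_PMD_NR - 1 are assumed for a 4K-page
configuration, and the userfaultfd and MADV_COLLAPSE branches are left out.

```c
#include <assert.h>

/* Assumed values for a 4K-page configuration; not taken from the patch. */
#define HPAGE_PMD_ORDER			9
#define HPAGE_PMD_NR			(1U << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MAX_PTES_LIMIT	(HPAGE_PMD_NR - 1)

/*
 * Hypothetical model of the khugepaged path of collapse_max_ptes_none().
 * sysctl_val stands in for khugepaged_max_ptes_none.
 */
static unsigned int model_max_ptes_none(unsigned int sysctl_val,
					unsigned int order)
{
	assert(order <= HPAGE_PMD_ORDER);

	/* PMD collapse: respect the user defined maximum. */
	if (order == HPAGE_PMD_ORDER)
		return sysctl_val;
	/* mTHP collapse at the limit: scale to the order being collapsed. */
	if (sysctl_val == KHUGEPAGED_MAX_PTES_LIMIT)
		return (1U << order) - 1;
	/*
	 * 0 and any unsupported intermediate value both yield 0 here; the
	 * kernel version additionally warns once for intermediate values.
	 */
	return 0;
}

/*
 * Hypothetical helper mirroring the "(BIT(order) - 1) & enabled_orders"
 * test: non-zero only if some order strictly below 'order' is enabled.
 */
static int has_lower_enabled_order(unsigned int order,
				   unsigned long enabled_orders)
{
	return (((1UL << order) - 1) & enabled_orders) != 0;
}
```

Under these assumptions, a sysctl value of 511 gives a limit of 15 for an
order-4 collapse, while a value such as 100 falls back to 0 for mTHP orders
and is returned unchanged for PMD order.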

Below is how v18 altered mm.git.

Quite a lot of it seems to be replacement of "//"-style comments.  It's
unfortunate that this work isn't separated from the substantive
changes.  We could have done that with a few followup fixes rather than
a wholesale replacement of the series.


 Documentation/admin-guide/mm/transhuge.rst |    5 
 include/trace/events/huge_memory.h         |    3 
 mm/khugepaged.c                            |  121 +++++++++----------
 3 files changed, 66 insertions(+), 63 deletions(-)

--- a/Documentation/admin-guide/mm/transhuge.rst~b
+++ a/Documentation/admin-guide/mm/transhuge.rst
@@ -312,8 +312,9 @@ when collapsing a group of small pages i
 For PMD-sized THP collapse, this directly limits the number of empty pages
 allowed in the 2MB region.
 
-For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other value
-will emit a warning and no mTHP collapse will be attempted.
+For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At
+HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermediate
+value will emit a warning and mTHP collapse will default to max_ptes_none=0.
 
 A higher value allows more empty pages, potentially leading to more memory
 usage but better THP performance. A lower value is more conservative and
--- a/include/trace/events/huge_memory.h~b
+++ a/include/trace/events/huge_memory.h
@@ -39,8 +39,7 @@
 	EM( SCAN_STORE_FAILED,		"store_failed")			\
 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
 	EM( SCAN_PAGE_FILLED,		"page_filled")			\
-	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
-	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
+	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
 
 #undef EM
 #undef EMe
--- a/mm/khugepaged.c~b
+++ a/mm/khugepaged.c
@@ -61,7 +61,6 @@ enum scan_result {
 	SCAN_COPY_MC,
 	SCAN_PAGE_FILLED,
 	SCAN_PAGE_DIRTY_OR_WRITEBACK,
-	SCAN_INVALID_PTES_NONE,
 };
 
 #define CREATE_TRACE_POINTS
@@ -380,41 +379,43 @@ static bool pte_none_or_zero(pte_t pte)
 }
 
 /**
- * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
- * PTEs for the given collapse operation.
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs or PTEs mapping
+ * the shared zeropage for the given collapse operation.
  * @cc: The collapse control struct
  * @vma: The vma to check for userfaultfd
  * @order: The folio order being collapsed to
  *
- * Return: Maximum number of none-page or zero-page PTEs allowed for the
- * collapse operation.
+ * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
  */
-static int collapse_max_ptes_none(struct collapse_control *cc,
+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
 		struct vm_area_struct *vma, unsigned int order)
 {
 	unsigned int max_ptes_none = khugepaged_max_ptes_none;
-	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
+
 	if (vma && userfaultfd_armed(vma))
 		return 0;
-	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
+	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
-	// for PMD collapse, respect the user defined maximum.
+	/* for PMD collapse, respect the user defined maximum */
 	if (is_pmd_order(order))
 		return max_ptes_none;
-	/* Zero/non-present collapse disabled. */
-	if (!max_ptes_none)
-		return 0;
-	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
-	// scale the maximum number of PTEs to the order of the collapse.
+	/*
+	 * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
+	 * scale the maximum number of PTEs to the order of the collapse.
+	 */
 	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
 		return (1 << order) - 1;
-
-	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
-	// Emit a warning and return -EINVAL.
-	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
-		      KHUGEPAGED_MAX_PTES_LIMIT);
-	return -EINVAL;
+	if (!max_ptes_none)
+		return 0;
+	/*
+	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
+	 * emit a warning and return 0.
+	 */
+	pr_warn_once("mTHP collapse does not support max_ptes_none values"
+		     " other than 0 or %u, defaulting to 0.\n",
+		     KHUGEPAGED_MAX_PTES_LIMIT);
+	return 0;
 }
 
 /**
@@ -429,15 +430,19 @@ static int collapse_max_ptes_none(struct
 static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
 		unsigned int order)
 {
-	// for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
-	// anonymous pages.
+	/*
+	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
+	 * anonymous pages.
+	 */
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
-	// for mTHP collapse do not allow collapsing anonymous memory pages that
-	// are shared between processes.
+	/*
+	 * for mTHP collapse do not allow collapsing anonymous memory pages that
+	 * are shared between processes.
+	 */
 	if (!is_pmd_order(order))
 		return 0;
-	// for PMD collapse, respect the user defined maximum.
+	/* for PMD collapse, respect the user defined maximum */
 	return khugepaged_max_ptes_shared;
 }
 
@@ -453,14 +458,16 @@ static unsigned int collapse_max_ptes_sh
 static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
 		unsigned int order)
 {
-	// for MADV_COLLAPSE, do not restrict the number PTEs entries or
-	// pagecache entries that are non-present.
+	/*
+	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
+	 * pagecache entries that are non-present.
+	 */
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
-	// for mTHP collapse do not allow any non-present PTEs or pagecache entries.
+	/* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
 	if (!is_pmd_order(order))
 		return 0;
-	// for PMD collapse, respect the user defined maximum.
+	/* for PMD collapse, respect the user defined maximum */
 	return khugepaged_max_ptes_swap;
 }
 
@@ -593,9 +600,8 @@ static unsigned long collapse_allowable_
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
-	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED) &&
-	    hugepage_enabled())
+	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && hugepage_enabled()
+	    && collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
 		__khugepaged_enter(vma->vm_mm);
 }
 
@@ -670,6 +676,8 @@ static enum scan_result __collapse_huge_
 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
 		unsigned int order, struct list_head *compound_pagelist)
 {
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
 	const unsigned long nr_pages = 1UL << order;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
@@ -677,11 +685,6 @@ static enum scan_result __collapse_huge_
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
-	int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
-	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
-
-	if (max_ptes_none < 0)
-		return SCAN_INVALID_PTES_NONE;
 
 	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, addr += PAGE_SIZE) {
@@ -1136,6 +1139,10 @@ static enum scan_result check_pmd_still_
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
  *
+ * For mTHP orders the function bails on the first swap entry, because
+ * faulting pages back in during collapse could re-populate PTEs that
+ * push a later scan over the threshold for a higher-order collapse.
+ *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
@@ -1257,19 +1264,18 @@ static enum scan_result alloc_charge_fol
 		return SCAN_CGROUP_CHARGE_FAIL;
 	}
 
-	count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
+	if (is_pmd_order(order))
+		count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
 
 	*foliop = folio;
 	return SCAN_SUCCEED;
 }
 
 /*
- * collapse_huge_page expects the mmap_read_lock to be dropped before
- * entering this function. The function will also always return with the lock
- * dropped. The function starts by allocation a folio, which can potentially
- * take a long time if it involves sync compaction, and we do not need to hold
- * the mmap_lock during that. We must recheck the vma after taking it again in
- * write mode.
+ * collapse_huge_page expects the mmap_lock to be unlocked before entering and
+ * will always return with the lock unlocked, to avoid holding the mmap_lock
+ * while allocating a THP, as that could trigger direct reclaim/compaction.
+ * Note that the VMA must be rechecked after grabbing the mmap_lock again.
  */
 static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
 		int referenced, int unmapped, struct collapse_control *cc,
@@ -1500,12 +1506,12 @@ static unsigned int collapse_mthp_count_
  * If a collapse is permitted, we attempt to collapse the PTE range into a
  * mTHP.
  */
-static int mthp_collapse(struct mm_struct *mm, unsigned long address,
-		int referenced, int unmapped, struct collapse_control *cc,
-		unsigned long enabled_orders)
+static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, int referenced, int unmapped,
+		struct collapse_control *cc, unsigned long enabled_orders)
 {
-	unsigned int nr_occupied_ptes, nr_ptes;
-	int max_ptes_none, collapsed = 0, stack_size = 0;
+	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
+	int collapsed = 0, stack_size = 0;
 	unsigned long collapse_address;
 	struct mthp_range range;
 	u16 offset;
@@ -1522,10 +1528,7 @@ static int mthp_collapse(struct mm_struc
 		if (!test_bit(order, &enabled_orders))
 			goto next_order;
 
-		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
-
-		if (max_ptes_none < 0)
-			return collapsed;
+		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
 
 		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
 							       nr_ptes);
@@ -1565,7 +1568,7 @@ static int mthp_collapse(struct mm_struc
 		}
 
 next_order:
-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+		if ((BIT(order) - 1) & enabled_orders) {
 			const u8 next_order = order - 1;
 			const u16 mid_offset = offset + (nr_ptes / 2);
 
@@ -1582,9 +1585,9 @@ static enum scan_result collapse_scan_pm
 		struct vm_area_struct *vma, unsigned long start_addr,
 		bool *lock_dropped, struct collapse_control *cc)
 {
-	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
 	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
 	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
+	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
 	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
 	pmd_t *pmd;
 	pte_t *pte, *_pte, pteval;
@@ -1772,9 +1775,9 @@ out_unmap:
 	if (result == SCAN_SUCCEED) {
 		/* collapse_huge_page expects the lock to be dropped before calling */
 		mmap_read_unlock(mm);
-		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
-					      cc, enabled_orders);
-		/* collapse_huge_page will return with the mmap_lock released */
+		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
+					     unmapped, cc, enabled_orders);
+		/* mmap_lock was released above, set lock_dropped */
 		*lock_dropped = true;
 		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
 	}
@@ -2665,7 +2668,7 @@ static enum scan_result collapse_scan_fi
 		unsigned long addr, struct file *file, pgoff_t start,
 		struct collapse_control *cc)
 {
-	const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
+	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
 	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
_
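
An illustrative aside on the next_order: hunk above. The new condition,
(BIT(order) - 1) & enabled_orders, appears to ask "is any order below the
current one still enabled?" rather than only comparing against
KHUGEPAGED_MIN_MTHP_ORDER. Below is a minimal userspace sketch of that mask
check. has_lower_enabled_order() and the local BIT() definition are
hypothetical stand-ins for illustration, not kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's BIT() macro (assumption). */
#define BIT(n) (1UL << (n))

/*
 * Hypothetical helper: returns true if any order strictly below 'order'
 * is set in 'enabled_orders'. BIT(order) - 1 is a mask of bits
 * 0..order-1, so the AND is non-zero only when a smaller order is enabled.
 */
static bool has_lower_enabled_order(unsigned int order,
				    unsigned long enabled_orders)
{
	/* Shifting by the full width of the type would be undefined. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	return ((BIT(order) - 1) & enabled_orders) != 0;
}
```

With orders 9 and 4 enabled, the check is true at order 9 (order 4 is still
below it) and false at order 4, so under this reading the walk would stop
splitting the range once no smaller enabled order remains.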



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 20:47 ` Andrew Morton
@ 2026-06-01 15:58   ` Alexander Gordeev
  2026-06-01 17:05     ` Nico Pache
  2026-06-01 17:08     ` Lorenzo Stoakes
  0 siblings, 2 replies; 114+ messages in thread
From: Alexander Gordeev @ 2026-06-01 15:58 UTC (permalink / raw)
  To: Andrew Morton, Gerald Schaefer
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, linux-s390, linux-next

On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:

Hi Andrew et al,

> On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> 
> > The following series provides khugepaged with the capability to collapse
> > anonymous memory regions to mTHPs.
> 
> Thanks, I've updated mm.git's mm-unstable branch to this version.
> 
> It sounds like I might be dropping it soon, haven't started looking at
> that yet.  But let's at least eyeball the latest version at this time.
> 
> Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> well, thanks.  The AI checking made a few allegations:

This series appears to cause hangs on s390 in linux-next.
The issue is not easily reproducible, so it is not yet confirmed.
Any ideas for a reliable reproducer that exercises the code path below?

    [ 2749.385719] sysrq: Show Blocked State
    [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
    [ 2749.385735] Call Trace:
    [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
    [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
    [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
    [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
    [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
    [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
    [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
    [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
    [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
    [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
    [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
    [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
    [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
    [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30

Thanks!


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-06-01 15:58   ` Alexander Gordeev
@ 2026-06-01 17:05     ` Nico Pache
  2026-06-01 17:08     ` Lorenzo Stoakes
  1 sibling, 0 replies; 114+ messages in thread
From: Nico Pache @ 2026-06-01 17:05 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, Gerald Schaefer, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
	baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
	david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
	jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	linux-s390, linux-next

On Mon, Jun 1, 2026 at 9:58 AM Alexander Gordeev <agordeev@linux.ibm.com> wrote:
>
> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>
> Hi Andrew et al,
>
> > On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Thanks, I've updated mm.git's mm-unstable branch to this version.
> >
> > It sounds like I might be dropping it soon, haven't started looking at
> > that yet.  But let's at least eyeball the latest version at this time.
> >
> > Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> > well, thanks.  The AI checking made a few allegations:
>
> This series appears to cause hangs on s390 in linux-next.
> The issue is not easily reproducible, so it is not yet confirmed.
> Any ideas for a reliable reproducer that exercises the code path below?

Hi,

Thanks for the report!

Was this caught by syzbot? If so, can you provide a link?

Also, can you tell us whether any of the mTHP sysfs settings were enabled?

Based on the report, it looks like we may be dealing with more lock
contention (due to holding the write lock longer). We could switch to a
trylock, but that might cause us to lose some collapse attempts (which
will be retried later, so probably fine). I'm ok with that approach if
it prevents these potential regressions.
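
To make the trade-off concrete, here is a rough userspace sketch of the
trylock idea, not the actual khugepaged change. Everything in it is a
hypothetical stand-in: struct toy_mm, the atomic_flag used in place of the
write side of mmap_lock, and try_collapse(). It only shows the shape of
"skip this attempt if the lock is contended and let a later scan retry".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum attempt_result { ATTEMPT_DONE, ATTEMPT_SKIPPED };

/* Toy model of an mm: one exclusive lock plus two counters (assumption). */
struct toy_mm {
	atomic_flag write_locked;	/* stand-in for the mmap write lock */
	unsigned int collapsed;		/* attempts that ran under the lock */
	unsigned int skipped;		/* attempts dropped due to contention */
};

static void toy_mm_init(struct toy_mm *mm)
{
	assert(mm != NULL);
	atomic_flag_clear(&mm->write_locked);
	mm->collapsed = 0;
	mm->skipped = 0;
}

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_write_trylock(struct toy_mm *mm)
{
	return !atomic_flag_test_and_set_explicit(&mm->write_locked,
						  memory_order_acquire);
}

static void toy_write_unlock(struct toy_mm *mm)
{
	atomic_flag_clear_explicit(&mm->write_locked, memory_order_release);
}

/*
 * Instead of sleeping until the lock is free, give up immediately when it
 * is contended. The caller is expected to revisit the range on a later scan.
 */
static enum attempt_result try_collapse(struct toy_mm *mm)
{
	if (!toy_write_trylock(mm)) {
		mm->skipped++;
		return ATTEMPT_SKIPPED;
	}

	mm->collapsed++;	/* the collapse work itself would go here */
	toy_write_unlock(mm);
	return ATTEMPT_DONE;
}
```

The cost is visible in the skipped counter: every contended attempt is lost
for that pass, which is acceptable only if a later scan picks it up again.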

Cheers,
-- Nico

>
>     [ 2749.385719] sysrq: Show Blocked State
>     [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
>     [ 2749.385735] Call Trace:
>     [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
>     [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>     [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>     [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>     [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
>     [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>     [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>     [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>     [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>     [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>     [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
>     [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
>     [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>     [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>
> Thanks!
>



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-06-01 15:58   ` Alexander Gordeev
  2026-06-01 17:05     ` Nico Pache
@ 2026-06-01 17:08     ` Lorenzo Stoakes
  2026-06-02  1:53       ` Lance Yang
  1 sibling, 1 reply; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-01 17:08 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
	linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, linux-s390, linux-next

On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>
> Hi Andrew et al,
>
> > On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Thanks, I've updated mm.git's mm-unstable branch to this version.
> >
> > It sounds like I might be dropping it soon, haven't started looking at
> > that yet.  But let's at least eyeball the latest version at this time.
> >
> > Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> > well, thanks.  The AI checking made a few allegations:
>
> This series appears to cause hangs on s390 in linux-next.
> The issue is not easily reproducible, so it is not yet confirmed.
> Any ideas for a reliable reproducer that exercises the code path below?
>
>     [ 2749.385719] sysrq: Show Blocked State
>     [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
>     [ 2749.385735] Call Trace:
>     [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
>     [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>     [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>     [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>     [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
>     [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>     [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>     [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>     [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>     [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>     [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
>     [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
>     [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>     [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>
> Thanks!

Hi Alexander,

Thanks for the report.

It's a pity it's non-repro. I had Claude have a look at it and it couldn't find
a definite issue with the code at v18; all the locks seem balanced internally.

Things it highlighted FWIW:

- Far more mmap_write_lock()'s being taken - the stack-based approach calls
  collapse_huge_page() multiple times per-PMD, each of which entails an mmap
  read lock/unlock and an mmap write lock.

- anon_vma write lock held for a much longer period over partial collapse.

So maybe these are triggering issues rather than being the cause of them per se?

If you happen to see it again could you give the output for:

'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
get more details on it?

Also the .config would be useful.

I'm guessing you've also not enabled mTHP in any way on the system?

Repro-wise you could also:

# echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
# echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

To get khugepaged going more aggressively:

$ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done

Then maybe some stress-ng like sudo stress-ng --vm 4 --vm-bytes 2G --vm-method
all --timeout 5m (or maybe something more refined :)?

Maybe some of this will help repro more reliably?

Cheers, Lorenzo


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-06-01 17:08     ` Lorenzo Stoakes
@ 2026-06-02  1:53       ` Lance Yang
  0 siblings, 0 replies; 114+ messages in thread
From: Lance Yang @ 2026-06-02  1:53 UTC (permalink / raw)
  To: Lorenzo Stoakes, Alexander Gordeev
  Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
	linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	linux-s390, linux-next



On 2026/6/2 01:08, Lorenzo Stoakes wrote:
> On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
>> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>>
>> Hi Andrew et al,
>>
>>> On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
>>>
>>>> The following series provides khugepaged with the capability to collapse
>>>> anonymous memory regions to mTHPs.
>>>
>>> Thanks, I've updated mm.git's mm-unstable branch to this version.
>>>
>>> It sounds like I might be dropping it soon, haven't started looking at
>>> that yet.  But let's at least eyeball the latest version at this time.
>>>
>>> Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
>>> well, thanks.  The AI checking made a few allegations:
>>
>> This series appears to cause hangs on s390 in linux-next.
>> The issue is not easily reproducible, so it is not yet confirmed.
>> Any ideas for a reliable reproducer that exercises the code path below?
>>
>>      [ 2749.385719] sysrq: Show Blocked State
>>      [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
>>      [ 2749.385735] Call Trace:
>>      [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
>>      [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>>      [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>>      [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>>      [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
>>      [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>>      [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>>      [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>>      [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>>      [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>>      [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
>>      [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
>>      [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>>      [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>>
>> Thanks!
> 
> Hi Alexander,
> 
> Thanks for the report.
> 
> It's a pity it's non-repro, I had Claude have a look at it and it couldn't find
> a definite issue with the code at v18, all the locks seem balanced internally.
> 
> Things it highlighted FWIW:
> 
> - Far more mmap_write_lock()'s being taken - the stack-based approach calls
>    collapse_huge_page() multiple times per-PMD each of which entails an mmap read
>    lock/unlock and mmap write lock.
> 
> - anon_vma write lock held for a much longer period over partial collapse.
> 
> So maybe these are triggering issues rather than being the cause of them per se?
> 
> If you happen to see it again could you give the output for:
> 
> 'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
> get more details on it?
> 
> Also the .config would be useful.
> 
> I'm guessing you've also not enabled mTHP in any way on the system?
> 
> Repro-wise you could also:
> 
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> 
> To get khugepaged going more aggressively:
> 
> $ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done
> 
> Then maybe some stress-ng like sudo stress-ng --vm 4 --vm-bytes 2G --vm-method
> all --timeout 5m (or maybe something more refined :)?
> 
> Maybe some of this will help repro more reliably?
> 

Cool!

Maybe also worth trying with CONFIG_DETECT_HUNG_TASK=y and
CONFIG_DETECT_HUNG_TASK_BLOCKER=y.

# detect after 10s in D state instead of default 120s
echo 10 > /proc/sys/kernel/hung_task_timeout_secs

# optional: check more often; 0 means same as timeout
echo 0 > /proc/sys/kernel/hung_task_check_interval_secs

With that enabled, the kernel should hopefully tell us which task likely
owns the rwsem. If it is writer-owned, I would expect that to be fairly
reliable.

Cheers, Lance


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
  2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
                   ` (16 preceding siblings ...)
  2026-05-22 20:47 ` Andrew Morton
@ 2026-06-04 10:10 ` Lorenzo Stoakes
  17 siblings, 0 replies; 114+ messages in thread
From: Lorenzo Stoakes @ 2026-06-04 10:10 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe

Not sure if already addressed for v19 but I just tried building a kernel in
mm-unstable and saw this (clang 22.1.6):

mm/khugepaged.c:1357:6: warning: variable 'pte' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
 1357 |         if (result != SCAN_SUCCEED)
      |             ^~~~~~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1458:6: note: uninitialized use occurs here
 1458 |         if (pte)
      |             ^~~
mm/khugepaged.c:1357:2: note: remove the 'if' if its condition is always false
 1357 |         if (result != SCAN_SUCCEED)
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
 1358 |                 goto out_up_write;
      |                 ~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1352:6: warning: variable 'pte' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
 1352 |         if (result != SCAN_SUCCEED)
      |             ^~~~~~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1458:6: note: uninitialized use occurs here
 1458 |         if (pte)
      |             ^~~
mm/khugepaged.c:1352:2: note: remove the 'if' if its condition is always false
 1352 |         if (result != SCAN_SUCCEED)
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
 1353 |                 goto out_up_write;
      |                 ~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1296:12: note: initialize the variable 'pte' to silence this warning
 1296 |         pte_t *pte;
      |                   ^
      |                    = NULL
2 warnings generated.


Is this already addressed in v19/review here? If not, could you please address it?
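
For reference, the pattern clang is flagging can be shown with a small
standalone sketch. This is not the khugepaged code: toy_collapse(),
toy_unmap() and the int pointer standing in for pte_t * are hypothetical.
It only illustrates why an early goto to a cleanup label that tests the
pointer needs the "= NULL" initializer the compiler note suggests.

```c
#include <assert.h>
#include <stddef.h>

enum toy_result { TOY_SUCCEED, TOY_FAIL };

/* Counts cleanup calls so the behaviour can be observed (assumption). */
static int toy_unmap_calls;

static void toy_unmap(int *pte)
{
	(void)pte;
	toy_unmap_calls++;
}

static enum toy_result toy_collapse(enum toy_result revalidate_result,
				    int *mapped)
{
	/*
	 * Without "= NULL", the early goto below reaches "if (pte)" with an
	 * indeterminate value, which is what -Wsometimes-uninitialized
	 * reports.
	 */
	int *pte = NULL;
	enum toy_result result = revalidate_result;

	assert(mapped != NULL);

	if (result != TOY_SUCCEED)
		goto out;	/* pte has not been assigned on this path */

	pte = mapped;		/* stand-in for mapping the page table */

out:
	if (pte)
		toy_unmap(pte);
	return result;
}
```

On the failure path the cleanup is skipped because pte is still NULL; on
the success path it runs exactly once.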

Thanks, Lorenzo


end of thread, other threads:[~2026-06-05 16:05 UTC | newest]

Thread overview: 114+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
2026-05-22 14:59 ` [PATCH mm-unstable v18 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2026-05-22 14:59 ` [PATCH mm-unstable v18 02/14] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
2026-05-22 21:16   ` David Hildenbrand (Arm)
2026-06-01 13:26   ` Lorenzo Stoakes
2026-06-05 16:04   ` Zi Yan
2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2026-05-22 21:24   ` David Hildenbrand (Arm)
2026-05-26 14:39   ` Nico Pache
2026-06-01 14:04   ` Lorenzo Stoakes
2026-05-22 15:00 ` [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Nico Pache
2026-06-01 14:07   ` Lorenzo Stoakes
2026-06-02 10:26     ` Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2026-05-22 21:47   ` David Hildenbrand (Arm)
2026-05-26 14:42   ` Nico Pache
2026-05-31  9:39   ` Lance Yang
2026-05-31 20:00     ` David Hildenbrand (Arm)
2026-06-01  3:28       ` Lance Yang
2026-06-01  6:54         ` David Hildenbrand (Arm)
2026-06-01  7:49           ` Lance Yang
2026-06-01  8:15             ` David Hildenbrand (Arm)
2026-06-01  8:44               ` Lance Yang
2026-06-01 10:09                 ` David Hildenbrand (Arm)
2026-06-01  9:08           ` Lance Yang
2026-06-01 10:23             ` David Hildenbrand (Arm)
2026-06-01 10:47               ` Lance Yang
2026-06-01 11:13                 ` David Hildenbrand (Arm)
2026-06-01 15:00                   ` Nico Pache
2026-06-01 15:05                     ` David Hildenbrand (Arm)
2026-06-01 16:07                       ` Lance Yang
2026-06-04 17:04                     ` Nico Pache
2026-06-04 18:12                       ` Lorenzo Stoakes
2026-06-05  7:18                       ` David Hildenbrand (Arm)
2026-06-05  8:07                         ` Lorenzo Stoakes
2026-06-05  8:59                           ` Lance Yang
2026-06-02 15:30                 ` Nico Pache
2026-06-02 16:34                   ` Lance Yang
2026-06-04 12:33           ` Lorenzo Stoakes
2026-06-04 10:21   ` Lorenzo Stoakes
2026-06-04 10:32     ` Nico Pache
2026-06-04 11:38   ` Lorenzo Stoakes
2026-06-04 12:39     ` Lorenzo Stoakes
2026-06-04 12:45       ` Nico Pache
2026-06-04 12:55         ` Lorenzo Stoakes
2026-06-04 16:28           ` Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2026-05-22 21:51   ` David Hildenbrand (Arm)
2026-05-22 15:00 ` [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2026-05-31 20:09   ` David Hildenbrand (Arm)
2026-06-01 14:13   ` Lorenzo Stoakes
2026-05-22 15:00 ` [PATCH mm-unstable v18 09/14] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
2026-05-31 20:18   ` David Hildenbrand (Arm)
2026-06-01 14:35     ` Lorenzo Stoakes
2026-06-01 14:40       ` David Hildenbrand (Arm)
2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
2026-05-25 14:15   ` Nico Pache
2026-05-25 19:10     ` Andrew Morton
2026-05-26  6:57       ` Wei Yang
2026-05-26 12:07         ` Nico Pache
2026-05-28  8:42           ` Wei Yang
2026-05-28 17:11             ` Nico Pache
2026-05-31  7:18   ` Lance Yang
2026-05-31  8:48     ` Lance Yang
2026-06-01 12:01       ` Nico Pache
2026-06-01 12:06         ` David Hildenbrand (Arm)
2026-06-02 10:58     ` Nico Pache
2026-06-02 15:44       ` Lance Yang
2026-06-03  8:05         ` David Hildenbrand (Arm)
2026-06-04 14:40           ` Lorenzo Stoakes
2026-06-01  8:11   ` David Hildenbrand (Arm)
2026-06-01 12:40     ` Nico Pache
2026-06-01 13:15       ` David Hildenbrand (Arm)
2026-06-02 17:23         ` Nico Pache
2026-06-02 17:26           ` Nico Pache
2026-06-03  9:55           ` David Hildenbrand (Arm)
2026-06-03 10:00           ` David Hildenbrand (Arm)
2026-06-03 12:16             ` Nico Pache
2026-06-03 12:27               ` David Hildenbrand (Arm)
2026-06-04 14:14           ` Lorenzo Stoakes
2026-06-04 14:19             ` Lorenzo Stoakes
2026-06-04 13:53     ` Lorenzo Stoakes
2026-06-04 13:59       ` Lorenzo Stoakes
2026-06-04 14:45   ` Lorenzo Stoakes
2026-06-05 11:07     ` Nico Pache
2026-06-05 11:08       ` Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2026-05-31  7:31   ` Lance Yang
2026-05-31 20:02     ` David Hildenbrand (Arm)
2026-06-01  1:53       ` Lance Yang
2026-05-22 15:00 ` [PATCH mm-unstable v18 13/14] mm/khugepaged: run khugepaged for all orders Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2026-05-22 21:58   ` David Hildenbrand (Arm)
2026-05-26 12:00     ` Nico Pache
2026-05-26 14:45   ` Nico Pache
2026-05-22 15:07 ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
2026-05-22 15:13   ` Vlastimil Babka (SUSE)
2026-05-22 16:11     ` Nico Pache
2026-05-22 21:13       ` David Hildenbrand (Arm)
2026-05-22 15:16   ` Lorenzo Stoakes
2026-05-22 16:08     ` Nico Pache
2026-05-22 16:19       ` Lorenzo Stoakes
2026-05-22 16:31         ` Nico Pache
2026-05-22 17:12           ` Lorenzo Stoakes
2026-05-26  8:14             ` Lorenzo Stoakes
2026-05-22 15:13 ` Lorenzo Stoakes
2026-05-22 20:47 ` Andrew Morton
2026-06-01 15:58   ` Alexander Gordeev
2026-06-01 17:05     ` Nico Pache
2026-06-01 17:08     ` Lorenzo Stoakes
2026-06-02  1:53       ` Lance Yang
2026-06-04 10:10 ` Lorenzo Stoakes
