* [PATCH v9 00/14] khugepaged: mTHP support
@ 2025-07-14  0:31 Nico Pache
  2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
                   ` (14 more replies)
  0 siblings, 15 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of
pages (defined by KHUGEPAGED_MIN_MTHP_ORDER) that are utilized. After the
PMD scan is done, we do binary recursion on the bitmap to find the optimal
mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
during the scan, to make sure we account for the whole PMD range. When no
mTHP size is enabled, the legacy behavior of khugepaged is maintained.
max_ptes_none will be scaled by the attempted collapse order to determine
how full an mTHP must be to be eligible for the collapse to occur. If an
mTHP collapse is attempted but the range contains swapped-out or shared
pages, we don't perform the collapse. It is now also possible to collapse
to mTHPs without requiring the PMD THP size to be enabled.
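
As a point of reference, here is a minimal userspace sketch (not kernel
code) of the max_ptes_none scaling used by the series; it mirrors the
shift applied in __collapse_huge_page_isolate() and assumes a 4K base
page size (HPAGE_PMD_ORDER == 9), with everything else purely
illustrative:

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* assumes 4K base pages / 2M PMD */

/* Empty (none/zero) PTEs tolerated for a collapse attempt at 'order' */
static int scaled_max_ptes_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	int orders[] = { 9, 7, 5, 3, 2 };
	unsigned int i;

	for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++)
		printf("order %d: up to %d empty PTEs allowed (max_ptes_none=511)\n",
		       orders[i], scaled_max_ptes_none(511, orders[i]));
	return 0;
}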

With the default max_ptes_none=511, the code should keep most of its
original behavior. When enabling multiple adjacent (m)THP sizes we need to
set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
experience collapse "creep" and constantly promote mTHPs to the next
available size. This is due to the fact that a collapse will at least
double the number of utilized pages, so a future scan will satisfy the
promotion condition once again.
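
For example (assuming a 4K base page size, so HPAGE_PMD_NR = 512, and the
scaled threshold above, max_ptes_none >> (HPAGE_PMD_ORDER - order)): with
max_ptes_none = 300 (> 256), an order-4 region with only 8 of 16 PTEs
populated is eligible (8 empty <= 300 >> 5 = 9). After the collapse all 16
PTEs are populated, so on the next scan the enclosing order-5 region has
at most 16 of 32 PTEs empty (16 <= 300 >> 4 = 18) and is promoted again,
and so on up to the PMD. With max_ptes_none = 255 the order-5 step would
allow at most 255 >> 4 = 15 empty PTEs, so the creep stops.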

Patch 1:     Refactor/rename hpage_collapse
Patch 2:     Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5:   Generalize khugepaged functions for arbitrary orders
Patch 6-9:   The mTHP patches
Patch 10-11: Allow khugepaged to operate without PMD enabled
Patch 12-13: Tracing/stats
Patch 14:    Documentation

---------
 Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
   while monitoring a number of stats and tracepoints. The code is
   available here[1] (Run in legacy mode for these changes and set mthp
   sizes to inherit)
   The summary from my testing was that there was no significant
   regression noticed through this test. In some cases my changes had
   better collapse latencies and were able to scan more pages in the same
   amount of time/work, but for the most part the results were consistent.
- redis testing. I tested these changes along with my defer changes
  (see followup [4] post for more details). We've decided to get the mTHP
  changes merged first before attempting the defer series.
- some basic testing on 64k page size.
- lots of general use.

V9 Changes:
- Drop madvise_collapse support [2]. Further discussion needed.
- Add documentation entries for new stats (Baolin)
- Fix missing stat update (MTHP_STAT_COLLAPSE_EXCEED_SWAP) that was
  accidentally dropped in v7 (Baolin)
- Fix mishandled conflict noted in v8 (merged into wrong commit)
- change rename from khugepaged to collapse (Dev)

V8 Changes: [3]
- Fix mishandled conflict with shmem config changes (Baolin)
- Add Baolin's patches for allowing collapse without PMD enabled
- Add additional patch for allowing madvise_collapse without PMD enabled
- Documentation nits (Randy)
- Simplify SCAN_ANY_PROCESS lock jumbling (Liam)
- Add a BUG_ON to the mTHP collapse similar to PMD (Dev)
- Remove doc comment about khugepaged PMD only limitation (Dev)
- Change revalidation function to accept multiple orders
- Handled conflicts introduced by Lorenzo's madvise changes

V7 (RESEND)

V6 Changes:
- Don't release the anon_vma_lock early (like in the PMD case), as not all
  pages are isolated.
- Define the PTE as NULL to avoid an uninitialized condition
- minor nits and newline cleanup
- make sure to unmap and unlock the pte for the swapin case
- change the revalidation to always check the PMD order (as this will make
  sure that no other VMA spans it)

V5 Changes:
- switched the order of patches 1 and 2
- fixed some edge cases on the unified madvise_collapse and khugepaged
- Explained the "creep" some more in the docs
- fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
- fix potential highmem issue caused by a early unmap of the PTE

V4 Changes:
- Rebased onto mm-unstable
- small changes to Documentation

V3 Changes:
- corrected legacy behavior for khugepaged and madvise_collapse
- added proper mTHP stat tracking
- Minor changes to prevent a nested lock on non-split-lock arches
- Took Dev's version of alloc_charge_folio as it has the proper stats
- Skip cases where trying to collapse to a lower order would still fail
- Fixed cases where the bitmap was not being updated properly
- Moved Documentation update to this series instead of the defer set
- Minor bugs discovered during testing and review
- Minor "nit" cleanup

V2 Changes:
- Minor bug fixes discovered during review and testing
- removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
- Updated trace events to include collapsing order info.
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
- No longer require a chunk to be fully utilized before setting the bit.
   Use the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps
   prevent some of the "creep" that was discovered in v1.

A big thanks to everyone that has reviewed, tested, and participated in
the development process. It's been a great experience working with all of
you on this long endeavour.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/all/23b8ad10-cd1f-45df-a25c-78d01c8af44f@redhat.com/
[3] - https://lore.kernel.org/lkml/20250702055742.102808-1-npache@redhat.com/
[4] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Baolin Wang (2):
  khugepaged: allow khugepaged to check all anonymous mTHP orders
  khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs

Dev Jain (1):
  khugepaged: generalize alloc_charge_folio()

Nico Pache (11):
  khugepaged: rename hpage_collapse_* to collapse_*
  introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce collapse_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: skip collapsing mTHP to smaller orders
  khugepaged: avoid unnecessary mTHP collapse attempts
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: add per-order mTHP khugepaged stats
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  44 +-
 include/linux/huge_mm.h                    |   5 +
 include/linux/khugepaged.h                 |   4 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 514 ++++++++++++++-------
 6 files changed, 434 insertions(+), 178 deletions(-)

-- 
2.50.0




* [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
@ 2025-07-14  0:31 ` Nico Pache
  2025-07-15 15:39   ` David Hildenbrand
                     ` (2 more replies)
  2025-07-14  0:31 ` [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

The hpage_collapse_* functions are used by both madvise_collapse and
khugepaged. Remove the unnecessary hpage prefix to shorten the function
names.

Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a55fb1dcd224..eb0babb51868 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
 	kmem_cache_destroy(mm_slot_cache);
 }
 
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int collapse_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 {
-	return hpage_collapse_test_exit(mm) ||
+	return collapse_test_exit(mm) ||
 	       test_bit(MMF_DISABLE_THP, &mm->flags);
 }
 
@@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 	int wakeup;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	VM_BUG_ON_MM(collapse_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
 		return;
 
@@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * collapse_test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we return all
 		 * pagetables will be destroyed) until khugepaged has finished
 		 * working on the pagetables under the mmap_lock.
@@ -838,7 +838,7 @@ struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 };
 
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -873,7 +873,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -892,7 +892,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -912,7 +912,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -985,7 +985,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1071,7 +1071,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
-	int node = hpage_collapse_find_target_node(cc);
+	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
 	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1257,7 +1257,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+static int collapse_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
 				   struct collapse_control *cc)
@@ -1371,7 +1371,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1440,7 +1440,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (collapse_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
@@ -1733,7 +1733,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 
-		if (hpage_collapse_test_exit(mm))
+		if (collapse_test_exit(mm))
 			continue;
 		/*
 		 * When a vma is registered with uffd-wp, we cannot recycle
@@ -2255,7 +2255,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2312,7 +2312,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			folio_put(folio);
 			break;
@@ -2362,7 +2362,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
+static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2400,7 +2400,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2408,7 +2408,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+		if (unlikely(collapse_test_exit_or_disable(mm))) {
 			progress++;
 			break;
 		}
@@ -2430,7 +2430,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2490,7 +2490,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (collapse_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2544,7 +2544,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 			pass_through_head++;
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
-			progress += khugepaged_scan_mm_slot(pages - progress,
+			progress += collapse_scan_mm_slot(pages - progress,
 							    &result, cc);
 		else
 			progress = pages;
-- 
2.50.0




* [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
  2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2025-07-14  0:31 ` Nico Pache
  2025-07-15 15:53   ` David Hildenbrand
  2025-07-16 15:12   ` Liam R. Howlett
  2025-07-14  0:31 ` [PATCH v9 03/14] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.

Create collapse_single_pmd to increase code reuse and provide a single
entry point for these two users.

Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. This introduces a minor behavioral change
that addresses what is most likely an undiscovered bug: the current
implementation of khugepaged tests collapse_test_exit_or_disable before
calling collapse_pte_mapped_thp, but we weren't doing it in the
madvise_collapse case. By unifying these two callers, madvise_collapse
now also performs this check.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 95 +++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 46 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eb0babb51868..47a80638af97 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2362,6 +2362,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int collapse_single_pmd(unsigned long addr,
+				   struct vm_area_struct *vma, bool *mmap_locked,
+				   struct collapse_control *cc)
+{
+	int result = SCAN_FAIL;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (!vma_is_anonymous(vma)) {
+		struct file *file = get_file(vma->vm_file);
+		pgoff_t pgoff = linear_page_index(vma, addr);
+
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+		result = collapse_scan_file(mm, addr, file, pgoff, cc);
+		fput(file);
+		if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+			mmap_read_lock(mm);
+			*mmap_locked = true;
+			if (collapse_test_exit_or_disable(mm)) {
+				mmap_read_unlock(mm);
+				*mmap_locked = false;
+				result = SCAN_ANY_PROCESS;
+				goto end;
+			}
+			result = collapse_pte_mapped_thp(mm, addr,
+							 !cc->is_khugepaged);
+			if (result == SCAN_PMD_MAPPED)
+				result = SCAN_SUCCEED;
+			mmap_read_unlock(mm);
+			*mmap_locked = false;
+		}
+	} else {
+		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
+	}
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+end:
+	return result;
+}
+
 static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2436,34 +2480,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						vma, &mmap_locked, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
@@ -2780,35 +2799,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
 		nodes_clear(cc->alloc_nmask);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
+		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
 		if (!mmap_locked)
 			*lock_dropped = true;
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
+		case SCAN_PTE_MAPPED_HUGEPAGE:
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
-- 
2.50.0




* [PATCH v9 03/14] khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
  2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
  2025-07-14  0:31 ` [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-07-14  0:31 ` Nico Pache
  2025-07-15 15:55   ` David Hildenbrand
  2025-07-14  0:31 ` [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio() Nico Pache
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

For khugepaged to support different mTHP orders, we must generalize this
function to check that the PMD range is not shared by another VMA and
that the requested order is enabled.

To ensure madvise_collapse can support working on mTHP orders without the
PMD order enabled, we need to convert hugepage_vma_revalidate to take a
bitmap of orders.

No functional change in this patch.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 47a80638af97..fa0642e66790 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -907,7 +907,7 @@ static int collapse_find_target_node(struct collapse_control *cc)
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   bool expect_anon,
 				   struct vm_area_struct **vmap,
-				   struct collapse_control *cc)
+				   struct collapse_control *cc, unsigned long orders)
 {
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
@@ -919,9 +919,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
+	/* Always check the PMD order to ensure it's not shared by another VMA */
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_orders(vma, vma->vm_flags, tva_flags, orders))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1123,7 +1124,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 BIT(HPAGE_PMD_ORDER));
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1157,7 +1159,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 BIT(HPAGE_PMD_ORDER));
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2788,7 +2791,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
+							 cc, BIT(HPAGE_PMD_ORDER));
 			if (result  != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
-- 
2.50.0




* [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio()
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (2 preceding siblings ...)
  2025-07-14  0:31 ` [PATCH v9 03/14] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-07-14  0:31 ` Nico Pache
  2025-07-16 13:46   ` David Hildenbrand
  2025-07-14  0:31 ` [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

From: Dev Jain <dev.jain@arm.com>

Pass order to alloc_charge_folio() and update mTHP statistics.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
 include/linux/huge_mm.h                    |  2 ++
 mm/huge_memory.c                           |  4 ++++
 mm/khugepaged.c                            | 17 +++++++++++------
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..2c523dce6bc7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -583,6 +583,14 @@ anon_fault_fallback_charge
 	instead falls back to using huge pages with lower orders or
 	small pages even though the allocation was successful.
 
+collapse_alloc
+	is incremented every time a huge page is successfully allocated for a
+	khugepaged collapse.
+
+collapse_alloc_failed
+	is incremented every time a huge page allocation fails during a
+	khugepaged collapse.
+
 zswpout
 	is incremented every time a huge page is swapped out to zswap in one
 	piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..4042078e8cc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -125,6 +125,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bd7a623d7ef8..e2ed9493df77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -614,6 +614,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -679,6 +681,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa0642e66790..cc9a35185604 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1068,21 +1068,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, u8 order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (order == HPAGE_PMD_ORDER)
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}
 
-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (order == HPAGE_PMD_ORDER)
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1119,7 +1124,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1843,7 +1848,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.50.0




* [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (3 preceding siblings ...)
  2025-07-14  0:31 ` [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2025-07-14  0:31 ` Nico Pache
  2025-07-16 13:52   ` David Hildenbrand
                     ` (2 more replies)
  2025-07-14  0:31 ` [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap " Nico Pache
                   ` (9 subsequent siblings)
  14 siblings, 3 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

Generalize the __collapse_huge_page_* functions to take an order so they
can support future mTHP collapse.

mTHP collapse can suffer from inconsistent behavior and memory waste
"creep", so disable swapin and shared-page support for mTHP collapse.

No functional changes in this patch.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 49 +++++++++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cc9a35185604..ee54e3c1db4e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -552,15 +552,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
 					struct collapse_control *cc,
-					struct list_head *compound_pagelist)
+					struct list_head *compound_pagelist,
+					u8 order)
 {
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	bool writable = false;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1 << order);
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none(pteval) || (pte_present(pteval) &&
@@ -568,7 +570,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= scaled_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -596,8 +598,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared)) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -698,13 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist,
+						u8 order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1 << order);
 	     _pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
@@ -751,7 +754,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 					     pmd_t *pmd,
 					     pmd_t orig_pmd,
 					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+					     struct list_head *compound_pagelist,
+					     u8 order)
 {
 	spinlock_t *pmd_ptl;
 
@@ -768,7 +772,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
 }
 
 /*
@@ -789,7 +793,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, u8 order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
@@ -797,7 +801,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -816,10 +820,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 compound_pagelist, order);
 
 	return result;
 }
@@ -994,11 +998,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
 				       struct vm_area_struct *vma,
 				       unsigned long haddr, pmd_t *pmd,
-				       int referenced)
+				       int referenced, u8 order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long address, end = haddr + (PAGE_SIZE << order);
 	int result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1029,6 +1033,15 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
 
+		/* Don't swap in for mTHP collapse */
+		if (order != HPAGE_PMD_ORDER) {
+			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+			pte_unmap(pte);
+			mmap_read_unlock(mm);
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1149,7 +1162,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+				referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1197,7 +1210,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      &compound_pagelist);
+					&compound_pagelist, HPAGE_PMD_ORDER);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1227,7 +1240,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
-- 
2.50.0




* [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap for mTHP support
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (4 preceding siblings ...)
  2025-07-14  0:31 ` [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2025-07-14  0:31 ` Nico Pache
  2025-07-16 14:03   ` David Hildenbrand
  2025-07-16 15:38   ` Liam R. Howlett
  2025-07-14  0:32 ` [PATCH v9 07/14] khugepaged: add " Nico Pache
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:31 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

khugepaged scans anon PMD ranges for potential collapse to a hugepage.
To add mTHP support we use this scan to instead record the utilized
chunks of the PMD range.

collapse_scan_bitmap uses a stack struct to recursively scan a bitmap
that represents chunks of utilized regions. We can then determine what
mTHP size fits best and in the following patch, we set this bitmap while
scanning the anon PMD. A minimum collapse order of 2 is used as this is
the lowest order supported by anon memory.

max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.

When attempting to collapse an order whose sysfs setting is "always",
always collapse to that order in a greedy manner without considering
the number of bits set.
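
As a concrete example (assuming a 4K base page size, so HPAGE_PMD_NR = 512
and each bit covers a KHUGEPAGED_MIN_MTHP_NR = 4 PTE chunk): with
max_ptes_none = 255, the threshold for the full PMD node is
(512 - 255 - 1) >> 2 = 64, so more than 64 of its 128 chunks must be set,
while an order-4 node only needs more than (512 - 255 - 1) >> 7 = 2 of its
4 chunks set.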

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/khugepaged.h |  4 ++
 mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 9 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index ff6120463745..0f957711a117 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,6 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
+#define KHUGEPAGED_MIN_MTHP_ORDER	2
+#define KHUGEPAGED_MIN_MTHP_NR	(1<<KHUGEPAGED_MIN_MTHP_ORDER)
+#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
 
 extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ee54e3c1db4e..59b2431ca616 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+struct scan_bit_state {
+	u8 order;
+	u16 offset;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -102,6 +107,18 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/*
+	 * bitmap used to collapse mTHP sizes.
+	 * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
+	 */
+	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
+	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
+};
+
+struct collapse_control khugepaged_collapse_control = {
+	.is_khugepaged = true,
 };
 
 /**
@@ -838,10 +855,6 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-struct collapse_control khugepaged_collapse_control = {
-	.is_khugepaged = true,
-};
-
 static bool collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
@@ -1115,7 +1128,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, bool *mmap_locked,
+				  u8 order, u16 offset)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1134,8 +1148,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * The allocation can take potentially a long time if it involves
 	 * sync compaction, and we do not need to hold the mmap_lock during
 	 * that. We will recheck the vma after taking it again in write mode.
+	 * If collapsing mTHPs we may have already released the read_lock.
 	 */
-	mmap_read_unlock(mm);
+	if (*mmap_locked) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+	}
 
 	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
@@ -1272,12 +1290,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
+	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
 	return result;
 }
 
+/* Recursive function to consume the bitmap */
+static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
+			int referenced, int unmapped, struct collapse_control *cc,
+			bool *mmap_locked, unsigned long enabled_orders)
+{
+	u8 order, next_order;
+	u16 offset, mid_offset;
+	int num_chunks;
+	int bits_set, threshold_bits;
+	int top = -1;
+	int collapsed = 0;
+	int ret;
+	struct scan_bit_state state;
+	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
+	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
+
+	while (top >= 0) {
+		state = cc->mthp_bitmap_stack[top--];
+		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
+		offset = state.offset;
+		num_chunks = 1 << (state.order);
+		// Skip mTHP orders that are not enabled
+		if (!test_bit(order, &enabled_orders))
+			goto next;
+
+		// copy the relevant section to a new bitmap
+		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
+				  MTHP_BITMAP_SIZE);
+
+		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
+		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
+				>> (HPAGE_PMD_ORDER - state.order);
+
+		// Check if the region is "almost full" based on the threshold
+		if (bits_set > threshold_bits || is_pmd_only
+			|| test_bit(order, &huge_anon_orders_always)) {
+			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
+					mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += (1 << order);
+				continue;
+			}
+		}
+
+next:
+		if (state.order > 0) {
+			next_order = state.order - 1;
+			mid_offset = offset + (num_chunks / 2);
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, mid_offset };
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, offset };
+		}
+	}
+	return collapsed;
+}
+
 static int collapse_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
@@ -1444,9 +1522,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
-- 
2.50.0




* [PATCH v9 07/14] khugepaged: add mTHP support
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (5 preceding siblings ...)
  2025-07-14  0:31 ` [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap " Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-14  0:32 ` [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
mTHPs are enabled we remove the restriction of max_ptes_none during the
scan phase so we don't bail out early and miss potential mTHP candidates.

After the scan is complete we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.

If an mTHP collapse is attempted but the range contains swapped-out or
shared pages, we don't perform the collapse.

For non-PMD collapse we must leave the anon VMA write locked until after
we collapse the mTHP: in the PMD case all the pages are isolated, but in
the non-PMD case this is not true, so we must keep the lock to prevent
changes to the VMA from occurring.

Currently mTHP collapse is not supported for madvise_collapse, which will
only attempt PMD collapse.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 142 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 99 insertions(+), 43 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 59b2431ca616..5d7c5be9097e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1133,13 +1133,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte = NULL, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	unsigned long _address = address + offset * PAGE_SIZE;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1155,13 +1156,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		*mmap_locked = false;
 	}
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 BIT(HPAGE_PMD_ORDER));
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1179,13 +1180,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-				referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+				referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1195,8 +1197,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 BIT(HPAGE_PMD_ORDER));
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -1207,11 +1208,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+				_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address to
@@ -1225,18 +1227,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-					&compound_pagelist, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+					&compound_pagelist, order);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1251,17 +1251,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	}
 
 	/*
-	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
+	 * For PMD collapse all pages are isolated and locked so anon_vma
+	 * rmap can't run anymore
 	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	if (order == HPAGE_PMD_ORDER)
+		anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   &compound_pagelist, HPAGE_PMD_ORDER);
-	pte_unmap(pte);
+					   vma, _address, pte_ptl,
+					   &compound_pagelist, order);
 	if (unlikely(result != SCAN_SUCCEED))
-		goto out_up_write;
+		goto out_unlock_anon_vma;
 
 	/*
 	 * The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1269,25 +1269,46 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
-	_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
-	spin_unlock(pmd_ptl);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
+		_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, address, pmd, _pmd);
+		update_mmu_cache_pmd(vma, address, pmd);
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	} else { /* mTHP collapse */
+		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		folio_ref_add(folio, (1 << order) - 1);
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
+out_unlock_anon_vma:
+	if (order != HPAGE_PMD_ORDER)
+		anon_vma_unlock_write(vma->anon_vma);
 out_up_write:
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
 out_nolock:
 	*mmap_locked = false;
@@ -1363,31 +1384,60 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long _address;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
+	bool is_pmd_only;
 	bool writable = false;
-
+	int chunk_none_count = 0;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	if (cc->is_khugepaged)
+		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			tva_flags, THP_ORDERS_ALL_ANON);
+	else
+		enabled_orders = BIT(HPAGE_PMD_ORDER);
+
+	is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte) {
 		result = SCAN_PMD_NULL;
 		goto out;
 	}
 
-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, _address += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		/*
+		 * We are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. If
+		 * there are pages in this chunk, keep track of it in the
+		 * bitmap for mTHP collapsing.
+		 */
+		if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
+			if (chunk_none_count <= scaled_none)
+				bitmap_set(cc->mthp_bitmap,
+					   i / KHUGEPAGED_MIN_MTHP_NR, 1);
+			chunk_none_count = 0;
+		}
+
+		_pte = pte + i;
+		_address = address + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
@@ -1410,10 +1460,11 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 			}
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++chunk_none_count;
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			    (!cc->is_khugepaged || !is_pmd_only ||
+				none_or_zero <= khugepaged_max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1509,6 +1560,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 								     address)))
 			referenced++;
 	}
+
 	if (!writable) {
 		result = SCAN_PAGE_RO;
 	} else if (cc->is_khugepaged &&
@@ -1521,8 +1573,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
+		result = collapse_scan_bitmap(mm, address, referenced, unmapped, cc,
+			       mmap_locked, enabled_orders);
+		if (result > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (6 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 07/14] khugepaged: add " Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-16 14:32   ` David Hildenbrand
  2025-07-14  0:32 ` [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
if it's OK to collapse to a smaller mTHP size (like in the case of a
partially mapped folio).
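
For illustration (a sketch mirroring the hunk below, not additional code):
with 4K base pages, an order-4 (64K) collapse attempted over a range that
still maps pages of an order-5 (128K) folio could leave part of that folio
unmapped, so the isolation step now bails out for any non-PMD order:

	if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
		result = SCAN_PTE_MAPPED_HUGEPAGE;
		goto out;
	}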

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5d7c5be9097e..a701d9f0f158 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -612,7 +612,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
-		/* See hpage_collapse_scan_pmd(). */
+		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
+
+		/* See khugepaged_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
 			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (7 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-18  2:14   ` Baolin Wang
  2025-07-14  0:32 ` [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders Nico Pache
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.
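
For readability, the scan results treated this way are collected below in
a hypothetical helper; the patch itself open-codes the list (see the hunk
below), and the claim that these results are order-independent is the
patch's own:

/* Hypothetical helper, not part of this patch. */
static bool collapse_result_is_final(int ret)
{
	switch (ret) {
	case SCAN_EXCEED_NONE_PTE:
	case SCAN_EXCEED_SWAP_PTE:
	case SCAN_EXCEED_SHARED_PTE:
	case SCAN_PTE_NON_PRESENT:
	case SCAN_PTE_UFFD_WP:
	case SCAN_ALLOC_HUGE_PAGE_FAIL:
	case SCAN_CGROUP_CHARGE_FAIL:
	case SCAN_COPY_MC:
	case SCAN_PAGE_LOCK:
	case SCAN_PAGE_COUNT:
		/* retrying a smaller order cannot succeed either */
		return true;
	default:
		return false;
	}
}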

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a701d9f0f158..7a9c4edf0e23 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1367,6 +1367,23 @@ static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
 				collapsed += (1 << order);
 				continue;
 			}
+			/*
+			 * Some ret values indicate that all lower orders will
+			 * also fail, so don't try to collapse smaller orders.
+			 */
+			if (ret == SCAN_EXCEED_NONE_PTE ||
+				ret == SCAN_EXCEED_SWAP_PTE ||
+				ret == SCAN_EXCEED_SHARED_PTE ||
+				ret == SCAN_PTE_NON_PRESENT ||
+				ret == SCAN_PTE_UFFD_WP ||
+				ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
+				ret == SCAN_CGROUP_CHARGE_FAIL ||
+				ret == SCAN_COPY_MC ||
+				ret == SCAN_PAGE_LOCK ||
+				ret == SCAN_PAGE_COUNT)
+				goto next;
+			else
+				break;
 		}
 
 next:
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (8 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-16 15:28   ` David Hildenbrand
  2025-07-14  0:32 ` [PATCH v9 11/14] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

From: Baolin Wang <baolin.wang@linux.alibaba.com>

We have now allowed mTHP collapse, but thp_vma_allowable_order() still only
checks whether the PMD-sized THP is allowed to collapse. This prevents
scanning and collapsing of 64K mTHP when only 64K mTHP is enabled. Thus,
modify the checks to allow all large orders of anonymous mTHP.
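
Condensed, the khugepaged_enter_vma() change amounts to the following
(illustrative excerpt of the hunks below):

	/* before: only the PMD order decided whether to register the mm */
	if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS, PMD_ORDER))
		__khugepaged_enter(vma->vm_mm);

	/* after: any enabled anonymous order is enough for anonymous VMAs */
	unsigned long orders = vma_is_anonymous(vma) ?
				THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
	if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS, orders))
		__khugepaged_enter(vma->vm_mm);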

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7a9c4edf0e23..3772dc0d78ea 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -491,8 +491,11 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
-					    PMD_ORDER))
+		unsigned long orders = vma_is_anonymous(vma) ?
+					THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
+
+		if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
+					    orders))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -2624,6 +2627,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
 	for_each_vma(vmi, vma) {
+		unsigned long orders = vma_is_anonymous(vma) ?
+					THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
 		unsigned long hstart, hend;
 
 		cond_resched();
@@ -2631,8 +2636,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags,
-					TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+		if (!thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_ENFORCE_SYSFS, orders)) {
 skip:
 			progress++;
 			continue;
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 11/14] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (9 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-14  0:32 ` [PATCH v9 12/14] khugepaged: improve tracepoints for mTHP orders Nico Pache
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

From: Baolin Wang <baolin.wang@linux.alibaba.com>

When only non-PMD-sized mTHP is enabled (for example, only 64K mTHP), we
should still allow kicking khugepaged so it can attempt to scan and
collapse 64K mTHP. Modify hugepage_pmd_enabled() to support mTHP collapse,
and while we are at it, rename it to make the function name clearer.
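
A worked example of the behavioral difference, assuming 4K base pages
(so the 64K size is order 4 and the PMD size is order 9):

	/*
	 * "echo always > .../hugepages-64kB/enabled", with every other size
	 * (including the PMD size) left at "never", sets only bit 4 of
	 * huge_anon_orders_always.  The old check,
	 *
	 *	test_bit(PMD_ORDER, &huge_anon_orders_always)
	 *
	 * is false, so khugepaged was never woken up; the new check,
	 *
	 *	READ_ONCE(huge_anon_orders_always)
	 *
	 * is non-zero, so khugepaged now runs and can collapse to 64K.
	 */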

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3772dc0d78ea..65cb8c58bbf8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -430,7 +430,7 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 	       test_bit(MMF_DISABLE_THP, &mm->flags);
 }
 
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
@@ -442,11 +442,11 @@ static bool hugepage_pmd_enabled(void)
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 	    hugepage_global_enabled())
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+	if (READ_ONCE(huge_anon_orders_always))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+	if (READ_ONCE(huge_anon_orders_madvise))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+	if (READ_ONCE(huge_anon_orders_inherit) &&
 	    hugepage_global_enabled())
 		return true;
 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -490,7 +490,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
-	    hugepage_pmd_enabled()) {
+	    hugepage_enabled()) {
 		unsigned long orders = vma_is_anonymous(vma) ?
 					THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
 
@@ -2714,7 +2714,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 
 static int khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
 }
 
 static int khugepaged_wait_event(void)
@@ -2787,7 +2787,7 @@ static void khugepaged_wait_work(void)
 		return;
 	}
 
-	if (hugepage_pmd_enabled())
+	if (hugepage_enabled())
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
 }
 
@@ -2818,7 +2818,7 @@ static void set_recommended_min_free_kbytes(void)
 	int nr_zones = 0;
 	unsigned long recommended_min;
 
-	if (!hugepage_pmd_enabled()) {
+	if (!hugepage_enabled()) {
 		calculate_min_free_kbytes();
 		goto update_wmarks;
 	}
@@ -2868,7 +2868,7 @@ int start_stop_khugepaged(void)
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled()) {
+	if (hugepage_enabled()) {
 		if (!khugepaged_thread)
 			khugepaged_thread = kthread_run(khugepaged, NULL,
 							"khugepaged");
@@ -2894,7 +2894,7 @@ int start_stop_khugepaged(void)
 void khugepaged_min_free_kbytes_update(void)
 {
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled() && khugepaged_thread)
+	if (hugepage_enabled() && khugepaged_thread)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 12/14] khugepaged: improve tracepoints for mTHP orders
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (10 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 11/14] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-22 15:39   ` David Hildenbrand
  2025-07-14  0:32 ` [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats Nico Pache
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

Add the order to the tracepoints to give better insight into which order
khugepaged is operating on.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    | 10 +++++----
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 2305df6cb485..70661bbf676f 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -92,34 +92,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s order=%d",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct folio *folio, int none_or_zero,
-		 int referenced, bool  writable, int status),
+		 int referenced, bool  writable, int status, int order),
 
-	TP_ARGS(folio, none_or_zero, referenced, writable, status),
+	TP_ARGS(folio, none_or_zero, referenced, writable, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
@@ -127,6 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__field(int, referenced)
 		__field(bool, writable)
 		__field(int, status)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
@@ -135,27 +139,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->referenced = referenced;
 		__entry->writable = writable;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s order=%d",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
 		__entry->writable,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+			int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
@@ -163,13 +171,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%d",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 65cb8c58bbf8..d0c99b86b304 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -711,13 +711,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-						    referenced, writable, result);
+						    referenced, writable, result,
+						    order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-					    referenced, writable, result);
+					    referenced, writable, result, order);
 	return result;
 }
 
@@ -1097,7 +1098,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+						order);
 	return result;
 }
 
@@ -1322,7 +1324,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (11 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 12/14] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-18  5:04   ` Baolin Wang
  2025-07-14  0:32 ` [PATCH v9 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  2025-07-15  0:39 ` [PATCH v9 00/14] khugepaged: mTHP support Andrew Morton
  14 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

With mTHP support in place, let's add the per-order mTHP stats for
exceeding the NONE, SWAP, and SHARED PTE limits.
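
Since the new attributes are added to anon_stats_attrs, they should show
up in each mTHP size's existing stats directory, for example (paths
assumed from the current per-size sysfs layout):

  /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_none_pte
  /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_swap_pte
  /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_shared_pte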

Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
 include/linux/huge_mm.h                    |  3 +++
 mm/huge_memory.c                           |  7 +++++++
 mm/khugepaged.c                            | 15 ++++++++++++---
 4 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 2c523dce6bc7..28c8af61efba 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -658,6 +658,23 @@ nr_anon_partially_mapped
        an anonymous THP as "partially mapped" and count it here, even though it
        is not actually partially mapped anymore.
 
+collapse_exceed_swap_pte
+       The number of anonymous THP which contain at least one swap PTE.
+       Currently khugepaged does not support collapsing mTHP regions that
+       contain a swap PTE.
+
+collapse_exceed_none_pte
+       The number of anonymous THP which have exceeded the none PTE threshold.
+       With mTHP collapse, a bitmap is used to gather the state of a PMD region
+       and is then recursively checked from largest to smallest order against
+       the scaled max_ptes_none count. This counter indicates that the next
+       enabled order will be checked.
+
+collapse_exceed_shared_pte
+       The number of anonymous THP which contain at least one shared PTE.
+       Currently khugepaged does not support collapsing mTHP regions that
+       contain a shared PTE.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4042078e8cc9..e0a27f80f390 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -141,6 +141,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
 	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+	MTHP_STAT_COLLAPSE_EXCEED_NONE,
+	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2ed9493df77..57e5699cf638 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -632,6 +632,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -648,6 +652,9 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
 	&nr_anon_partially_mapped_attr.attr,
+	&collapse_exceed_swap_pte_attr.attr,
+	&collapse_exceed_none_pte_attr.attr,
+	&collapse_exceed_shared_pte_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d0c99b86b304..8a5873d0a23a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -594,7 +594,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				if (order == HPAGE_PMD_ORDER)
+					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				else
+					count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out;
 			}
 		}
@@ -623,8 +626,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See khugepaged_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared)) {
+			if (order != HPAGE_PMD_ORDER) {
+				result = SCAN_EXCEED_SHARED_PTE;
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+				goto out;
+			}
+
+			if (cc->is_khugepaged &&
+				shared > khugepaged_max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v9 14/14] Documentation: mm: update the admin guide for mTHP collapse
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (12 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats Nico Pache
@ 2025-07-14  0:32 ` Nico Pache
  2025-07-15  0:39 ` [PATCH v9 00/14] khugepaged: mTHP support Andrew Morton
  14 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-14  0:32 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	Bagas Sanjaya

Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to utilize it.
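
To make the scaling concrete, a worked example assuming 4K base pages
(HPAGE_PMD_ORDER == 9, so a 64K mTHP is order 4), using the threshold
computed in the isolate path:

	scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

	/*
	 * max_ptes_none = 511: order 9 -> 511, order 4 -> 15, i.e. 15 of the
	 * 16 PTEs in a 64K region may be empty, so almost any occupied region
	 * collapses and the next scan immediately qualifies for the next
	 * enabled order ("creep").
	 *
	 * max_ptes_none = 255: order 4 -> 7, so more than half of the 16 PTEs
	 * must already be present, which breaks the feedback loop.
	 */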

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 28c8af61efba..bd49b46398c9 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -144,6 +144,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
 sizes, the kernel will select the most appropriate enabled size for a
 given allocation.
 
+khugepaged uses max_ptes_none, scaled to the attempted collapse order, to
+determine whether to collapse. When using mTHPs it's recommended to set
+max_ptes_none low, ideally less than HPAGE_PMD_NR / 2 (255 with a 4K page
+size). This prevents undesired "creep" behavior that leads to
+continuously collapsing to the largest mTHP size; when we collapse, we
+bring in new non-zero pages that will, on a subsequent scan, cause the
+max_ptes_none check of the +1 order to always be satisfied. By limiting
+this to less than half the PTEs of the current order, we make sure we
+don't cause this feedback loop. max_ptes_shared and max_ptes_swap have no
+effect when collapsing to a mTHP, and mTHP collapse will fail on shared
+or swapped-out pages.
+
 It's also possible to limit defrag efforts in the VM to generate
 anonymous hugepages in case they're not immediately free to madvise
 regions or to never try to defrag memory and simply fallback to regular
@@ -221,11 +233,6 @@ top-level control are "never")
 Khugepaged controls
 -------------------
 
-.. note::
-   khugepaged currently only searches for opportunities to collapse to
-   PMD-sized THP and no attempt is made to collapse to other THP
-   sizes.
-
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 00/14] khugepaged: mTHP support
  2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
                   ` (13 preceding siblings ...)
  2025-07-14  0:32 ` [PATCH v9 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-07-15  0:39 ` Andrew Morton
  14 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2025-07-15  0:39 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, baohua,
	willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Sun, 13 Jul 2025 18:31:53 -0600 Nico Pache <npache@redhat.com> wrote:

> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.

Thanks.  I added this to mm.git's mm-new branch.  I suppressed the
usual emails to save 532 of them.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2025-07-15 15:39   ` David Hildenbrand
  2025-07-16 14:29   ` Liam R. Howlett
  2025-07-25 16:43   ` Lorenzo Stoakes
  2 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-07-15 15:39 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> The hpage_collapse functions describe functions used by madvise_collapse
> and khugepaged. remove the unnecessary hpage prefix to shorten the
> function name.
> 
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-07-14  0:31 ` [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-07-15 15:53   ` David Hildenbrand
  2025-07-23  1:56     ` Nico Pache
  2025-07-16 15:12   ` Liam R. Howlett
  1 sibling, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-07-15 15:53 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
> 
> Create collapse_single_pmd to increase code reuse and create an entry
> point to these two users.
> 
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that is most likely an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 95 +++++++++++++++++++++++++------------------------
>   1 file changed, 49 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index eb0babb51868..47a80638af97 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2362,6 +2362,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>   	return result;
>   }
>   
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int collapse_single_pmd(unsigned long addr,
> +				   struct vm_area_struct *vma, bool *mmap_locked,
> +				   struct collapse_control *cc)

Nit: we tend to use two-tabs indent here.

Nice cleanup!

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 03/14] khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2025-07-14  0:31 ` [PATCH v9 03/14] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-07-15 15:55   ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-07-15 15:55 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> For khugepaged to support different mTHP orders, we must generalize this
> to check if the PMD is not shared by another VMA and the order is enabled.
> 
> To ensure madvise_collapse can support working on mTHP orders without the
> PMD order enabled, we need to convert hugepage_vma_revalidate to take a
> bitmap of orders.
> 
> No functional change in this patch.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio()
  2025-07-14  0:31 ` [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2025-07-16 13:46   ` David Hildenbrand
  2025-07-17  7:22     ` Nico Pache
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 13:46 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> From: Dev Jain <dev.jain@arm.com>
> 
> Pass order to alloc_charge_folio() and update mTHP statistics.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
>   include/linux/huge_mm.h                    |  2 ++
>   mm/huge_memory.c                           |  4 ++++
>   mm/khugepaged.c                            | 17 +++++++++++------
>   4 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index dff8d5985f0f..2c523dce6bc7 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -583,6 +583,14 @@ anon_fault_fallback_charge
>   	instead falls back to using huge pages with lower orders or
>   	small pages even though the allocation was successful.
>   
> +collapse_alloc
> +	is incremented every time a huge page is successfully allocated for a
> +	khugepaged collapse.
> +
> +collapse_alloc_failed
> +	is incremented every time a huge page allocation fails during a
> +	khugepaged collapse.
> +
>   zswpout
>   	is incremented every time a huge page is swapped out to zswap in one
>   	piece without splitting.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7748489fde1b..4042078e8cc9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -125,6 +125,8 @@ enum mthp_stat_item {
>   	MTHP_STAT_ANON_FAULT_ALLOC,
>   	MTHP_STAT_ANON_FAULT_FALLBACK,
>   	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> +	MTHP_STAT_COLLAPSE_ALLOC,
> +	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
>   	MTHP_STAT_ZSWPOUT,
>   	MTHP_STAT_SWPIN,
>   	MTHP_STAT_SWPIN_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bd7a623d7ef8..e2ed9493df77 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -614,6 +614,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>   DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
>   DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
>   DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
>   DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
>   DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
>   DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
> @@ -679,6 +681,8 @@ static struct attribute *any_stats_attrs[] = {
>   #endif
>   	&split_attr.attr,
>   	&split_failed_attr.attr,
> +	&collapse_alloc_attr.attr,
> +	&collapse_alloc_failed_attr.attr,
>   	NULL,
>   };
>   
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fa0642e66790..cc9a35185604 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1068,21 +1068,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   }
>   
>   static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, u8 order)

u8, really? :)

Just use an "unsigned int" like folio_order() would or what 
__folio_alloc() consumes.



Apart from that

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-14  0:31 ` [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2025-07-16 13:52   ` David Hildenbrand
  2025-07-17  7:22     ` Nico Pache
  2025-07-16 14:02   ` David Hildenbrand
  2025-07-25 16:09   ` Lorenzo Stoakes
  2 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 13:52 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
> 
> mTHP collapse can suffer from inconsistent behavior, and memory waste
> "creep". Disable swapin and shared support for mTHP collapse.
> 
> No functional changes in this patch.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 49 +++++++++++++++++++++++++++++++------------------
>   1 file changed, 31 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cc9a35185604..ee54e3c1db4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -552,15 +552,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   					unsigned long address,
>   					pte_t *pte,
>   					struct collapse_control *cc,
> -					struct list_head *compound_pagelist)
> +					struct list_head *compound_pagelist,
> +					u8 order)

u8 ... (applies to all instances)

>   {
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	pte_t *_pte;
>   	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>   	bool writable = false;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

"scaled_max_ptes_none" maybe?

>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>   	     _pte++, address += PAGE_SIZE) {
>   		pte_t pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -568,7 +570,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   			++none_or_zero;
>   			if (!userfaultfd_armed(vma) &&
>   			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			     none_or_zero <= scaled_none)) {
>   				continue;
>   			} else {
>   				result = SCAN_EXCEED_NONE_PTE;
> @@ -596,8 +598,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		/* See hpage_collapse_scan_pmd(). */
>   		if (folio_maybe_mapped_shared(folio)) {
>   			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared)) {

Please add a comment explaining why we do something different with PMD. As
commented below, does this deserve a TODO?

>   				result = SCAN_EXCEED_SHARED_PTE;
>   				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>   				goto out;
> @@ -698,13 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>   						struct vm_area_struct *vma,
>   						unsigned long address,
>   						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +						struct list_head *compound_pagelist,
> +						u8 order)
>   {
>   	struct folio *src, *tmp;
>   	pte_t *_pte;
>   	pte_t pteval;
>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>   	     _pte++, address += PAGE_SIZE) {
>   		pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> @@ -751,7 +754,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   					     pmd_t *pmd,
>   					     pmd_t orig_pmd,
>   					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +					     struct list_head *compound_pagelist,
> +					     u8 order)
>   {
>   	spinlock_t *pmd_ptl;
>   
> @@ -768,7 +772,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   	 * Release both raw and compound pages isolated
>   	 * in __collapse_huge_page_isolate.
>   	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
>   }
>   
>   /*
> @@ -789,7 +793,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>   		unsigned long address, spinlock_t *ptl,
> -		struct list_head *compound_pagelist)
> +		struct list_head *compound_pagelist, u8 order)
>   {
>   	unsigned int i;
>   	int result = SCAN_SUCCEED;
> @@ -797,7 +801,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   	/*
>   	 * Copying pages' contents is subject to memory poison at any iteration.
>   	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < (1 << order); i++) {
>   		pte_t pteval = ptep_get(pte + i);
>   		struct page *page = folio_page(folio, i);
>   		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -816,10 +820,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   
>   	if (likely(result == SCAN_SUCCEED))
>   		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    compound_pagelist, order);
>   	else
>   		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 compound_pagelist, order);
>   
>   	return result;
>   }
> @@ -994,11 +998,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>   static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   				       struct vm_area_struct *vma,
>   				       unsigned long haddr, pmd_t *pmd,
> -				       int referenced)
> +				       int referenced, u8 order)
>   {
>   	int swapped_in = 0;
>   	vm_fault_t ret = 0;
> -	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long address, end = haddr + (PAGE_SIZE << order);
>   	int result;
>   	pte_t *pte = NULL;
>   	spinlock_t *ptl;
> @@ -1029,6 +1033,15 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   		if (!is_swap_pte(vmf.orig_pte))
>   			continue;
>   
> +		/* Dont swapin for mTHP collapse */

Should we turn this into a TODO, because it's something to figure out 
regarding the scaling etc?

> +		if (order != HPAGE_PMD_ORDER) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +			pte_unmap(pte);
> +			mmap_read_unlock(mm);
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}
> +
>   		vmf.pte = pte;
>   		vmf.ptl = ptl;
>   		ret = do_swap_page(&vmf);
> @@ -1149,7 +1162,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   		 * that case.  Continuing to collapse causes inconsistency.
>   		 */
>   		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced);
 > +				referenced, HPAGE_PMD_ORDER);

Indent messed up. Feel free to exceed 80 chars if it aids readability.

>   		if (result != SCAN_SUCCEED)
>   			goto out_nolock;
>   	}
> @@ -1197,7 +1210,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>   	if (pte) {
>   		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      &compound_pagelist);
> +					&compound_pagelist, HPAGE_PMD_ORDER);

Dito.


Apart from that, nothing jumped at me

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-14  0:31 ` [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
  2025-07-16 13:52   ` David Hildenbrand
@ 2025-07-16 14:02   ` David Hildenbrand
  2025-07-17  7:23     ` Nico Pache
  2025-07-17 15:54     ` Lorenzo Stoakes
  2025-07-25 16:09   ` Lorenzo Stoakes
  2 siblings, 2 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 14:02 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
> 
> mTHP collapse can suffer from inconsistent behavior, and memory waste
> "creep". Disable swapin and shared support for mTHP collapse.
> 
> No functional changes in this patch.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 49 +++++++++++++++++++++++++++++++------------------
>   1 file changed, 31 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cc9a35185604..ee54e3c1db4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -552,15 +552,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   					unsigned long address,
>   					pte_t *pte,
>   					struct collapse_control *cc,
> -					struct list_head *compound_pagelist)
> +					struct list_head *compound_pagelist,
> +					u8 order)
>   {
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	pte_t *_pte;
>   	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>   	bool writable = false;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>   	     _pte++, address += PAGE_SIZE) {
>   		pte_t pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -568,7 +570,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   			++none_or_zero;
>   			if (!userfaultfd_armed(vma) &&
>   			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			     none_or_zero <= scaled_none)) {
>   				continue;
>   			} else {
>   				result = SCAN_EXCEED_NONE_PTE;
> @@ -596,8 +598,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		/* See hpage_collapse_scan_pmd(). */
>   		if (folio_maybe_mapped_shared(folio)) {
>   			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared)) {
>   				result = SCAN_EXCEED_SHARED_PTE;
>   				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>   				goto out;
> @@ -698,13 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>   						struct vm_area_struct *vma,
>   						unsigned long address,
>   						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +						struct list_head *compound_pagelist,
> +						u8 order)
>   {
>   	struct folio *src, *tmp;
>   	pte_t *_pte;
>   	pte_t pteval;
>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>   	     _pte++, address += PAGE_SIZE) {
>   		pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> @@ -751,7 +754,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   					     pmd_t *pmd,
>   					     pmd_t orig_pmd,
>   					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +					     struct list_head *compound_pagelist,
> +					     u8 order)
>   {
>   	spinlock_t *pmd_ptl;
>   
> @@ -768,7 +772,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   	 * Release both raw and compound pages isolated
>   	 * in __collapse_huge_page_isolate.
>   	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
>   }
>   
>   /*
> @@ -789,7 +793,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>   		unsigned long address, spinlock_t *ptl,
> -		struct list_head *compound_pagelist)
> +		struct list_head *compound_pagelist, u8 order)
>   {
>   	unsigned int i;
>   	int result = SCAN_SUCCEED;
> @@ -797,7 +801,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   	/*
>   	 * Copying pages' contents is subject to memory poison at any iteration.
>   	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < (1 << order); i++) {
>   		pte_t pteval = ptep_get(pte + i);
>   		struct page *page = folio_page(folio, i);
>   		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -816,10 +820,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   
>   	if (likely(result == SCAN_SUCCEED))
>   		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    compound_pagelist, order);
>   	else
>   		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 compound_pagelist, order);
>   
>   	return result;
>   }
> @@ -994,11 +998,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>   static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   				       struct vm_area_struct *vma,
>   				       unsigned long haddr, pmd_t *pmd,
> -				       int referenced)
> +				       int referenced, u8 order)
>   {
>   	int swapped_in = 0;
>   	vm_fault_t ret = 0;
> -	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long address, end = haddr + (PAGE_SIZE << order);
>   	int result;
>   	pte_t *pte = NULL;
>   	spinlock_t *ptl;
> @@ -1029,6 +1033,15 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   		if (!is_swap_pte(vmf.orig_pte))
>   			continue;
>   
> +		/* Dont swapin for mTHP collapse */
> +		if (order != HPAGE_PMD_ORDER) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);

Doesn't compile: MTHP_STAT_COLLAPSE_EXCEED_SWAP is only introduced later in
this series.

Using something like

git rebase -i mm/mm-unstable --exec "make -j16"

You can efficiently make sure that individual patches compile cleanly.
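(The rebase stops at the first commit whose build fails; after fixing that
patch up, e.g. with "git commit --amend", running "git rebase --continue"
resumes and keeps build-testing the rest of the series.)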

-- 
Cheers,

David / dhildenb




* Re: [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap for mTHP support
  2025-07-14  0:31 ` [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap " Nico Pache
@ 2025-07-16 14:03   ` David Hildenbrand
  2025-07-17  7:23     ` Nico Pache
  2025-07-16 15:38   ` Liam R. Howlett
  1 sibling, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 14:03 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:31, Nico Pache wrote:
> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> To add mTHP support we use this scan to instead record chunks of utilized
> sections of the PMD.
> 
> collapse_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the anon PMD. A minimum collapse order of 2 is used as this is
> the lowest order supported by anon memory.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> When attempting to collapse an order that is set to "always", always
> collapse to that order in a greedy manner without considering the
> number of bits set.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

Any reason this should not be squashed into the actual mTHP collapse patch?

In particular

a) The locking changes look weird without the bigger context

b) The compiler complains about unused functions

> ---
>   include/linux/khugepaged.h |  4 ++
>   mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>   2 files changed, 89 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index ff6120463745..0f957711a117 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   #ifndef _LINUX_KHUGEPAGED_H
>   #define _LINUX_KHUGEPAGED_H
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +#define KHUGEPAGED_MIN_MTHP_NR	(1<<KHUGEPAGED_MIN_MTHP_ORDER)

"1 << "

> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>   
>   extern unsigned int khugepaged_max_ptes_none __read_mostly;
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ee54e3c1db4e..59b2431ca616 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>   
>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>   
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>   struct collapse_control {
>   	bool is_khugepaged;
>   
> @@ -102,6 +107,18 @@ struct collapse_control {
>   
>   	/* nodemask for allocation fallback */
>   	nodemask_t alloc_nmask;
> +
> +	/*
> +	 * bitmap used to collapse mTHP sizes.
> +	 * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> +	 */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>   };
>   
>   /**
> @@ -838,10 +855,6 @@ static void khugepaged_alloc_sleep(void)
>   	remove_wait_queue(&khugepaged_wait, &wait);
>   }
>   
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>   static bool collapse_scan_abort(int nid, struct collapse_control *cc)
>   {
>   	int i;
> @@ -1115,7 +1128,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>   
>   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  u8 order, u16 offset)

Indent broken.

>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> @@ -1134,8 +1148,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * The allocation can take potentially a long time if it involves
>   	 * sync compaction, and we do not need to hold the mmap_lock during
>   	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>   	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>   
>   	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)
> @@ -1272,12 +1290,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   out_up_write:
>   	mmap_write_unlock(mm);
>   out_nolock:
> +	*mmap_locked = false;
>   	if (folio)
>   		folio_put(folio);
>   	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>   	return result;
>   }
>   
> +/* Recursive function to consume the bitmap */
> +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled


/* */
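
i.e. something like

	/* Skip mTHP orders that are not enabled */

for the line quoted above.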

Same applies to the other instances.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
  2025-07-15 15:39   ` David Hildenbrand
@ 2025-07-16 14:29   ` Liam R. Howlett
  2025-07-16 15:20     ` David Hildenbrand
  2025-07-17  7:21     ` Nico Pache
  2025-07-25 16:43   ` Lorenzo Stoakes
  2 siblings, 2 replies; 51+ messages in thread
From: Liam R. Howlett @ 2025-07-16 14:29 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, lorenzo.stoakes, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

* Nico Pache <npache@redhat.com> [250713 20:33]:
> The hpage_collapse functions describe functions used by madvise_collapse
> and khugepaged. Remove the unnecessary hpage prefix to shorten the
> function names.
> 
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>


This is funny.  I suggested this sort of thing in v7 but you said that
David H. said what to do, but then in v8 there was a discussion where
David said differently..

Yes, I much prefer dropping the prefix that is already implied by the
file for static inline functions than anything else from the names.

Thanks, this looks nicer.


> ---
>  mm/khugepaged.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a55fb1dcd224..eb0babb51868 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
>  	kmem_cache_destroy(mm_slot_cache);
>  }
>  
> -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> +static inline int collapse_test_exit(struct mm_struct *mm)
>  {
>  	return atomic_read(&mm->mm_users) == 0;
>  }

...

> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> +static int collapse_scan_pmd(struct mm_struct *mm,
>  				   struct vm_area_struct *vma,
>  				   unsigned long address, bool *mmap_locked,
>  				   struct collapse_control *cc)

One thing I noticed here.

Usually we try to do two tab indents on arguments because it allows for
fewer lines and less churn on argument list edits.

That is, if you have two tabs then it does not line up with the code
below and allows more arguments on the same line.

It also means that if the name changes, then you don't have to change
the white space of the argument list.

On that note, the spacing is now off where the names changed, but this
isn't a huge deal and I suspect it changes later anyways?  Anyways, this
is more of a nit than anything.. The example above looks like it didn't
line up to begin with.
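
Concretely, with two-tab indents the signature above would look something
like:

	static int collapse_scan_pmd(struct mm_struct *mm,
			struct vm_area_struct *vma, unsigned long address,
			bool *mmap_locked, struct collapse_control *cc)

so a later rename of the function does not force reflowing the argument
list.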

...

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>



* Re: [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders
  2025-07-14  0:32 ` [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-07-16 14:32   ` David Hildenbrand
  2025-07-17  7:24     ` Nico Pache
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 14:32 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:32, Nico Pache wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. Skip these cases until we have a way to check
> if it's ok to collapse to a smaller mTHP size (like in the case of a
> partially mapped folio).
> 
> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
> 
> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5d7c5be9097e..a701d9f0f158 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -612,7 +612,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		folio = page_folio(page);
>   		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>   
> -		/* See hpage_collapse_scan_pmd(). */
> +		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> +			result = SCAN_PTE_MAPPED_HUGEPAGE;
> +			goto out;
> +		}

Probably worth adding a TODO in the code like

/*
  * TODO: In some cases of partially-mapped folios, we'd actually
  * want to collapse.
  */

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-07-14  0:31 ` [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
  2025-07-15 15:53   ` David Hildenbrand
@ 2025-07-16 15:12   ` Liam R. Howlett
  2025-07-23  1:55     ` Nico Pache
  1 sibling, 1 reply; 51+ messages in thread
From: Liam R. Howlett @ 2025-07-16 15:12 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, lorenzo.stoakes, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

* Nico Pache <npache@redhat.com> [250713 20:33]:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
> 
> Create collapse_single_pmd to increase code reuse and create an entry
> point to these two users.
> 
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that most likely fixes an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 95 +++++++++++++++++++++++++------------------------
>  1 file changed, 49 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index eb0babb51868..47a80638af97 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2362,6 +2362,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>  
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int collapse_single_pmd(unsigned long addr,
> +				   struct vm_area_struct *vma, bool *mmap_locked,
> +				   struct collapse_control *cc)
> +{
> +	int result = SCAN_FAIL;
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (!vma_is_anonymous(vma)) {
> +		struct file *file = get_file(vma->vm_file);
> +		pgoff_t pgoff = linear_page_index(vma, addr);
> +
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;

Okay, just for my sanity, when we reach this part.. mmap_locked will
be false on return.  Because we set it a bunch more below.. but it's
always false on return.

Although this is a cleaner implementation of the lock, I'm just not sure
why you keep flipping the mmap_locked variable here?  We could probably
get away with comments that it will always be false.


> +		result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +		fput(file);
> +		if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +			mmap_read_lock(mm);
> +			*mmap_locked = true;
> +			if (collapse_test_exit_or_disable(mm)) {
> +				mmap_read_unlock(mm);
> +				*mmap_locked = false;
> +				result = SCAN_ANY_PROCESS;
> +				goto end;
> +			}
> +			result = collapse_pte_mapped_thp(mm, addr,
> +							 !cc->is_khugepaged);
> +			if (result == SCAN_PMD_MAPPED)
> +				result = SCAN_SUCCEED;
> +			mmap_read_unlock(mm);
> +			*mmap_locked = false;
> +		}
> +	} else {
> +		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> +	}
> +	if (cc->is_khugepaged && result == SCAN_SUCCEED)
> +		++khugepaged_pages_collapsed;
> +end:
> +	return result;
> +}
> +
>  static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>  					    struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
> @@ -2436,34 +2480,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
>  				  hend);
> -			if (!vma_is_anonymous(vma)) {
> -				struct file *file = get_file(vma->vm_file);
> -				pgoff_t pgoff = linear_page_index(vma,
> -						khugepaged_scan.address);
> -
> -				mmap_read_unlock(mm);
> -				mmap_locked = false;
> -				*result = hpage_collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> -				fput(file);
> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> -					mmap_read_lock(mm);
> -					if (hpage_collapse_test_exit_or_disable(mm))
> -						goto breakouterloop;
> -					*result = collapse_pte_mapped_thp(mm,
> -						khugepaged_scan.address, false);
> -					if (*result == SCAN_PMD_MAPPED)
> -						*result = SCAN_SUCCEED;
> -					mmap_read_unlock(mm);
> -				}
> -			} else {
> -				*result = hpage_collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> -			}
> -
> -			if (*result == SCAN_SUCCEED)
> -				++khugepaged_pages_collapsed;
>  
> +			*result = collapse_single_pmd(khugepaged_scan.address,
> +						vma, &mmap_locked, cc);
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
>  			progress += HPAGE_PMD_NR;
> @@ -2780,35 +2799,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  		mmap_assert_locked(mm);
>  		memset(cc->node_load, 0, sizeof(cc->node_load));
>  		nodes_clear(cc->alloc_nmask);
> -		if (!vma_is_anonymous(vma)) {
> -			struct file *file = get_file(vma->vm_file);
> -			pgoff_t pgoff = linear_page_index(vma, addr);
>  
> -			mmap_read_unlock(mm);
> -			mmap_locked = false;
> -			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> -							  cc);
> -			fput(file);
> -		} else {
> -			result = hpage_collapse_scan_pmd(mm, vma, addr,
> -							 &mmap_locked, cc);
> -		}
> +		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> +
>  		if (!mmap_locked)
>  			*lock_dropped = true;

All of this locking is scary, because there are comments everywhere that
imply that mmap_locked indicates that the lock was dropped at some
point, but we are using it to indicate that the lock is currently held -
which are very different things..

Here, for example, lock_dropped may not be set to true even though we
have toggled it through collapse_single_pmd() -> collapse_scan_pmd() ->
... -> collapse_huge_page().

Maybe these scenarios are safe because of known limitations of what will
or will not happen, but the code paths existing without a comment about
why it is safe seems like a good way to introduce races later.

>  
> -handle_result:
>  		switch (result) {
>  		case SCAN_SUCCEED:
>  		case SCAN_PMD_MAPPED:
>  			++thps;
>  			break;
> -		case SCAN_PTE_MAPPED_HUGEPAGE:
> -			BUG_ON(mmap_locked);
> -			mmap_read_lock(mm);
> -			result = collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_read_unlock(mm);
> -			goto handle_result;
>  		/* Whitelisted set of results where continuing OK */
> +		case SCAN_PTE_MAPPED_HUGEPAGE:
>  		case SCAN_PMD_NULL:
>  		case SCAN_PTE_NON_PRESENT:
>  		case SCAN_PTE_UFFD_WP:
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-16 14:29   ` Liam R. Howlett
@ 2025-07-16 15:20     ` David Hildenbrand
  2025-07-17  7:21     ` Nico Pache
  1 sibling, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 15:20 UTC (permalink / raw)
  To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

On 16.07.25 16:29, Liam R. Howlett wrote:
> * Nico Pache <npache@redhat.com> [250713 20:33]:
>> The hpage_collapse functions describe functions used by madvise_collapse
>> and khugepaged. Remove the unnecessary hpage prefix to shorten the
>> function names.
>>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> 
> This is funny.  I suggested this sort of thing in v7 but you said that
> David H. said what to do, but then in v8 there was a discussion where
> David said differently..

Me recommending something that doesn't make sense? That's unpossible!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders
  2025-07-14  0:32 ` [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders Nico Pache
@ 2025-07-16 15:28   ` David Hildenbrand
  2025-07-17  7:25     ` Nico Pache
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-07-16 15:28 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:32, Nico Pache wrote:
> From: Baolin Wang <baolin.wang@linux.alibaba.com>

Should the subject better be

"mm/khugepaged: enable collapsing mTHPs even when PMD THPs are disabled"

(in general, I assume all subjects should be prefixed by "mm/khugepaged:")

> 
> We have now allowed mTHP collapse, but thp_vma_allowable_order() still only
> checks if the PMD-sized mTHP is allowed to collapse. This prevents scanning
> and collapsing of 64K mTHP when only 64K mTHP is enabled. Thus, we should
> modify the checks to allow all large orders of anonymous mTHP.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7a9c4edf0e23..3772dc0d78ea 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -491,8 +491,11 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>   {
>   	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
>   	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> -					    PMD_ORDER))
> +		unsigned long orders = vma_is_anonymous(vma) ?
> +					THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
> +
> +		if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
> +					    orders))
>   			__khugepaged_enter(vma->vm_mm);
>   	}
>   }
> @@ -2624,6 +2627,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>   
>   	vma_iter_init(&vmi, mm, khugepaged_scan.address);
>   	for_each_vma(vmi, vma) {
> +		unsigned long orders = vma_is_anonymous(vma) ?
> +					THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
>   		unsigned long hstart, hend;
>   
>   		cond_resched();
> @@ -2631,8 +2636,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>   			progress++;
>   			break;
>   		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags,
> -					TVA_ENFORCE_SYSFS, PMD_ORDER)) {
> +		if (!thp_vma_allowable_orders(vma, vma->vm_flags,
> +			TVA_ENFORCE_SYSFS, orders)) {
>   skip:
>   			progress++;
>   			continue;

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap for mTHP support
  2025-07-14  0:31 ` [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap " Nico Pache
  2025-07-16 14:03   ` David Hildenbrand
@ 2025-07-16 15:38   ` Liam R. Howlett
  2025-07-17  7:24     ` Nico Pache
  1 sibling, 1 reply; 51+ messages in thread
From: Liam R. Howlett @ 2025-07-16 15:38 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, lorenzo.stoakes, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

* Nico Pache <npache@redhat.com> [250713 20:34]:
> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> To add mTHP support we use this scan to instead record chunks of utilized
> sections of the PMD.
> 
> collapse_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the anon PMD. A minimum collapse order of 2 is used as this is
> the lowest order supported by anon memory.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> When attempting to collapse an order that is set to "always", always
> collapse to that order in a greedy manner without considering the
> number of bits set.
> 

v7 had some discussion about adding selftests for this code.  You mention
running the mm selftests in the cover letter, but it seems you did not add
the reproducer that Baolin had?

Maybe I missed that?

> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/khugepaged.h |  4 ++
>  mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>  2 files changed, 89 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index ff6120463745..0f957711a117 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +#define KHUGEPAGED_MIN_MTHP_NR	(1<<KHUGEPAGED_MIN_MTHP_ORDER)
> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>  
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ee54e3c1db4e..59b2431ca616 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>  
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> @@ -102,6 +107,18 @@ struct collapse_control {
>  
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/*
> +	 * bitmap used to collapse mTHP sizes.
> +	 * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> +	 */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>  };
>  
>  /**
> @@ -838,10 +855,6 @@ static void khugepaged_alloc_sleep(void)
>  	remove_wait_queue(&khugepaged_wait, &wait);
>  }
>  
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>  static bool collapse_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
> @@ -1115,7 +1128,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  
>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  u8 order, u16 offset)
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> @@ -1134,8 +1148,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>  	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>  
>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
> @@ -1272,12 +1290,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  out_up_write:
>  	mmap_write_unlock(mm);
>  out_nolock:
> +	*mmap_locked = false;
>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>  	return result;
>  }
>  
> +/* Recursive function to consume the bitmap */
> +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled
> +		if (!test_bit(order, &enabled_orders))
> +			goto next;
> +
> +		// copy the relavant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> +				>> (HPAGE_PMD_ORDER - state.order);
> +
> +		//Check if the region is "almost full" based on the threshold
> +		if (bits_set > threshold_bits || is_pmd_only
> +			|| test_bit(order, &huge_anon_orders_always)) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += (1 << order);
> +				continue;
> +			}
> +		}
> +
> +next:
> +		if (state.order > 0) {
> +			next_order = state.order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}
> +
>  static int collapse_scan_pmd(struct mm_struct *mm,
>  				   struct vm_area_struct *vma,
>  				   unsigned long address, bool *mmap_locked,
> @@ -1444,9 +1522,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
> -- 
> 2.50.0
> 



* Re: [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-16 14:29   ` Liam R. Howlett
  2025-07-16 15:20     ` David Hildenbrand
@ 2025-07-17  7:21     ` Nico Pache
  1 sibling, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:21 UTC (permalink / raw)
  To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, david, ziy, baolin.wang, lorenzo.stoakes,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 8:30 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Nico Pache <npache@redhat.com> [250713 20:33]:
> > The hpage_collapse functions describe functions used by madvise_collapse
> > and khugepaged. Remove the unnecessary hpage prefix to shorten the
> > function names.
> >
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
>
> This is funny.  I suggested this sort of thing in v7 but you said that
> David H. said what to do, but then in v8 there was a discussion where
> David said differently..
Haha yes, I'm sorry, I honestly misunderstood your request to mean
"drop hpage_collapse", not just "hpage". In a meeting with David early
on in this work he recommended renaming these. Dev made a good point
that renaming these to khugepaged would be a revert of a previous commit.
>
> Yes, I much prefer dropping the prefix that is already implied by the
> file for static inline functions than anything else from the names.
>
> Thanks, this looks nicer.
I agree, thanks!
>
>
> > ---
> >  mm/khugepaged.c | 46 +++++++++++++++++++++++-----------------------
> >  1 file changed, 23 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index a55fb1dcd224..eb0babb51868 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
> >       kmem_cache_destroy(mm_slot_cache);
> >  }
> >
> > -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> > +static inline int collapse_test_exit(struct mm_struct *mm)
> >  {
> >       return atomic_read(&mm->mm_users) == 0;
> >  }
>
> ...
>
> > -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > +static int collapse_scan_pmd(struct mm_struct *mm,
> >                                  struct vm_area_struct *vma,
> >                                  unsigned long address, bool *mmap_locked,
> >                                  struct collapse_control *cc)
>
> One thing I noticed here.
>
> Usually we try to do two tab indents on arguments because it allows for
> less lines and less churn on argument list edits.
>
> That is, if you have two tabs then it does not line up with the code
> below and allows more arguments on the same line.
>
> It also means that if the name changes, then you don't have to change
> the white space of the argument list.
>
> On that note, the spacing is now off where the names changed, but this
> isn't a huge deal and I suspect it changes later anyways?  Anyways, this
> is more of a nit than anything.. The example above looks like it didn't
> line up to begin with.
I went through and cleaned these up, both on this patch and future
patches that had similar indentation issues.
>
> ...
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Thanks for your review!
>




* Re: [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio()
  2025-07-16 13:46   ` David Hildenbrand
@ 2025-07-17  7:22     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:31, Nico Pache wrote:
> > From: Dev Jain <dev.jain@arm.com>
> >
> > Pass order to alloc_charge_folio() and update mTHP statistics.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Co-developed-by: Nico Pache <npache@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> >   Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
> >   include/linux/huge_mm.h                    |  2 ++
> >   mm/huge_memory.c                           |  4 ++++
> >   mm/khugepaged.c                            | 17 +++++++++++------
> >   4 files changed, 25 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index dff8d5985f0f..2c523dce6bc7 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -583,6 +583,14 @@ anon_fault_fallback_charge
> >       instead falls back to using huge pages with lower orders or
> >       small pages even though the allocation was successful.
> >
> > +collapse_alloc
> > +     is incremented every time a huge page is successfully allocated for a
> > +     khugepaged collapse.
> > +
> > +collapse_alloc_failed
> > +     is incremented every time a huge page allocation fails during a
> > +     khugepaged collapse.
> > +
> >   zswpout
> >       is incremented every time a huge page is swapped out to zswap in one
> >       piece without splitting.
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 7748489fde1b..4042078e8cc9 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -125,6 +125,8 @@ enum mthp_stat_item {
> >       MTHP_STAT_ANON_FAULT_ALLOC,
> >       MTHP_STAT_ANON_FAULT_FALLBACK,
> >       MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> > +     MTHP_STAT_COLLAPSE_ALLOC,
> > +     MTHP_STAT_COLLAPSE_ALLOC_FAILED,
> >       MTHP_STAT_ZSWPOUT,
> >       MTHP_STAT_SWPIN,
> >       MTHP_STAT_SWPIN_FALLBACK,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bd7a623d7ef8..e2ed9493df77 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -614,6 +614,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> >   DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
> >   DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> >   DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> > +DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
> > +DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> >   DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
> >   DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
> >   DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
> > @@ -679,6 +681,8 @@ static struct attribute *any_stats_attrs[] = {
> >   #endif
> >       &split_attr.attr,
> >       &split_failed_attr.attr,
> > +     &collapse_alloc_attr.attr,
> > +     &collapse_alloc_failed_attr.attr,
> >       NULL,
> >   };
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fa0642e66790..cc9a35185604 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1068,21 +1068,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >   }
> >
> >   static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, u8 order)
>
> u8, really? :)
At the time I knew I was going to use u8s at the bitmap level, so I
thought I should have them here too. But you are right; I went through
and cleaned up all the u8 usage with the exception of the actual
bitmap storage.
>
> Just use an "unsigned int" like folio_order() would or what
> __folio_alloc() consumes.
>
>
>
> Apart from that
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thank you!

>
> --
> Cheers,
>
> David / dhildenb
>




* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-16 13:52   ` David Hildenbrand
@ 2025-07-17  7:22     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 7:53 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:31, Nico Pache wrote:
> > Generalize the order of the __collapse_huge_page_* functions
> > to support future mTHP collapse.
> >
> > mTHP collapse can suffer from inconsistent behavior and memory waste
> > "creep". Disable swapin and shared support for mTHP collapse.
> >
> > No functional changes in this patch.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 49 +++++++++++++++++++++++++++++++------------------
> >   1 file changed, 31 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index cc9a35185604..ee54e3c1db4e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -552,15 +552,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                                       unsigned long address,
> >                                       pte_t *pte,
> >                                       struct collapse_control *cc,
> > -                                     struct list_head *compound_pagelist)
> > +                                     struct list_head *compound_pagelist,
> > +                                     u8 order)
>
> u8 ... (applies to all instances)
Fixed all instances of this (other than those that need to stay)
>
> >   {
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> >       bool writable = false;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
> "scaled_max_ptes_none" maybe?
done!
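(For reference, assuming 4K base pages (HPAGE_PMD_ORDER == 9) and the
default max_ptes_none of 511: an order-4 (64K) collapse scales this to
511 >> 5 == 15, i.e. at most 15 of the 16 PTEs may be none/zero for the
collapse to proceed.)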
>
> >
> > -     for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > +     for (_pte = pte; _pte < pte + (1 << order);
> >            _pte++, address += PAGE_SIZE) {
> >               pte_t pteval = ptep_get(_pte);
> >               if (pte_none(pteval) || (pte_present(pteval) &&
> > @@ -568,7 +570,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                       ++none_or_zero;
> >                       if (!userfaultfd_armed(vma) &&
> >                           (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > +                          none_or_zero <= scaled_none)) {
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -596,8 +598,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               /* See hpage_collapse_scan_pmd(). */
> >               if (folio_maybe_mapped_shared(folio)) {
> >                       ++shared;
> > -                     if (cc->is_khugepaged &&
> > -                         shared > khugepaged_max_ptes_shared) {
> > +                     if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > +                         shared > khugepaged_max_ptes_shared)) {
>
> Please add a comment why we do something different with PMD. As
> commenting below, does this deserve a TODO?
>
> >                               result = SCAN_EXCEED_SHARED_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >                               goto out;
> > @@ -698,13 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >                                               struct vm_area_struct *vma,
> >                                               unsigned long address,
> >                                               spinlock_t *ptl,
> > -                                             struct list_head *compound_pagelist)
> > +                                             struct list_head *compound_pagelist,
> > +                                             u8 order)
> >   {
> >       struct folio *src, *tmp;
> >       pte_t *_pte;
> >       pte_t pteval;
> >
> > -     for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > +     for (_pte = pte; _pte < pte + (1 << order);
> >            _pte++, address += PAGE_SIZE) {
> >               pteval = ptep_get(_pte);
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > @@ -751,7 +754,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >                                            pmd_t *pmd,
> >                                            pmd_t orig_pmd,
> >                                            struct vm_area_struct *vma,
> > -                                          struct list_head *compound_pagelist)
> > +                                          struct list_head *compound_pagelist,
> > +                                          u8 order)
> >   {
> >       spinlock_t *pmd_ptl;
> >
> > @@ -768,7 +772,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >        * Release both raw and compound pages isolated
> >        * in __collapse_huge_page_isolate.
> >        */
> > -     release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> > +     release_pte_pages(pte, pte + (1 << order), compound_pagelist);
> >   }
> >
> >   /*
> > @@ -789,7 +793,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >   static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >               pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >               unsigned long address, spinlock_t *ptl,
> > -             struct list_head *compound_pagelist)
> > +             struct list_head *compound_pagelist, u8 order)
> >   {
> >       unsigned int i;
> >       int result = SCAN_SUCCEED;
> > @@ -797,7 +801,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >       /*
> >        * Copying pages' contents is subject to memory poison at any iteration.
> >        */
> > -     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +     for (i = 0; i < (1 << order); i++) {
> >               pte_t pteval = ptep_get(pte + i);
> >               struct page *page = folio_page(folio, i);
> >               unsigned long src_addr = address + i * PAGE_SIZE;
> > @@ -816,10 +820,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >
> >       if (likely(result == SCAN_SUCCEED))
> >               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> > -                                                 compound_pagelist);
> > +                                                 compound_pagelist, order);
> >       else
> >               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> > -                                              compound_pagelist);
> > +                                              compound_pagelist, order);
> >
> >       return result;
> >   }
> > @@ -994,11 +998,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> >   static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >                                      struct vm_area_struct *vma,
> >                                      unsigned long haddr, pmd_t *pmd,
> > -                                    int referenced)
> > +                                    int referenced, u8 order)
> >   {
> >       int swapped_in = 0;
> >       vm_fault_t ret = 0;
> > -     unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> > +     unsigned long address, end = haddr + (PAGE_SIZE << order);
> >       int result;
> >       pte_t *pte = NULL;
> >       spinlock_t *ptl;
> > @@ -1029,6 +1033,15 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >               if (!is_swap_pte(vmf.orig_pte))
> >                       continue;
> >
> > +             /* Dont swapin for mTHP collapse */
>
> Should we turn this into a TODO, because it's something to figure out
> regarding the scaling etc?
Good idea, I changed both of these into TODOs
>
> > +             if (order != HPAGE_PMD_ORDER) {
> > +                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > +                     pte_unmap(pte);
> > +                     mmap_read_unlock(mm);
> > +                     result = SCAN_EXCEED_SWAP_PTE;
> > +                     goto out;
> > +             }
> > +
> >               vmf.pte = pte;
> >               vmf.ptl = ptl;
> >               ret = do_swap_page(&vmf);
> > @@ -1149,7 +1162,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> >               result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                                                  referenced);
>  > +                            referenced, HPAGE_PMD_ORDER);
>
> Indent messed up. Feel free to exceed 80 chars if it aids readability.
Fixed!
>
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> > @@ -1197,7 +1210,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> >       if (pte) {
> >               result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                   &compound_pagelist);
> > +                                     &compound_pagelist, HPAGE_PMD_ORDER);
>
> Dito.
Fixed!
>
>
> Apart from that, nothing jumped at me
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks for the ack! I fixed the compile issue you noted too.
>
> --
> Cheers,
>
> David / dhildenb
>




* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-16 14:02   ` David Hildenbrand
@ 2025-07-17  7:23     ` Nico Pache
  2025-07-17 15:54     ` Lorenzo Stoakes
  1 sibling, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 8:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:31, Nico Pache wrote:
> > Generalize the order of the __collapse_huge_page_* functions
> > to support future mTHP collapse.
> >
> > mTHP collapse can suffer from inconsistent behavior and memory waste
> > "creep". Disable swapin and shared support for mTHP collapse.
> >
> > No functional changes in this patch.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 49 +++++++++++++++++++++++++++++++------------------
> >   1 file changed, 31 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index cc9a35185604..ee54e3c1db4e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -552,15 +552,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                                       unsigned long address,
> >                                       pte_t *pte,
> >                                       struct collapse_control *cc,
> > -                                     struct list_head *compound_pagelist)
> > +                                     struct list_head *compound_pagelist,
> > +                                     u8 order)
> >   {
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> >       bool writable = false;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >
> > -     for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > +     for (_pte = pte; _pte < pte + (1 << order);
> >            _pte++, address += PAGE_SIZE) {
> >               pte_t pteval = ptep_get(_pte);
> >               if (pte_none(pteval) || (pte_present(pteval) &&
> > @@ -568,7 +570,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                       ++none_or_zero;
> >                       if (!userfaultfd_armed(vma) &&
> >                           (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > +                          none_or_zero <= scaled_none)) {
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -596,8 +598,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               /* See hpage_collapse_scan_pmd(). */
> >               if (folio_maybe_mapped_shared(folio)) {
> >                       ++shared;
> > -                     if (cc->is_khugepaged &&
> > -                         shared > khugepaged_max_ptes_shared) {
> > +                     if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > +                         shared > khugepaged_max_ptes_shared)) {
> >                               result = SCAN_EXCEED_SHARED_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >                               goto out;
> > @@ -698,13 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >                                               struct vm_area_struct *vma,
> >                                               unsigned long address,
> >                                               spinlock_t *ptl,
> > -                                             struct list_head *compound_pagelist)
> > +                                             struct list_head *compound_pagelist,
> > +                                             u8 order)
> >   {
> >       struct folio *src, *tmp;
> >       pte_t *_pte;
> >       pte_t pteval;
> >
> > -     for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > +     for (_pte = pte; _pte < pte + (1 << order);
> >            _pte++, address += PAGE_SIZE) {
> >               pteval = ptep_get(_pte);
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > @@ -751,7 +754,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >                                            pmd_t *pmd,
> >                                            pmd_t orig_pmd,
> >                                            struct vm_area_struct *vma,
> > -                                          struct list_head *compound_pagelist)
> > +                                          struct list_head *compound_pagelist,
> > +                                          u8 order)
> >   {
> >       spinlock_t *pmd_ptl;
> >
> > @@ -768,7 +772,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >        * Release both raw and compound pages isolated
> >        * in __collapse_huge_page_isolate.
> >        */
> > -     release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> > +     release_pte_pages(pte, pte + (1 << order), compound_pagelist);
> >   }
> >
> >   /*
> > @@ -789,7 +793,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >   static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >               pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >               unsigned long address, spinlock_t *ptl,
> > -             struct list_head *compound_pagelist)
> > +             struct list_head *compound_pagelist, u8 order)
> >   {
> >       unsigned int i;
> >       int result = SCAN_SUCCEED;
> > @@ -797,7 +801,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >       /*
> >        * Copying pages' contents is subject to memory poison at any iteration.
> >        */
> > -     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +     for (i = 0; i < (1 << order); i++) {
> >               pte_t pteval = ptep_get(pte + i);
> >               struct page *page = folio_page(folio, i);
> >               unsigned long src_addr = address + i * PAGE_SIZE;
> > @@ -816,10 +820,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >
> >       if (likely(result == SCAN_SUCCEED))
> >               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> > -                                                 compound_pagelist);
> > +                                                 compound_pagelist, order);
> >       else
> >               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> > -                                              compound_pagelist);
> > +                                              compound_pagelist, order);
> >
> >       return result;
> >   }
> > @@ -994,11 +998,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> >   static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >                                      struct vm_area_struct *vma,
> >                                      unsigned long haddr, pmd_t *pmd,
> > -                                    int referenced)
> > +                                    int referenced, u8 order)
> >   {
> >       int swapped_in = 0;
> >       vm_fault_t ret = 0;
> > -     unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> > +     unsigned long address, end = haddr + (PAGE_SIZE << order);
> >       int result;
> >       pte_t *pte = NULL;
> >       spinlock_t *ptl;
> > @@ -1029,6 +1033,15 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >               if (!is_swap_pte(vmf.orig_pte))
> >                       continue;
> >
> > +             /* Dont swapin for mTHP collapse */
> > +             if (order != HPAGE_PMD_ORDER) {
> > +                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>
> Doesn't compile. This is introduced way later in this series.
Whoops, I stupidly applied this fixup to the wrong commit.
>
> Using something like
>
> git rebase -i mm/mm-unstable --exec "make -j16"
Ah, I remember you showing me this in the past! I need to start using it
more. Thank you.

>
> You can efficiently make sure that individual patches compile cleanly.
>
> --
> Cheers,
>
> David / dhildenb
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap for mTHP support
  2025-07-16 14:03   ` David Hildenbrand
@ 2025-07-17  7:23     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 8:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:31, Nico Pache wrote:
> > khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> > To add mTHP support we use this scan to instead record chunks of utilized
> > sections of the PMD.
> >
> > collapse_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the anon PMD. A minimum collapse order of 2 is used as this is
> > the lowest order supported by anon memory.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > When attempting to collapse an order that is set to "always", collapse
> > to that order in a greedy manner without considering the number of bits
> > set.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> Any reason this should not be squashed into the actual mTHP collapse patch?
I wanted to keep them separate to conceptually split the bitmap from the
collapse logic, but given you're the second person to point this out, I went
ahead and squashed them and their commit messages into one commit.
>
> In particular
>
> a) The locking changes look weird without the bigger context
>
> b) The compiler complains about unused functions
>
> > ---
> >   include/linux/khugepaged.h |  4 ++
> >   mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
> >   2 files changed, 89 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index ff6120463745..0f957711a117 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> >   /* SPDX-License-Identifier: GPL-2.0 */
> >   #ifndef _LINUX_KHUGEPAGED_H
> >   #define _LINUX_KHUGEPAGED_H
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
>
> "1 << "
Ah, there is a mix of these ("1<<" vs "1 << ") being used across the
kernel. Fixed it, thank you.
>
> > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
> >
> >   extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ee54e3c1db4e..59b2431ca616 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >   static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +     u8 order;
> > +     u16 offset;
> > +};
> > +
> >   struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -102,6 +107,18 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /*
> > +      * bitmap used to collapse mTHP sizes.
> > +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> > +      */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +     .is_khugepaged = true,
> >   };
> >
> >   /**
> > @@ -838,10 +855,6 @@ static void khugepaged_alloc_sleep(void)
> >       remove_wait_queue(&khugepaged_wait, &wait);
> >   }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -     .is_khugepaged = true,
> > -};
> > -
> >   static bool collapse_scan_abort(int nid, struct collapse_control *cc)
> >   {
> >       int i;
> > @@ -1115,7 +1128,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                             int referenced, int unmapped,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, bool *mmap_locked,
> > +                               u8 order, u16 offset)
>
> Indent broken.
Fixed!
>
> >   {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > @@ -1134,8 +1148,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * The allocation can take potentially a long time if it involves
> >        * sync compaction, and we do not need to hold the mmap_lock during
> >        * that. We will recheck the vma after taking it again in write mode.
> > +      * If collapsing mTHPs we may have already released the read_lock.
> >        */
> > -     mmap_read_unlock(mm);
> > +     if (*mmap_locked) {
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> > +     }
> >
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> > @@ -1272,12 +1290,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >   out_up_write:
> >       mmap_write_unlock(mm);
> >   out_nolock:
> > +     *mmap_locked = false;
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >       return result;
> >   }
> >
> > +/* Recursive function to consume the bitmap */
> > +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +                     int referenced, int unmapped, struct collapse_control *cc,
> > +                     bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +     u8 order, next_order;
> > +     u16 offset, mid_offset;
> > +     int num_chunks;
> > +     int bits_set, threshold_bits;
> > +     int top = -1;
> > +     int collapsed = 0;
> > +     int ret;
> > +     struct scan_bit_state state;
> > +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > +
> > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> > +
> > +     while (top >= 0) {
> > +             state = cc->mthp_bitmap_stack[top--];
> > +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> > +             offset = state.offset;
> > +             num_chunks = 1 << (state.order);
> > +             // Skip mTHP orders that are not enabled
>
>
> /* */
>
> Same applies to the other instances.
Thank you! Fixed all instances
>
> --
> Cheers,
>
> David / dhildenb
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap for mTHP support
  2025-07-16 15:38   ` Liam R. Howlett
@ 2025-07-17  7:24     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:24 UTC (permalink / raw)
  To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, david, ziy, baolin.wang, lorenzo.stoakes,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 9:39 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Nico Pache <npache@redhat.com> [250713 20:34]:
> > khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> > To add mTHP support we use this scan to instead record chunks of utilized
> > sections of the PMD.
> >
> > collapse_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the anon PMD. A minimum collapse order of 2 is used as this is
> > the lowest order supported by anon memory.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > When attempting to collapse an order that is set to "always", collapse
> > to that order in a greedy manner without considering the number of bits
> > set.
> >
>
> v7 had talks about having selftests of this code.  You mention you used
> selftests mm in the cover letter but it seems you did not add the
> reproducer that Baolin had?
This was in relation to the MADV_COLLAPSE issue, which we decided to
put off for now.
I can add mTHP tests to my list of future work!


-- Nico
>
> Maybe I missed that?
>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  include/linux/khugepaged.h |  4 ++
> >  mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
> >  2 files changed, 89 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index ff6120463745..0f957711a117 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
> > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
> >
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ee54e3c1db4e..59b2431ca616 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +     u8 order;
> > +     u16 offset;
> > +};
> > +
> >  struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -102,6 +107,18 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /*
> > +      * bitmap used to collapse mTHP sizes.
> > +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> > +      */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +     .is_khugepaged = true,
> >  };
> >
> >  /**
> > @@ -838,10 +855,6 @@ static void khugepaged_alloc_sleep(void)
> >       remove_wait_queue(&khugepaged_wait, &wait);
> >  }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -     .is_khugepaged = true,
> > -};
> > -
> >  static bool collapse_scan_abort(int nid, struct collapse_control *cc)
> >  {
> >       int i;
> > @@ -1115,7 +1128,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                             int referenced, int unmapped,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, bool *mmap_locked,
> > +                               u8 order, u16 offset)
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > @@ -1134,8 +1148,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * The allocation can take potentially a long time if it involves
> >        * sync compaction, and we do not need to hold the mmap_lock during
> >        * that. We will recheck the vma after taking it again in write mode.
> > +      * If collapsing mTHPs we may have already released the read_lock.
> >        */
> > -     mmap_read_unlock(mm);
> > +     if (*mmap_locked) {
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> > +     }
> >
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> > @@ -1272,12 +1290,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  out_up_write:
> >       mmap_write_unlock(mm);
> >  out_nolock:
> > +     *mmap_locked = false;
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >       return result;
> >  }
> >
> > +/* Recursive function to consume the bitmap */
> > +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +                     int referenced, int unmapped, struct collapse_control *cc,
> > +                     bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +     u8 order, next_order;
> > +     u16 offset, mid_offset;
> > +     int num_chunks;
> > +     int bits_set, threshold_bits;
> > +     int top = -1;
> > +     int collapsed = 0;
> > +     int ret;
> > +     struct scan_bit_state state;
> > +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > +
> > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> > +
> > +     while (top >= 0) {
> > +             state = cc->mthp_bitmap_stack[top--];
> > +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> > +             offset = state.offset;
> > +             num_chunks = 1 << (state.order);
> > +             // Skip mTHP orders that are not enabled
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next;
> > +
> > +             // copy the relavant section to a new bitmap
> > +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > +                               MTHP_BITMAP_SIZE);
> > +
> > +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> > +                             >> (HPAGE_PMD_ORDER - state.order);
> > +
> > +             //Check if the region is "almost full" based on the threshold
> > +             if (bits_set > threshold_bits || is_pmd_only
> > +                     || test_bit(order, &huge_anon_orders_always)) {
> > +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +                                     mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += (1 << order);
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next:
> > +             if (state.order > 0) {
> > +                     next_order = state.order - 1;
> > +                     mid_offset = offset + (num_chunks / 2);
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, mid_offset };
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, offset };
> > +                     }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >  static int collapse_scan_pmd(struct mm_struct *mm,
> >                                  struct vm_area_struct *vma,
> >                                  unsigned long address, bool *mmap_locked,
> > @@ -1444,9 +1522,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
> > --
> > 2.50.0
> >
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders
  2025-07-16 14:32   ` David Hildenbrand
@ 2025-07-17  7:24     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 8:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:32, Nico Pache wrote:
> > khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> > some pages being unmapped. Skip these cases until we have a way to check
> > if it's ok to collapse to a smaller mTHP size (like in the case of a
> > partially mapped folio).
> >
> > This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
> >
> > [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 7 ++++++-
> >   1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5d7c5be9097e..a701d9f0f158 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -612,7 +612,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               folio = page_folio(page);
> >               VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > -             /* See hpage_collapse_scan_pmd(). */
> > +             if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> > +                     result = SCAN_PTE_MAPPED_HUGEPAGE;
> > +                     goto out;
> > +             }
>
> Probably worth adding a TODO in the code like
>
> /*
>   * TODO: In some cases of partially-mapped folios, we'd actually
>   * want to collapse.
>   */
Done! Good idea with these TODOs!
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thank you :)
>
> --
> Cheers,
>
> David / dhildenb
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders
  2025-07-16 15:28   ` David Hildenbrand
@ 2025-07-17  7:25     ` Nico Pache
  2025-07-18  8:40       ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-17  7:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 9:28 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:32, Nico Pache wrote:
> > From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Should the subject better be
>
> "mm/khugepaged: enable collapsing mTHPs even when PMD THPs are disabled"
That does read better.
>
> (in general, I assume all subjects should be prefixed by "mm/khugepaged:")
ehhh, seems like there's a mix of "mm/khugepaged", "khugepaged", and
"mm: khugepaged:" being used in other commits. I prefer using
khugepaged as it leaves me more space for the commit title
>
> >
> > We have now allowed mTHP collapse, but thp_vma_allowable_order() still only
> > checks if the PMD-sized mTHP is allowed to collapse. This prevents scanning
> > and collapsing of 64K mTHP when only 64K mTHP is enabled. Thus, we should
> > modify the checks to allow all large orders of anonymous mTHP.
> >
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 13 +++++++++----
> >   1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 7a9c4edf0e23..3772dc0d78ea 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -491,8 +491,11 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> >   {
> >       if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> >           hugepage_pmd_enabled()) {
> > -             if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> > -                                         PMD_ORDER))
> > +             unsigned long orders = vma_is_anonymous(vma) ?
> > +                                     THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
> > +
> > +             if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
> > +                                         orders))
> >                       __khugepaged_enter(vma->vm_mm);
> >       }
> >   }
> > @@ -2624,6 +2627,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> >
> >       vma_iter_init(&vmi, mm, khugepaged_scan.address);
> >       for_each_vma(vmi, vma) {
> > +             unsigned long orders = vma_is_anonymous(vma) ?
> > +                                     THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
> >               unsigned long hstart, hend;
> >
> >               cond_resched();
> > @@ -2631,8 +2636,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> >                       progress++;
> >                       break;
> >               }
> > -             if (!thp_vma_allowable_order(vma, vma->vm_flags,
> > -                                     TVA_ENFORCE_SYSFS, PMD_ORDER)) {
> > +             if (!thp_vma_allowable_orders(vma, vma->vm_flags,
> > +                     TVA_ENFORCE_SYSFS, orders)) {
> >   skip:
> >                       progress++;
> >                       continue;
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thank you for your review :)

>
> --
> Cheers,
>
> David / dhildenb
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-16 14:02   ` David Hildenbrand
  2025-07-17  7:23     ` Nico Pache
@ 2025-07-17 15:54     ` Lorenzo Stoakes
  1 sibling, 0 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 15:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

On Wed, Jul 16, 2025 at 04:02:43PM +0200, David Hildenbrand wrote:
> Doesn't compile. This is introduced way later in this series.
>
> Using something like
>
> git rebase -i mm/mm-unstable --exec "make -j16"
>
> You can efficiently make sure that individual patches compile cleanly.

Just to drive this home - I'm bisecting something and just hit this.

So this isn't just a theoretical thing :)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-07-14  0:32 ` [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-07-18  2:14   ` Baolin Wang
  2025-07-18 22:34     ` Nico Pache
  0 siblings, 1 reply; 51+ messages in thread
From: Baolin Wang @ 2025-07-18  2:14 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
	corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
	peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd



On 2025/7/14 08:32, Nico Pache wrote:
> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a701d9f0f158..7a9c4edf0e23 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1367,6 +1367,23 @@ static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
>   				collapsed += (1 << order);
>   				continue;
>   			}

After doing more testing, I think you need to add the following changes 
after patch 8.

Because when collapsing mTHP, if we encounter a PTE-mapped large folio 
within the PMD range, we should continue scanning to complete that PMD, 
in case there is another mTHP that can be collapsed within that PMD range.

+                       if (ret == SCAN_PTE_MAPPED_HUGEPAGE)
+                               continue;

> +			/*
> +			 * Some ret values indicate all lower order will also
> +			 * fail, dont trying to collapse smaller orders
> +			 */
> +			if (ret == SCAN_EXCEED_NONE_PTE ||
> +				ret == SCAN_EXCEED_SWAP_PTE ||
> +				ret == SCAN_EXCEED_SHARED_PTE ||
> +				ret == SCAN_PTE_NON_PRESENT ||
> +				ret == SCAN_PTE_UFFD_WP ||
> +				ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
> +				ret == SCAN_CGROUP_CHARGE_FAIL ||
> +				ret == SCAN_COPY_MC ||
> +				ret == SCAN_PAGE_LOCK ||
> +				ret == SCAN_PAGE_COUNT)
> +				goto next;
> +			else

Nit: the 'else' statement can be dropped.

> +				break;
>   		}
>   
>   next:



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats
  2025-07-14  0:32 ` [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats Nico Pache
@ 2025-07-18  5:04   ` Baolin Wang
  2025-07-18 21:00     ` Nico Pache
  0 siblings, 1 reply; 51+ messages in thread
From: Baolin Wang @ 2025-07-18  5:04 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
	corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
	peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd



On 2025/7/14 08:32, Nico Pache wrote:
> With mTHP support in place, let's add the per-order mTHP stats for
> exceeding NONE, SWAP, and SHARED.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
>   include/linux/huge_mm.h                    |  3 +++
>   mm/huge_memory.c                           |  7 +++++++
>   mm/khugepaged.c                            | 15 ++++++++++++---
>   4 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 2c523dce6bc7..28c8af61efba 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -658,6 +658,23 @@ nr_anon_partially_mapped
>          an anonymous THP as "partially mapped" and count it here, even though it
>          is not actually partially mapped anymore.
>   
> +collapse_exceed_swap_pte
> +       The number of anonymous THP which contain at least one swap PTE.
> +       Currently khugepaged does not support collapsing mTHP regions that
> +       contain a swap PTE.
> +
> +collapse_exceed_none_pte
> +       The number of anonymous THP which have exceeded the none PTE threshold.
> +       With mTHP collapse, a bitmap is used to gather the state of a PMD region
> +       and is then recursively checked from largest to smallest order against
> +       the scaled max_ptes_none count. This counter indicates that the next
> +       enabled order will be checked.
> +
> +collapse_exceed_shared_pte
> +       The number of anonymous THP which contain at least one shared PTE.
> +       Currently khugepaged does not support collapsing mTHP regions that
> +       contain a shared PTE.
> +
>   As the system ages, allocating huge pages may be expensive as the
>   system uses memory compaction to copy data around memory to free a
>   huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4042078e8cc9..e0a27f80f390 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -141,6 +141,9 @@ enum mthp_stat_item {
>   	MTHP_STAT_SPLIT_DEFERRED,
>   	MTHP_STAT_NR_ANON,
>   	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>   	__MTHP_STAT_COUNT
>   };
>   
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2ed9493df77..57e5699cf638 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -632,6 +632,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>   DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>   DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>   DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +
>   
>   static struct attribute *anon_stats_attrs[] = {
>   	&anon_fault_alloc_attr.attr,
> @@ -648,6 +652,9 @@ static struct attribute *anon_stats_attrs[] = {
>   	&split_deferred_attr.attr,
>   	&nr_anon_attr.attr,
>   	&nr_anon_partially_mapped_attr.attr,
> +	&collapse_exceed_swap_pte_attr.attr,
> +	&collapse_exceed_none_pte_attr.attr,
> +	&collapse_exceed_shared_pte_attr.attr,
>   	NULL,
>   };
>   
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d0c99b86b304..8a5873d0a23a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -594,7 +594,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   				continue;
>   			} else {
>   				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				if (order == HPAGE_PMD_ORDER)
> +					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				else
> +					count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);

Please follow the same logic as other mTHP statistics, meaning there is 
no need to filter out PMD-sized orders, because mTHP also supports 
PMD-sized orders. So logic should be:

if (order == HPAGE_PMD_ORDER)
	count_vm_event(THP_SCAN_EXCEED_NONE_PTE);

count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);

>   				goto out;
>   			}
>   		}
> @@ -623,8 +626,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		/* See khugepaged_scan_pmd(). */
>   		if (folio_maybe_mapped_shared(folio)) {
>   			++shared;
> -			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared)) {
> +			if (order != HPAGE_PMD_ORDER) {
> +				result = SCAN_EXCEED_SHARED_PTE;
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +				goto out;
> +			}

Ditto.

> +
> +			if (cc->is_khugepaged &&
> +				shared > khugepaged_max_ptes_shared) {
>   				result = SCAN_EXCEED_SHARED_PTE;
>   				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>   				goto out;



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders
  2025-07-17  7:25     ` Nico Pache
@ 2025-07-18  8:40       ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-07-18  8:40 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 17.07.25 09:25, Nico Pache wrote:
> On Wed, Jul 16, 2025 at 9:28 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 14.07.25 02:32, Nico Pache wrote:
>>> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> Should the subject better be
>>
>> "mm/khugepaged: enable collapsing mTHPs even when PMD THPs are disabled"
> That does read better.
>>
>> (in general, I assume all subjects should be prefixed by "mm/khugepaged:")
> ehhh, seems like there's a mix of "mm/khugepaged", "khugepaged", and
> "mm: khugepaged:" being used in other commits. I prefer using
> khugepaged as it leaves me more space for the commit title

It's inconsistent, but we generally try to indicate the relevant 
subsystem (mm).  For khugepaged it's probably the case that it's not 
easy to confuse with another subsystem.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats
  2025-07-18  5:04   ` Baolin Wang
@ 2025-07-18 21:00     ` Nico Pache
  2025-07-19  4:42       ` Baolin Wang
  0 siblings, 1 reply; 51+ messages in thread
From: Nico Pache @ 2025-07-18 21:00 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

On Thu, Jul 17, 2025 at 11:05 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/14 08:32, Nico Pache wrote:
> > With mTHP support in place, let's add the per-order mTHP stats for
> > exceeding NONE, SWAP, and SHARED.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
> >   include/linux/huge_mm.h                    |  3 +++
> >   mm/huge_memory.c                           |  7 +++++++
> >   mm/khugepaged.c                            | 15 ++++++++++++---
> >   4 files changed, 39 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 2c523dce6bc7..28c8af61efba 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -658,6 +658,23 @@ nr_anon_partially_mapped
> >          an anonymous THP as "partially mapped" and count it here, even though it
> >          is not actually partially mapped anymore.
> >
> > +collapse_exceed_swap_pte
> > +       The number of anonymous THP which contain at least one swap PTE.
> > +       Currently khugepaged does not support collapsing mTHP regions that
> > +       contain a swap PTE.
> > +
> > +collapse_exceed_none_pte
> > +       The number of anonymous THP which have exceeded the none PTE threshold.
> > +       With mTHP collapse, a bitmap is used to gather the state of a PMD region
> > +       and is then recursively checked from largest to smallest order against
> > +       the scaled max_ptes_none count. This counter indicates that the next
> > +       enabled order will be checked.
> > +
> > +collapse_exceed_shared_pte
> > +       The number of anonymous THP which contain at least one shared PTE.
> > +       Currently khugepaged does not support collapsing mTHP regions that
> > +       contain a shared PTE.
> > +
> >   As the system ages, allocating huge pages may be expensive as the
> >   system uses memory compaction to copy data around memory to free a
> >   huge page for use. There are some counters in ``/proc/vmstat`` to help
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 4042078e8cc9..e0a27f80f390 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -141,6 +141,9 @@ enum mthp_stat_item {
> >       MTHP_STAT_SPLIT_DEFERRED,
> >       MTHP_STAT_NR_ANON,
> >       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> > +     MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> > +     MTHP_STAT_COLLAPSE_EXCEED_NONE,
> > +     MTHP_STAT_COLLAPSE_EXCEED_SHARED,
> >       __MTHP_STAT_COUNT
> >   };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e2ed9493df77..57e5699cf638 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -632,6 +632,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> >   DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> >   DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> >   DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > +
> >
> >   static struct attribute *anon_stats_attrs[] = {
> >       &anon_fault_alloc_attr.attr,
> > @@ -648,6 +652,9 @@ static struct attribute *anon_stats_attrs[] = {
> >       &split_deferred_attr.attr,
> >       &nr_anon_attr.attr,
> >       &nr_anon_partially_mapped_attr.attr,
> > +     &collapse_exceed_swap_pte_attr.attr,
> > +     &collapse_exceed_none_pte_attr.attr,
> > +     &collapse_exceed_shared_pte_attr.attr,
> >       NULL,
> >   };
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d0c99b86b304..8a5873d0a23a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -594,7 +594,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > +                             if (order == HPAGE_PMD_ORDER)
> > +                                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > +                             else
> > +                                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>
> Please follow the same logic as other mTHP statistics, meaning there is
> no need to filter out PMD-sized orders, because mTHP also supports
> PMD-sized orders. So logic should be:
>
> if (order == HPAGE_PMD_ORDER)
>         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>
> count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
Good point-- I will fix that!
>
> >                               goto out;
> >                       }
> >               }
> > @@ -623,8 +626,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               /* See khugepaged_scan_pmd(). */
> >               if (folio_maybe_mapped_shared(folio)) {
> >                       ++shared;
> > -                     if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > -                         shared > khugepaged_max_ptes_shared)) {
> > +                     if (order != HPAGE_PMD_ORDER) {
> > +                             result = SCAN_EXCEED_SHARED_PTE;
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > +                             goto out;
> > +                     }
>
> Ditto.
Thanks!

There is also the SWAP one, which is slightly different, as it is counted
during the scan phase and, in the mTHP case, in the swapin faulting code.
I'm not sure whether we should also increment the per-order counter for the
PMD order during the scan phase, or just leave it as a general vm_event
counter, since it's not attributed to an order during the scan. I believe
the latter is the correct approach: only attribute an order to it in the
__collapse_huge_page_swapin function when it's an mTHP collapse.
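
Concretely, something like this in __collapse_huge_page_swapin() (just a
sketch of the idea; the exact unwind path may differ):

	/* mTHP collapse does not swap in; attribute the bail-out to this order */
	if (order != HPAGE_PMD_ORDER) {
		count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
		result = SCAN_EXCEED_SWAP_PTE;
		goto out;
	}

while the scan phase keeps the order-agnostic
count_vm_event(THP_SCAN_EXCEED_SWAP_PTE), as it does today.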
>
> > +
> > +                     if (cc->is_khugepaged &&
> > +                             shared > khugepaged_max_ptes_shared) {
> >                               result = SCAN_EXCEED_SHARED_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >                               goto out;
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-07-18  2:14   ` Baolin Wang
@ 2025-07-18 22:34     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-18 22:34 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

On Thu, Jul 17, 2025 at 8:15 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/14 08:32, Nico Pache wrote:
> > There are cases where, if an attempted collapse fails, all subsequent
> > orders are guaranteed to also fail. Avoid these collapse attempts by
> > bailing out early.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 17 +++++++++++++++++
> >   1 file changed, 17 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index a701d9f0f158..7a9c4edf0e23 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1367,6 +1367,23 @@ static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> >                               collapsed += (1 << order);
> >                               continue;
> >                       }
>
> After doing more testing, I think you need to add the following changes
> after patch 8.
>
> Because when collapsing mTHP, if we encounter a PTE-mapped large folio
> within the PMD range, we should continue scanning to complete that PMD,
> in case there is another mTHP that can be collapsed within that PMD range.
>
> +                       if (ret == SCAN_PTE_MAPPED_HUGEPAGE)
> +                               continue;
Ah, good call. This patch is meant as an optimization on top of the formal
approach: there are cases where trying to collapse to lower orders ('goto
next') is pointless, but I didn't fully consider the cases where trying the
other items (sections of the PMD) on the stack is still viable (i.e.
'continue'). I'm going to spend some time confirming all the potential
return values that can come from collapse_huge_page, and which ones belong
in each group (goto next, continue, break). This will probably be turned
into a switch statement.
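
Roughly what I have in mind (untested sketch; the exact grouping of the ret
values still needs to be confirmed, and note that a bare 'break' inside the
switch would only leave the switch, so the bail-out case needs a return or a
goto). Using the next_order label I mention below:

		/* SCAN_SUCCEED is still handled right above, as today */
		switch (ret) {
		case SCAN_PTE_MAPPED_HUGEPAGE:
			/* skip this region, keep walking the rest of the PMD */
			continue;
		case SCAN_EXCEED_NONE_PTE:
		case SCAN_EXCEED_SWAP_PTE:
		case SCAN_EXCEED_SHARED_PTE:
		case SCAN_PTE_NON_PRESENT:
		case SCAN_PTE_UFFD_WP:
		case SCAN_ALLOC_HUGE_PAGE_FAIL:
		case SCAN_CGROUP_CHARGE_FAIL:
		case SCAN_COPY_MC:
		case SCAN_PAGE_LOCK:
		case SCAN_PAGE_COUNT:
			/* a smaller order may still succeed in this region */
			goto next_order;
		default:
			/* anything else: stop scanning this PMD, return what we collapsed */
			return collapsed;
		}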
>
> > +                     /*
> > +                      * Some ret values indicate all lower order will also
> > +                      * fail, dont trying to collapse smaller orders
> > +                      */
After reading this comment again, I realized it's rather confusing...
it makes it seem like these ret values are the ones that indicate that
we should not keep trying to collapse to smaller orders. I'll clean up
that comment too.
> > +                     if (ret == SCAN_EXCEED_NONE_PTE ||
> > +                             ret == SCAN_EXCEED_SWAP_PTE ||
> > +                             ret == SCAN_EXCEED_SHARED_PTE ||
> > +                             ret == SCAN_PTE_NON_PRESENT ||
> > +                             ret == SCAN_PTE_UFFD_WP ||
> > +                             ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
> > +                             ret == SCAN_CGROUP_CHARGE_FAIL ||
> > +                             ret == SCAN_COPY_MC ||
> > +                             ret == SCAN_PAGE_LOCK ||
> > +                             ret == SCAN_PAGE_COUNT)
> > +                             goto next;
> > +                     else
>
> Nit: the 'else' statement can be dropped.
>
> > +                             break;
> >               }
> >
> >   next:
While I'm at it, I'll change this to next_order to make it clearer.

Thank you !!
-- Nico
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats
  2025-07-18 21:00     ` Nico Pache
@ 2025-07-19  4:42       ` Baolin Wang
  0 siblings, 0 replies; 51+ messages in thread
From: Baolin Wang @ 2025-07-19  4:42 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd



On 2025/7/19 05:00, Nico Pache wrote:
> On Thu, Jul 17, 2025 at 11:05 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/7/14 08:32, Nico Pache wrote:
>>> With mTHP support in place, let's add the per-order mTHP stats for
>>> exceeding NONE, SWAP, and SHARED.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>    Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
>>>    include/linux/huge_mm.h                    |  3 +++
>>>    mm/huge_memory.c                           |  7 +++++++
>>>    mm/khugepaged.c                            | 15 ++++++++++++---
>>>    4 files changed, 39 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>> index 2c523dce6bc7..28c8af61efba 100644
>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>> @@ -658,6 +658,23 @@ nr_anon_partially_mapped
>>>           an anonymous THP as "partially mapped" and count it here, even though it
>>>           is not actually partially mapped anymore.
>>>
>>> +collapse_exceed_swap_pte
>>> +       The number of anonymous THP which contain at least one swap PTE.
>>> +       Currently khugepaged does not support collapsing mTHP regions that
>>> +       contain a swap PTE.
>>> +
>>> +collapse_exceed_none_pte
>>> +       The number of anonymous THP which have exceeded the none PTE threshold.
>>> +       With mTHP collapse, a bitmap is used to gather the state of a PMD region
>>> +       and is then recursively checked from largest to smallest order against
>>> +       the scaled max_ptes_none count. This counter indicates that the next
>>> +       enabled order will be checked.
>>> +
>>> +collapse_exceed_shared_pte
>>> +       The number of anonymous THP which contain at least one shared PTE.
>>> +       Currently khugepaged does not support collapsing mTHP regions that
>>> +       contain a shared PTE.
>>> +
>>>    As the system ages, allocating huge pages may be expensive as the
>>>    system uses memory compaction to copy data around memory to free a
>>>    huge page for use. There are some counters in ``/proc/vmstat`` to help
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 4042078e8cc9..e0a27f80f390 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -141,6 +141,9 @@ enum mthp_stat_item {
>>>        MTHP_STAT_SPLIT_DEFERRED,
>>>        MTHP_STAT_NR_ANON,
>>>        MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
>>> +     MTHP_STAT_COLLAPSE_EXCEED_SWAP,
>>> +     MTHP_STAT_COLLAPSE_EXCEED_NONE,
>>> +     MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>>>        __MTHP_STAT_COUNT
>>>    };
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index e2ed9493df77..57e5699cf638 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -632,6 +632,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>>>    DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>>>    DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>>>    DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
>>> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>>> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>>> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>>> +
>>>
>>>    static struct attribute *anon_stats_attrs[] = {
>>>        &anon_fault_alloc_attr.attr,
>>> @@ -648,6 +652,9 @@ static struct attribute *anon_stats_attrs[] = {
>>>        &split_deferred_attr.attr,
>>>        &nr_anon_attr.attr,
>>>        &nr_anon_partially_mapped_attr.attr,
>>> +     &collapse_exceed_swap_pte_attr.attr,
>>> +     &collapse_exceed_none_pte_attr.attr,
>>> +     &collapse_exceed_shared_pte_attr.attr,
>>>        NULL,
>>>    };
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d0c99b86b304..8a5873d0a23a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -594,7 +594,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                                continue;
>>>                        } else {
>>>                                result = SCAN_EXCEED_NONE_PTE;
>>> -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>> +                             if (order == HPAGE_PMD_ORDER)
>>> +                                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>> +                             else
>>> +                                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>>
>> Please follow the same logic as other mTHP statistics, meaning there is
>> no need to filter out PMD-sized orders, because mTHP also supports
>> PMD-sized orders. So logic should be:
>>
>> if (order == HPAGE_PMD_ORDER)
>>          count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>
>> count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> Good point-- I will fix that!
>>
>>>                                goto out;
>>>                        }
>>>                }
>>> @@ -623,8 +626,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                /* See khugepaged_scan_pmd(). */
>>>                if (folio_maybe_mapped_shared(folio)) {
>>>                        ++shared;
>>> -                     if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
>>> -                         shared > khugepaged_max_ptes_shared)) {
>>> +                     if (order != HPAGE_PMD_ORDER) {
>>> +                             result = SCAN_EXCEED_SHARED_PTE;
>>> +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>>> +                             goto out;
>>> +                     }
>>
>> Ditto.
> Thanks!
> 
> There is also the SWAP one, which is slightly different, as it is
> counted during the scan phase, and in the mTHP case in the swapin
> faulting code. I'm not sure if during the scan phase we should also
> increment the counter for the PMD order... or just leave it as a
> general vm_event counter, since it's not attributed to an order during
> the scan. I believe the latter is the correct approach: only attribute
> an order to it in the __collapse_huge_page_swapin() function for mTHP
> collapses.

Yes, that latter approach sounds reasonable to me.
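
Concretely, that could look something like this (just a sketch, reusing the
stat names added by this patch): keep the existing vm_event in the scan path,
and only attribute an order in __collapse_huge_page_swapin() for mTHP
collapses:

	/* collapse_scan_pmd(): keep the existing, order-less vm_event (sketch) */
	result = SCAN_EXCEED_SWAP_PTE;
	count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);

	/* __collapse_huge_page_swapin(): attribute an order for mTHP only (sketch) */
	if (order != HPAGE_PMD_ORDER)
		count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);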


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 12/14] khugepaged: improve tracepoints for mTHP orders
  2025-07-14  0:32 ` [PATCH v9 12/14] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-07-22 15:39   ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-07-22 15:39 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On 14.07.25 02:32, Nico Pache wrote:
> Add the order to the tracepoints to give better insight into which order
> khugepaged is operating on.
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

[...]

>   
>   TRACE_EVENT(mm_collapse_huge_page_swapin,
>   
> -	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
> +	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
> +			int order),


Indentation.
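
Presumably that means aligning the continuation with the other arguments,
something like (sketch):

	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
		 int order),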

[...]

> +++ b/mm/khugepaged.c
> @@ -711,13 +711,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   	} else {
>   		result = SCAN_SUCCEED;
>   		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> -						    referenced, writable, result);
> +						    referenced, writable, result,
> +						    order);
>   		return result;
>   	}
>   out:
>   	release_pte_pages(pte, _pte, compound_pagelist);
>   	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> -					    referenced, writable, result);
> +					    referenced, writable, result, order);
>   	return result;
>   }
>   
> @@ -1097,7 +1098,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   
>   	result = SCAN_SUCCEED;
>   out:
> -	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
> +	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
> +						order);

Ditto.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-07-16 15:12   ` Liam R. Howlett
@ 2025-07-23  1:55     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-23  1:55 UTC (permalink / raw)
  To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, david, ziy, baolin.wang, lorenzo.stoakes,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd

On Wed, Jul 16, 2025 at 9:14 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Nico Pache <npache@redhat.com> [250713 20:33]:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create collapse_single_pmd to increase code reuse and create an entry
> > point to these two users.
> >
> > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > collapse_single_pmd function. This introduces a minor behavioral change
> > that addresses what is most likely an undiscovered bug. The current
> > implementation of khugepaged tests collapse_test_exit_or_disable before
> > calling collapse_pte_mapped_thp, but we weren't doing it in the
> > madvise_collapse case. By unifying these two callers, madvise_collapse
> > now also performs this check.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 95 +++++++++++++++++++++++++------------------------
> >  1 file changed, 49 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index eb0babb51868..47a80638af97 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2362,6 +2362,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >       return result;
> >  }
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int collapse_single_pmd(unsigned long addr,
> > +                                struct vm_area_struct *vma, bool *mmap_locked,
> > +                                struct collapse_control *cc)
> > +{
> > +     int result = SCAN_FAIL;
> > +     struct mm_struct *mm = vma->vm_mm;
> > +
> > +     if (!vma_is_anonymous(vma)) {
> > +             struct file *file = get_file(vma->vm_file);
> > +             pgoff_t pgoff = linear_page_index(vma, addr);
> > +
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
>
> Okay, just for my sanity: when we reach this part, mmap_locked will
> be false on return.  Because we set it a bunch more times below... but
> it's always false on return.
>
> Although this is a cleaner implementation of the locking, I'm just not
> sure why you keep flipping the mmap_locked variable here?  We could
> probably get away with a comment that it will always be false.
>
>
> > +             result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > +             fput(file);
> > +             if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > +                     mmap_read_lock(mm);
> > +                     *mmap_locked = true;
> > +                     if (collapse_test_exit_or_disable(mm)) {
> > +                             mmap_read_unlock(mm);
> > +                             *mmap_locked = false;
> > +                             result = SCAN_ANY_PROCESS;
> > +                             goto end;
> > +                     }
> > +                     result = collapse_pte_mapped_thp(mm, addr,
> > +                                                      !cc->is_khugepaged);
> > +                     if (result == SCAN_PMD_MAPPED)
> > +                             result = SCAN_SUCCEED;
> > +                     mmap_read_unlock(mm);
> > +                     *mmap_locked = false;
> > +             }
> > +     } else {
> > +             result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > +     }
> > +     if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > +             ++khugepaged_pages_collapsed;
> > +end:
> > +     return result;
> > +}
> > +
> >  static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> >                                           struct collapse_control *cc)
> >       __releases(&khugepaged_mm_lock)
> > @@ -2436,34 +2480,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> >                       VM_BUG_ON(khugepaged_scan.address < hstart ||
> >                                 khugepaged_scan.address + HPAGE_PMD_SIZE >
> >                                 hend);
> > -                     if (!vma_is_anonymous(vma)) {
> > -                             struct file *file = get_file(vma->vm_file);
> > -                             pgoff_t pgoff = linear_page_index(vma,
> > -                                             khugepaged_scan.address);
> > -
> > -                             mmap_read_unlock(mm);
> > -                             mmap_locked = false;
> > -                             *result = hpage_collapse_scan_file(mm,
> > -                                     khugepaged_scan.address, file, pgoff, cc);
> > -                             fput(file);
> > -                             if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > -                                     mmap_read_lock(mm);
> > -                                     if (hpage_collapse_test_exit_or_disable(mm))
> > -                                             goto breakouterloop;
> > -                                     *result = collapse_pte_mapped_thp(mm,
> > -                                             khugepaged_scan.address, false);
> > -                                     if (*result == SCAN_PMD_MAPPED)
> > -                                             *result = SCAN_SUCCEED;
> > -                                     mmap_read_unlock(mm);
> > -                             }
> > -                     } else {
> > -                             *result = hpage_collapse_scan_pmd(mm, vma,
> > -                                     khugepaged_scan.address, &mmap_locked, cc);
> > -                     }
> > -
> > -                     if (*result == SCAN_SUCCEED)
> > -                             ++khugepaged_pages_collapsed;
> >
> > +                     *result = collapse_single_pmd(khugepaged_scan.address,
> > +                                             vma, &mmap_locked, cc);
> >                       /* move to next address */
> >                       khugepaged_scan.address += HPAGE_PMD_SIZE;
> >                       progress += HPAGE_PMD_NR;
> > @@ -2780,35 +2799,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >               mmap_assert_locked(mm);
> >               memset(cc->node_load, 0, sizeof(cc->node_load));
> >               nodes_clear(cc->alloc_nmask);
> > -             if (!vma_is_anonymous(vma)) {
> > -                     struct file *file = get_file(vma->vm_file);
> > -                     pgoff_t pgoff = linear_page_index(vma, addr);
> >
> > -                     mmap_read_unlock(mm);
> > -                     mmap_locked = false;
> > -                     result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > -                                                       cc);
> > -                     fput(file);
> > -             } else {
> > -                     result = hpage_collapse_scan_pmd(mm, vma, addr,
> > -                                                      &mmap_locked, cc);
> > -             }
> > +             result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > +
> >               if (!mmap_locked)
> >                       *lock_dropped = true;
>
> All of this locking is scary, because there are comments everywhere that
> imply that mmap_locked indicates that the lock was dropped at some
> point, but we are using it to indicate that the lock is currently held -
> which are very different things.
>
> Here, for example, lock_dropped may not be set to true even though we
> have toggled it through collapse_single_pmd() -> collapse_scan_pmd() ->
> ... -> collapse_huge_page().
>
> Maybe these scenarios are safe because of known limitations of what will
> or will not happen, but code paths existing without a comment about
> why they are safe seem like a good way to introduce races later.
You bring up some good points here...

It's actually rather confusing why we have this flag and what purpose
it serves in relation to those comments (other than tracking the
state of the mmap_lock). You are correct that the comments state it
indicates the lock was dropped at some point, but in practice that is
not the case: there are multiple places where the lock is dropped and
reacquired without flipping the flag to false. So in practice I don't
think that's how it actually works.

I did some digging and it was introduced in 50ad2f24b3b4
("mm/khugepaged: propagate enum scan_result codes back to callers"),
and AFAICT it was introduced so we can properly indicate the return
value/state to the parent callers, and those comments are incorrect.

"Since khugepaged_scan_pmd()'s return value already has a specific meaning
    (whether mmap_lock was unlocked or not), add a bool* argument to
    khugepaged_scan_pmd() to retrieve this information."
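
FWIW, spelling the contract out in a comment on collapse_single_pmd()
might help here; a rough sketch (wording is mine, not from the patch):

	/*
	 * @mmap_locked: on entry the caller holds the mmap read lock and
	 * *mmap_locked is true. On return *mmap_locked reflects whether the
	 * read lock is currently held, not whether it was dropped at some
	 * point. Callers that care about "was the lock ever dropped" (e.g.
	 * lock_dropped in madvise_collapse()) must track that separately.
	 */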

Cheers,
-- Nico
>
> >
> > -handle_result:
> >               switch (result) {
> >               case SCAN_SUCCEED:
> >               case SCAN_PMD_MAPPED:
> >                       ++thps;
> >                       break;
> > -             case SCAN_PTE_MAPPED_HUGEPAGE:
> > -                     BUG_ON(mmap_locked);
> > -                     mmap_read_lock(mm);
> > -                     result = collapse_pte_mapped_thp(mm, addr, true);
> > -                     mmap_read_unlock(mm);
> > -                     goto handle_result;
> >               /* Whitelisted set of results where continuing OK */
> > +             case SCAN_PTE_MAPPED_HUGEPAGE:
> >               case SCAN_PMD_NULL:
> >               case SCAN_PTE_NON_PRESENT:
> >               case SCAN_PTE_UFFD_WP:
> > --
> > 2.50.0
> >
> >
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-07-15 15:53   ` David Hildenbrand
@ 2025-07-23  1:56     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-23  1:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, anshuman.khandual, catalin.marinas, tiwai, will,
	dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
	mhocko, rdunlap, hughd

On Tue, Jul 15, 2025 at 9:54 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.25 02:31, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create collapse_single_pmd to increase code reuse and create an entry
> > point to these two users.
> >
> > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > collapse_single_pmd function. This introduces a minor behavioral change
> > that addresses what is most likely an undiscovered bug. The current
> > implementation of khugepaged tests collapse_test_exit_or_disable before
> > calling collapse_pte_mapped_thp, but we weren't doing it in the
> > madvise_collapse case. By unifying these two callers, madvise_collapse
> > now also performs this check.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 95 +++++++++++++++++++++++++------------------------
> >   1 file changed, 49 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index eb0babb51868..47a80638af97 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2362,6 +2362,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >       return result;
> >   }
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int collapse_single_pmd(unsigned long addr,
> > +                                struct vm_area_struct *vma, bool *mmap_locked,
> > +                                struct collapse_control *cc)
>
> Nit: we tend to use two-tabs indent here.
Thanks, I cleaned up the indentation!
>
> Nice cleanup!
>
> Acked-by: David Hildenbrand <david@redhat.com>
Much appreciated :)
>
> --
> Cheers,
>
> David / dhildenb
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-14  0:31 ` [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
  2025-07-16 13:52   ` David Hildenbrand
  2025-07-16 14:02   ` David Hildenbrand
@ 2025-07-25 16:09   ` Lorenzo Stoakes
  2025-07-25 22:37     ` Nico Pache
  2 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 16:09 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

FYI this seems to conflict on mm-new with Dev's "khugepaged: optimize
__collapse_huge_page_copy_succeeded() by PTE batching" patch.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
  2025-07-15 15:39   ` David Hildenbrand
  2025-07-16 14:29   ` Liam R. Howlett
@ 2025-07-25 16:43   ` Lorenzo Stoakes
  2025-07-25 22:35     ` Nico Pache
  2 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 16:43 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

Hm it seems you missed some places:

mm/khugepaged.c: In function ‘collapse_scan_mm_slot’:
mm/khugepaged.c:2466:43: error: implicit declaration of function ‘hpage_collapse_scan_file’; did you mean ‘collapse_scan_file’? [-Wimplicit-function-declaration]
 2466 |                                 *result = hpage_collapse_scan_file(mm,
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~~
      |                                           collapse_scan_file
mm/khugepaged.c:2471:45: error: implicit declaration of function ‘hpage_collapse_test_exit_or_disable’; did you mean ‘collapse_test_exit_or_disable’? [-Wimplicit-function-declaration]
 2471 |                                         if (hpage_collapse_test_exit_or_disable(mm))
      |                                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                             collapse_test_exit_or_disable
mm/khugepaged.c:2480:43: error: implicit declaration of function ‘hpage_collapse_scan_pmd’; did you mean ‘collapse_scan_pmd’? [-Wimplicit-function-declaration]
 2480 |                                 *result = hpage_collapse_scan_pmd(mm, vma,
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~
      |                                           collapse_scan_pmd
mm/khugepaged.c: At top level:
mm/khugepaged.c:2278:12: error: ‘collapse_scan_file’ defined but not used [-Werror=unused-function]
 2278 | static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
      |            ^~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1271:12: error: ‘collapse_scan_pmd’ defined but not used [-Werror=unused-function]
 1271 | static int collapse_scan_pmd(struct mm_struct *mm,
      |            ^~~~~~~~~~~~~~~~~

Other than this it LGTM, so once you fix this stuff up you can get a tag :)

On Sun, Jul 13, 2025 at 06:31:54PM -0600, Nico Pache wrote:
> The hpage_collapse_* functions are used by both madvise_collapse
> and khugepaged. Remove the unnecessary hpage prefix to shorten the
> function names.
>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a55fb1dcd224..eb0babb51868 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
>  	kmem_cache_destroy(mm_slot_cache);
>  }
>
> -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> +static inline int collapse_test_exit(struct mm_struct *mm)
>  {
>  	return atomic_read(&mm->mm_users) == 0;
>  }
>
> -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> +static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>  {
> -	return hpage_collapse_test_exit(mm) ||
> +	return collapse_test_exit(mm) ||
>  	       test_bit(MMF_DISABLE_THP, &mm->flags);
>  }
>
> @@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
>  	int wakeup;
>
>  	/* __khugepaged_exit() must not run from under us */
> -	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> +	VM_BUG_ON_MM(collapse_test_exit(mm), mm);
>  	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
>  		return;
>
> @@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  	} else if (mm_slot) {
>  		/*
>  		 * This is required to serialize against
> -		 * hpage_collapse_test_exit() (which is guaranteed to run
> +		 * collapse_test_exit() (which is guaranteed to run
>  		 * under mmap sem read mode). Stop here (after we return all
>  		 * pagetables will be destroyed) until khugepaged has finished
>  		 * working on the pagetables under the mmap_lock.
> @@ -838,7 +838,7 @@ struct collapse_control khugepaged_collapse_control = {
>  	.is_khugepaged = true,
>  };
>
> -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> +static bool collapse_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
>
> @@ -873,7 +873,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>  }
>
>  #ifdef CONFIG_NUMA
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int collapse_find_target_node(struct collapse_control *cc)
>  {
>  	int nid, target_node = 0, max_value = 0;
>
> @@ -892,7 +892,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
>  	return target_node;
>  }
>  #else
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int collapse_find_target_node(struct collapse_control *cc)
>  {
>  	return 0;
>  }
> @@ -912,7 +912,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>  	struct vm_area_struct *vma;
>  	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>
> -	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> +	if (unlikely(collapse_test_exit_or_disable(mm)))
>  		return SCAN_ANY_PROCESS;
>
>  	*vmap = vma = find_vma(mm, address);
> @@ -985,7 +985,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
> - * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
>   *
>   * Called and returns without pte mapped or spinlocks held.
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> @@ -1071,7 +1071,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  {
>  	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>  		     GFP_TRANSHUGE);
> -	int node = hpage_collapse_find_target_node(cc);
> +	int node = collapse_find_target_node(cc);
>  	struct folio *folio;
>
>  	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> @@ -1257,7 +1257,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	return result;
>  }
>
> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> +static int collapse_scan_pmd(struct mm_struct *mm,
>  				   struct vm_area_struct *vma,
>  				   unsigned long address, bool *mmap_locked,
>  				   struct collapse_control *cc)
> @@ -1371,7 +1371,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		 * hit record.
>  		 */
>  		node = folio_nid(folio);
> -		if (hpage_collapse_scan_abort(node, cc)) {
> +		if (collapse_scan_abort(node, cc)) {
>  			result = SCAN_SCAN_ABORT;
>  			goto out_unmap;
>  		}
> @@ -1440,7 +1440,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>
>  	lockdep_assert_held(&khugepaged_mm_lock);
>
> -	if (hpage_collapse_test_exit(mm)) {
> +	if (collapse_test_exit(mm)) {
>  		/* free mm_slot */
>  		hash_del(&slot->hash);
>  		list_del(&slot->mm_node);
> @@ -1733,7 +1733,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
>  			continue;
>
> -		if (hpage_collapse_test_exit(mm))
> +		if (collapse_test_exit(mm))
>  			continue;
>  		/*
>  		 * When a vma is registered with uffd-wp, we cannot recycle
> @@ -2255,7 +2255,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>
> -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> +static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  				    struct file *file, pgoff_t start,
>  				    struct collapse_control *cc)
>  {
> @@ -2312,7 +2312,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  		}
>
>  		node = folio_nid(folio);
> -		if (hpage_collapse_scan_abort(node, cc)) {
> +		if (collapse_scan_abort(node, cc)) {
>  			result = SCAN_SCAN_ABORT;
>  			folio_put(folio);
>  			break;
> @@ -2362,7 +2362,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>
> -static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> +static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>  					    struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
>  	__acquires(&khugepaged_mm_lock)
> @@ -2400,7 +2400,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  		goto breakouterloop_mmap_lock;
>
>  	progress++;
> -	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> +	if (unlikely(collapse_test_exit_or_disable(mm)))
>  		goto breakouterloop;
>
>  	vma_iter_init(&vmi, mm, khugepaged_scan.address);
> @@ -2408,7 +2408,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  		unsigned long hstart, hend;
>
>  		cond_resched();
> -		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> +		if (unlikely(collapse_test_exit_or_disable(mm))) {
>  			progress++;
>  			break;
>  		}
> @@ -2430,7 +2430,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  			bool mmap_locked = true;
>
>  			cond_resched();
> -			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> +			if (unlikely(collapse_test_exit_or_disable(mm)))
>  				goto breakouterloop;
>
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
> @@ -2490,7 +2490,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  	 * Release the current mm_slot if this mm is about to die, or
>  	 * if we scanned all vmas of this mm.
>  	 */
> -	if (hpage_collapse_test_exit(mm) || !vma) {
> +	if (collapse_test_exit(mm) || !vma) {
>  		/*
>  		 * Make sure that if mm_users is reaching zero while
>  		 * khugepaged runs here, khugepaged_exit will find
> @@ -2544,7 +2544,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
>  			pass_through_head++;
>  		if (khugepaged_has_work() &&
>  		    pass_through_head < 2)
> -			progress += khugepaged_scan_mm_slot(pages - progress,
> +			progress += collapse_scan_mm_slot(pages - progress,
>  							    &result, cc);
>  		else
>  			progress = pages;
> --
> 2.50.0
>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_*
  2025-07-25 16:43   ` Lorenzo Stoakes
@ 2025-07-25 22:35     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-25 22:35 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

On Fri, Jul 25, 2025 at 10:44 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Hm it seems you missed some places:
Hehe, I did notice that after running the git rebase --exec script
David showed me. Already fixed; it will be in v10!
>
> mm/khugepaged.c: In function ‘collapse_scan_mm_slot’:
> mm/khugepaged.c:2466:43: error: implicit declaration of function ‘hpage_collapse_scan_file’; did you mean ‘collapse_scan_file’? [-Wimplicit-function-declaration]
>  2466 |                                 *result = hpage_collapse_scan_file(mm,
>       |                                           ^~~~~~~~~~~~~~~~~~~~~~~~
>       |                                           collapse_scan_file
> mm/khugepaged.c:2471:45: error: implicit declaration of function ‘hpage_collapse_test_exit_or_disable’; did you mean ‘collapse_test_exit_or_disable’? [-Wimplicit-function-declaration]
>  2471 |                                         if (hpage_collapse_test_exit_or_disable(mm))
>       |                                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>       |                                             collapse_test_exit_or_disable
> mm/khugepaged.c:2480:43: error: implicit declaration of function ‘hpage_collapse_scan_pmd’; did you mean ‘collapse_scan_pmd’? [-Wimplicit-function-declaration]
>  2480 |                                 *result = hpage_collapse_scan_pmd(mm, vma,
>       |                                           ^~~~~~~~~~~~~~~~~~~~~~~
>       |                                           collapse_scan_pmd
> mm/khugepaged.c: At top level:
> mm/khugepaged.c:2278:12: error: ‘collapse_scan_file’ defined but not used [-Werror=unused-function]
>  2278 | static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>       |            ^~~~~~~~~~~~~~~~~~
> mm/khugepaged.c:1271:12: error: ‘collapse_scan_pmd’ defined but not used [-Werror=unused-function]
>  1271 | static int collapse_scan_pmd(struct mm_struct *mm,
>       |            ^~~~~~~~~~~~~~~~~
>
> Other than this it LGTM, so once you fix this stuff up you can get a tag :)
Awesome, thanks!
>
> On Sun, Jul 13, 2025 at 06:31:54PM -0600, Nico Pache wrote:
> > The hpage_collapse_* functions are used by both madvise_collapse
> > and khugepaged. Remove the unnecessary hpage prefix to shorten the
> > function names.
> >
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 46 +++++++++++++++++++++++-----------------------
> >  1 file changed, 23 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index a55fb1dcd224..eb0babb51868 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
> >       kmem_cache_destroy(mm_slot_cache);
> >  }
> >
> > -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> > +static inline int collapse_test_exit(struct mm_struct *mm)
> >  {
> >       return atomic_read(&mm->mm_users) == 0;
> >  }
> >
> > -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> > +static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
> >  {
> > -     return hpage_collapse_test_exit(mm) ||
> > +     return collapse_test_exit(mm) ||
> >              test_bit(MMF_DISABLE_THP, &mm->flags);
> >  }
> >
> > @@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
> >       int wakeup;
> >
> >       /* __khugepaged_exit() must not run from under us */
> > -     VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> > +     VM_BUG_ON_MM(collapse_test_exit(mm), mm);
> >       if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
> >               return;
> >
> > @@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
> >       } else if (mm_slot) {
> >               /*
> >                * This is required to serialize against
> > -              * hpage_collapse_test_exit() (which is guaranteed to run
> > +              * collapse_test_exit() (which is guaranteed to run
> >                * under mmap sem read mode). Stop here (after we return all
> >                * pagetables will be destroyed) until khugepaged has finished
> >                * working on the pagetables under the mmap_lock.
> > @@ -838,7 +838,7 @@ struct collapse_control khugepaged_collapse_control = {
> >       .is_khugepaged = true,
> >  };
> >
> > -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> > +static bool collapse_scan_abort(int nid, struct collapse_control *cc)
> >  {
> >       int i;
> >
> > @@ -873,7 +873,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> > +static int collapse_find_target_node(struct collapse_control *cc)
> >  {
> >       int nid, target_node = 0, max_value = 0;
> >
> > @@ -892,7 +892,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> >       return target_node;
> >  }
> >  #else
> > -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> > +static int collapse_find_target_node(struct collapse_control *cc)
> >  {
> >       return 0;
> >  }
> > @@ -912,7 +912,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >       struct vm_area_struct *vma;
> >       unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >
> > -     if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > +     if (unlikely(collapse_test_exit_or_disable(mm)))
> >               return SCAN_ANY_PROCESS;
> >
> >       *vmap = vma = find_vma(mm, address);
> > @@ -985,7 +985,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> >
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> > - * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> > + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> >   *
> >   * Called and returns without pte mapped or spinlocks held.
> >   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> > @@ -1071,7 +1071,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >  {
> >       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> >                    GFP_TRANSHUGE);
> > -     int node = hpage_collapse_find_target_node(cc);
> > +     int node = collapse_find_target_node(cc);
> >       struct folio *folio;
> >
> >       folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> > @@ -1257,7 +1257,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       return result;
> >  }
> >
> > -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > +static int collapse_scan_pmd(struct mm_struct *mm,
> >                                  struct vm_area_struct *vma,
> >                                  unsigned long address, bool *mmap_locked,
> >                                  struct collapse_control *cc)
> > @@ -1371,7 +1371,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >                * hit record.
> >                */
> >               node = folio_nid(folio);
> > -             if (hpage_collapse_scan_abort(node, cc)) {
> > +             if (collapse_scan_abort(node, cc)) {
> >                       result = SCAN_SCAN_ABORT;
> >                       goto out_unmap;
> >               }
> > @@ -1440,7 +1440,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >
> >       lockdep_assert_held(&khugepaged_mm_lock);
> >
> > -     if (hpage_collapse_test_exit(mm)) {
> > +     if (collapse_test_exit(mm)) {
> >               /* free mm_slot */
> >               hash_del(&slot->hash);
> >               list_del(&slot->mm_node);
> > @@ -1733,7 +1733,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> >                       continue;
> >
> > -             if (hpage_collapse_test_exit(mm))
> > +             if (collapse_test_exit(mm))
> >                       continue;
> >               /*
> >                * When a vma is registered with uffd-wp, we cannot recycle
> > @@ -2255,7 +2255,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >       return result;
> >  }
> >
> > -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > +static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >                                   struct file *file, pgoff_t start,
> >                                   struct collapse_control *cc)
> >  {
> > @@ -2312,7 +2312,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >               }
> >
> >               node = folio_nid(folio);
> > -             if (hpage_collapse_scan_abort(node, cc)) {
> > +             if (collapse_scan_abort(node, cc)) {
> >                       result = SCAN_SCAN_ABORT;
> >                       folio_put(folio);
> >                       break;
> > @@ -2362,7 +2362,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >       return result;
> >  }
> >
> > -static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > +static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> >                                           struct collapse_control *cc)
> >       __releases(&khugepaged_mm_lock)
> >       __acquires(&khugepaged_mm_lock)
> > @@ -2400,7 +2400,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >               goto breakouterloop_mmap_lock;
> >
> >       progress++;
> > -     if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > +     if (unlikely(collapse_test_exit_or_disable(mm)))
> >               goto breakouterloop;
> >
> >       vma_iter_init(&vmi, mm, khugepaged_scan.address);
> > @@ -2408,7 +2408,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >               unsigned long hstart, hend;
> >
> >               cond_resched();
> > -             if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> > +             if (unlikely(collapse_test_exit_or_disable(mm))) {
> >                       progress++;
> >                       break;
> >               }
> > @@ -2430,7 +2430,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                       bool mmap_locked = true;
> >
> >                       cond_resched();
> > -                     if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > +                     if (unlikely(collapse_test_exit_or_disable(mm)))
> >                               goto breakouterloop;
> >
> >                       VM_BUG_ON(khugepaged_scan.address < hstart ||
> > @@ -2490,7 +2490,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >        * Release the current mm_slot if this mm is about to die, or
> >        * if we scanned all vmas of this mm.
> >        */
> > -     if (hpage_collapse_test_exit(mm) || !vma) {
> > +     if (collapse_test_exit(mm) || !vma) {
> >               /*
> >                * Make sure that if mm_users is reaching zero while
> >                * khugepaged runs here, khugepaged_exit will find
> > @@ -2544,7 +2544,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
> >                       pass_through_head++;
> >               if (khugepaged_has_work() &&
> >                   pass_through_head < 2)
> > -                     progress += khugepaged_scan_mm_slot(pages - progress,
> > +                     progress += collapse_scan_mm_slot(pages - progress,
> >                                                           &result, cc);
> >               else
> >                       progress = pages;
> > --
> > 2.50.0
> >
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-07-25 16:09   ` Lorenzo Stoakes
@ 2025-07-25 22:37     ` Nico Pache
  0 siblings, 0 replies; 51+ messages in thread
From: Nico Pache @ 2025-07-25 22:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd

On Fri, Jul 25, 2025 at 10:15 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> FYI this seems to conflict on mm-new with Dev's "khugepaged: optimize
> __collapse_huge_page_copy_succeeded() by PTE batching" patch.
Yes, I did notice this last time I pulled. I haven't taken the time to
resolve it, but I will shortly. I want to make sure I do it correctly
and that these two don't conflict in any undesirable way!

Cheers,
-- Nico
>



^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2025-07-25 22:37 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-14  0:31 [PATCH v9 00/14] khugepaged: mTHP support Nico Pache
2025-07-14  0:31 ` [PATCH v9 01/14] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2025-07-15 15:39   ` David Hildenbrand
2025-07-16 14:29   ` Liam R. Howlett
2025-07-16 15:20     ` David Hildenbrand
2025-07-17  7:21     ` Nico Pache
2025-07-25 16:43   ` Lorenzo Stoakes
2025-07-25 22:35     ` Nico Pache
2025-07-14  0:31 ` [PATCH v9 02/14] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-07-15 15:53   ` David Hildenbrand
2025-07-23  1:56     ` Nico Pache
2025-07-16 15:12   ` Liam R. Howlett
2025-07-23  1:55     ` Nico Pache
2025-07-14  0:31 ` [PATCH v9 03/14] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-07-15 15:55   ` David Hildenbrand
2025-07-14  0:31 ` [PATCH v9 04/14] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-07-16 13:46   ` David Hildenbrand
2025-07-17  7:22     ` Nico Pache
2025-07-14  0:31 ` [PATCH v9 05/14] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-07-16 13:52   ` David Hildenbrand
2025-07-17  7:22     ` Nico Pache
2025-07-16 14:02   ` David Hildenbrand
2025-07-17  7:23     ` Nico Pache
2025-07-17 15:54     ` Lorenzo Stoakes
2025-07-25 16:09   ` Lorenzo Stoakes
2025-07-25 22:37     ` Nico Pache
2025-07-14  0:31 ` [PATCH v9 06/14] khugepaged: introduce collapse_scan_bitmap " Nico Pache
2025-07-16 14:03   ` David Hildenbrand
2025-07-17  7:23     ` Nico Pache
2025-07-16 15:38   ` Liam R. Howlett
2025-07-17  7:24     ` Nico Pache
2025-07-14  0:32 ` [PATCH v9 07/14] khugepaged: add " Nico Pache
2025-07-14  0:32 ` [PATCH v9 08/14] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-07-16 14:32   ` David Hildenbrand
2025-07-17  7:24     ` Nico Pache
2025-07-14  0:32 ` [PATCH v9 09/14] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-07-18  2:14   ` Baolin Wang
2025-07-18 22:34     ` Nico Pache
2025-07-14  0:32 ` [PATCH v9 10/14] khugepaged: allow khugepaged to check all anonymous mTHP orders Nico Pache
2025-07-16 15:28   ` David Hildenbrand
2025-07-17  7:25     ` Nico Pache
2025-07-18  8:40       ` David Hildenbrand
2025-07-14  0:32 ` [PATCH v9 11/14] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
2025-07-14  0:32 ` [PATCH v9 12/14] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-07-22 15:39   ` David Hildenbrand
2025-07-14  0:32 ` [PATCH v9 13/14] khugepaged: add per-order mTHP khugepaged stats Nico Pache
2025-07-18  5:04   ` Baolin Wang
2025-07-18 21:00     ` Nico Pache
2025-07-19  4:42       ` Baolin Wang
2025-07-14  0:32 ` [PATCH v9 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-07-15  0:39 ` [PATCH v9 00/14] khugepaged: mTHP support Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).