linux-trace-kernel.vger.kernel.org archive mirror
* [PATCH v11 00/15] khugepaged: mTHP support
@ 2025-09-12  3:27 Nico Pache
  2025-09-12  3:27 ` [PATCH v11 01/15] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
                   ` (16 more replies)
  0 siblings, 17 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:27 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
pages that are occupied (!none/zero). After the PMD scan is done, we do
binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
range. The restriction on max_ptes_none is removed during the scan, to make
sure we account for the whole PMD range. When no mTHP size is enabled, the
legacy behavior of khugepaged is maintained. max_ptes_none is scaled by
the attempted collapse order to determine how full an mTHP must be for
the collapse to occur. If an mTHP collapse is attempted but the range
contains swapped-out or shared pages, we don't perform the collapse. It
is now also possible to collapse to mTHPs without requiring the PMD THP
size to be enabled.

When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (256 with
a 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1
(255) for mTHP collapses to prevent collapse "creep" behavior. This prevents
constantly promoting mTHPs to the next available size, which would occur
because a collapse introduces more non-zero pages that would satisfy the
promotion condition on subsequent scans.

Patch 1:     Refactor/rename hpage_collapse
Patch 2:     Refactoring to combine madvise_collapse and khugepaged
Patch 3-7:   Generalize khugepaged functions for arbitrary orders
Patch 8:     Skip collapsing mTHP to smaller orders
Patch 9-10:  Add per-order mTHP statistics and tracepoints
Patch 11:    Introduce collapse_allowable_orders
Patch 12-14: Introduce bitmap and mTHP collapse support
Patch 15:    Documentation

---------
 Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
   while monitoring a number of stats and tracepoints. The code is
   available here[1] (run in legacy mode for these changes and set mTHP
   sizes to inherit).
   The summary of my testing was that no significant regression was
   noticed through this test. In some cases my changes had better
   collapse latencies and were able to scan more pages in the same
   amount of time/work, but for the most part the results were consistent.
- redis testing. I tested these changes along with my defer changes
  (see followup [2] post for more details). We've decided to get the mTHP
  changes merged first before attempting the defer series.
- some basic testing on 64k page size.
- lots of general use.

V11 Changes:
- Minor nits and cleanup/refactoring
- commit messages updates
- split patches to make them easier to review
- added helper functions for checking allowable orders and max_ptes_none
- simplified anon_vma locking tracking to improve code readability
- Added automatic cap to max_ptes_none for mTHP collapse to prevent "creep"
- changed bitmap tracking to use page-level granularity instead of
  chunk-based; this removes inaccuracies/complexity
- Improved documentation and comments throughout the series
- Refactored collapse_single_pmd() for better readability
- Added proper WARN_ON_ONCE() instead of BUG_ON() for better error handling
- reordered patches to make them easier to review
- merged Baolin's patch to run khugepaged for all orders now that
  collapse_allowable_orders is available

V10: https://lore.kernel.org/lkml/20250819134205.622806-1-npache@redhat.com/
V9 : https://lore.kernel.org/lkml/20250714003207.113275-1-npache@redhat.com/
V8 : https://lore.kernel.org/lkml/20250702055742.102808-1-npache@redhat.com/
V7 : https://lore.kernel.org/lkml/20250515032226.128900-1-npache@redhat.com/
V6 : https://lore.kernel.org/lkml/20250515030312.125567-1-npache@redhat.com/
V5 : https://lore.kernel.org/lkml/20250428181218.85925-1-npache@redhat.com/
V4 : https://lore.kernel.org/lkml/20250417000238.74567-1-npache@redhat.com/
V3 : https://lore.kernel.org/lkml/20250414220557.35388-1-npache@redhat.com/
V2 : https://lore.kernel.org/lkml/20250211003028.213461-1-npache@redhat.com/
V1 : https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/

A big thanks to everyone who has reviewed, tested, and participated in
the development process. It's been a great experience working with all of
you on this long endeavour.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Baolin Wang (1):
  khugepaged: run khugepaged for all orders

Dev Jain (1):
  khugepaged: generalize alloc_charge_folio()

Nico Pache (13):
  khugepaged: rename hpage_collapse_* to collapse_*
  introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce collapse_max_ptes_none helper function
  khugepaged: generalize collapse_huge_page for mTHP collapse
  khugepaged: skip collapsing mTHP to smaller orders
  khugepaged: add per-order mTHP collapse failure statistics
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: introduce collapse_allowable_orders helper function
  khugepaged: Introduce mTHP collapse support
  khugepaged: avoid unnecessary mTHP collapse attempts
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  91 ++-
 include/linux/huge_mm.h                    |   5 +
 include/linux/khugepaged.h                 |   2 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 612 ++++++++++++++-------
 mm/mremap.c                                |   2 +-
 7 files changed, 535 insertions(+), 222 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v11 01/15] khugepaged: rename hpage_collapse_* to collapse_*
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
@ 2025-09-12  3:27 ` Nico Pache
  2025-09-12  3:27 ` [PATCH v11 02/15] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:27 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

The hpage_collapse_* functions are used by both madvise_collapse and
khugepaged. Remove the unnecessary hpage prefix to shorten the
function names.

Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 73 ++++++++++++++++++++++++-------------------------
 mm/mremap.c     |  2 +-
 2 files changed, 37 insertions(+), 38 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index af5f5c80fe4e..40fa6e0a6b2d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -401,14 +401,14 @@ void __init khugepaged_destroy(void)
 	kmem_cache_destroy(mm_slot_cache);
 }
 
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int collapse_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 {
-	return hpage_collapse_test_exit(mm) ||
+	return collapse_test_exit(mm) ||
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
@@ -443,7 +443,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 	int wakeup;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	VM_BUG_ON_MM(collapse_test_exit(mm), mm);
 	if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
 		return;
 
@@ -501,7 +501,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * collapse_test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we return all
 		 * pagetables will be destroyed) until khugepaged has finished
 		 * working on the pagetables under the mmap_lock.
@@ -590,7 +590,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
-		/* See hpage_collapse_scan_pmd(). */
+		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
 			if (cc->is_khugepaged &&
@@ -841,7 +841,7 @@ struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 };
 
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -876,7 +876,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -895,7 +895,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -916,7 +916,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
 				 TVA_FORCED_COLLAPSE;
 
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -989,7 +989,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1075,7 +1075,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
-	int node = hpage_collapse_find_target_node(cc);
+	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
 	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1261,10 +1261,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
-				   struct vm_area_struct *vma,
-				   unsigned long address, bool *mmap_locked,
-				   struct collapse_control *cc)
+static int collapse_scan_pmd(struct mm_struct *mm,
+			     struct vm_area_struct *vma,
+			     unsigned long address, bool *mmap_locked,
+			     struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1372,7 +1372,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1439,7 +1439,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (collapse_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
@@ -1741,7 +1741,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 
-		if (hpage_collapse_test_exit(mm))
+		if (collapse_test_exit(mm))
 			continue;
 		/*
 		 * When a vma is registered with uffd-wp, we cannot recycle
@@ -2263,9 +2263,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
-				    struct file *file, pgoff_t start,
-				    struct collapse_control *cc)
+static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+			      struct file *file, pgoff_t start,
+			      struct collapse_control *cc)
 {
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2320,7 +2320,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			folio_put(folio);
 			break;
@@ -2370,7 +2370,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
+static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2408,7 +2408,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2416,7 +2416,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+		if (unlikely(collapse_test_exit_or_disable(mm))) {
 			progress++;
 			break;
 		}
@@ -2437,7 +2437,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2450,12 +2450,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 
 				mmap_read_unlock(mm);
 				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
+				*result = collapse_scan_file(mm,
 					khugepaged_scan.address, file, pgoff, cc);
 				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
+					if (collapse_test_exit_or_disable(mm))
 						goto breakouterloop;
 					*result = collapse_pte_mapped_thp(mm,
 						khugepaged_scan.address, false);
@@ -2464,7 +2464,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 					mmap_read_unlock(mm);
 				}
 			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
+				*result = collapse_scan_pmd(mm, vma,
 					khugepaged_scan.address, &mmap_locked, cc);
 			}
 
@@ -2497,7 +2497,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (collapse_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2550,8 +2550,8 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 			pass_through_head++;
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
-			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &result, cc);
+			progress += collapse_scan_mm_slot(pages - progress,
+							  &result, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2792,12 +2792,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 			mmap_read_unlock(mm);
 			mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
+			result = collapse_scan_file(mm, addr, file, pgoff, cc);
 			fput(file);
 		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
+			result = collapse_scan_pmd(mm, vma, addr,
+						   &mmap_locked, cc);
 		}
 		if (!mmap_locked)
 			*lock_dropped = true;
diff --git a/mm/mremap.c b/mm/mremap.c
index a562d8cf1eee..c461758c47f5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -241,7 +241,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		goto out;
 	}
 	/*
-	 * Now new_pte is none, so hpage_collapse_scan_file() path can not find
+	 * Now new_pte is none, so collapse_scan_file() path can not find
 	 * this by traversing file->f_mapping, so there is no concurrency with
 	 * retract_page_tables(). In addition, we already hold the exclusive
 	 * mmap_lock, so this new_pte page is stable, so there is no need to get
-- 
2.51.0



* [PATCH v11 02/15] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
  2025-09-12  3:27 ` [PATCH v11 01/15] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2025-09-12  3:27 ` Nico Pache
  2025-09-12  3:27 ` [PATCH v11 03/15] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:27 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.

Create collapse_single_pmd to increase code reuse and create an entry
point to these two users.

Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. This introduces a minor behavioral change
that addresses what is most likely an undiscovered bug. The current
implementation of khugepaged tests collapse_test_exit_or_disable before
calling collapse_pte_mapped_thp, but we weren't doing so in the
madvise_collapse case. By unifying these two callers, madvise_collapse
now also performs this check. We also change the return value to
SCAN_ANY_PROCESS, which properly indicates that this process is no
longer valid to operate on.

We also guard the khugepaged_pages_collapsed variable to ensure it's
only incremented for khugepaged.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 97 ++++++++++++++++++++++++++-----------------------
 1 file changed, 52 insertions(+), 45 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40fa6e0a6b2d..63d2ba4b2b6d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2370,6 +2370,53 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int collapse_single_pmd(unsigned long addr,
+		struct vm_area_struct *vma, bool *mmap_locked,
+		struct collapse_control *cc)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int result;
+	struct file *file;
+	pgoff_t pgoff;
+
+	if (vma_is_anonymous(vma)) {
+		result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
+		goto end;
+	}
+
+	file = get_file(vma->vm_file);
+	pgoff = linear_page_index(vma, addr);
+
+	mmap_read_unlock(mm);
+	*mmap_locked = false;
+	result = collapse_scan_file(mm, addr, file, pgoff, cc);
+	fput(file);
+	if (result != SCAN_PTE_MAPPED_HUGEPAGE)
+		goto end;
+
+	mmap_read_lock(mm);
+	*mmap_locked = true;
+	if (collapse_test_exit_or_disable(mm)) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+		return SCAN_ANY_PROCESS;
+	}
+	result = collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
+	if (result == SCAN_PMD_MAPPED)
+		result = SCAN_SUCCEED;
+	mmap_read_unlock(mm);
+	*mmap_locked = false;
+
+end:
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+	return result;
+}
+
 static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2443,34 +2490,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						      vma, &mmap_locked, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
@@ -2786,34 +2808,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
 		nodes_clear(cc->alloc_nmask);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			result = collapse_scan_file(mm, addr, file, pgoff, cc);
-			fput(file);
-		} else {
-			result = collapse_scan_pmd(mm, vma, addr,
-						   &mmap_locked, cc);
-		}
+		result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
 		if (!mmap_locked)
 			*lock_dropped = true;
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
+		case SCAN_PTE_MAPPED_HUGEPAGE:
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
-- 
2.51.0



* [PATCH v11 03/15] khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
  2025-09-12  3:27 ` [PATCH v11 01/15] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
  2025-09-12  3:27 ` [PATCH v11 02/15] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-09-12  3:27 ` Nico Pache
  2025-09-12  3:27 ` [PATCH v11 04/15] khugepaged: generalize alloc_charge_folio() Nico Pache
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:27 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

For khugepaged to support different mTHP orders, we must generalize
hugepage_vma_revalidate to check that the PMD range is not shared by
another VMA and that the requested order is enabled.

No functional change in this patch. Also correct a comment about the
functionality of the revalidation.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 63d2ba4b2b6d..6dbe2d0683ac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -903,14 +903,13 @@ static int collapse_find_target_node(struct collapse_control *cc)
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
- * before taking mmap_lock.
+ * after taking the mmap_lock again.
  * Returns enum scan_result value.
  */
 
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-				   bool expect_anon,
-				   struct vm_area_struct **vmap,
-				   struct collapse_control *cc)
+		bool expect_anon, struct vm_area_struct **vmap,
+		struct collapse_control *cc, unsigned int order)
 {
 	struct vm_area_struct *vma;
 	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
@@ -923,15 +922,16 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
+	/* Always check the PMD order to ensure it's not shared by another VMA */
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+	if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order)))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
 	 *
-	 * thp_vma_allowable_order may return true for qualified file
+	 * thp_vma_allowable_orders may return true for qualified file
 	 * vmas.
 	 */
 	if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
@@ -1127,7 +1127,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1161,7 +1162,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2797,7 +2799,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
+							 cc, HPAGE_PMD_ORDER);
 			if (result  != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
-- 
2.51.0



* [PATCH v11 04/15] khugepaged: generalize alloc_charge_folio()
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (2 preceding siblings ...)
  2025-09-12  3:27 ` [PATCH v11 03/15] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-09-12  3:27 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 05/15] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:27 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

From: Dev Jain <dev.jain@arm.com>

Pass order to alloc_charge_folio() and update mTHP statistics.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
 include/linux/huge_mm.h                    |  2 ++
 mm/huge_memory.c                           |  4 ++++
 mm/khugepaged.c                            | 17 +++++++++++------
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 1654211cc6cf..13269a0074d4 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -634,6 +634,14 @@ anon_fault_fallback_charge
 	instead falls back to using huge pages with lower orders or
 	small pages even though the allocation was successful.
 
+collapse_alloc
+	is incremented every time a huge page is successfully allocated for a
+	khugepaged collapse.
+
+collapse_alloc_failed
+	is incremented every time a huge page allocation fails during a
+	khugepaged collapse.
+
 zswpout
 	is incremented every time a huge page is swapped out to zswap in one
 	piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a166be872628..d442f45bd458 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,6 +128,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6fc669e11c1..76509e3d845b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -620,6 +620,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -685,6 +687,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6dbe2d0683ac..2dea49522755 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1071,21 +1071,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-			      struct collapse_control *cc)
+		struct collapse_control *cc, unsigned int order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (order == HPAGE_PMD_ORDER)
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}
 
-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (order == HPAGE_PMD_ORDER)
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1122,7 +1127,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1850,7 +1855,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 05/15] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (3 preceding siblings ...)
  2025-09-12  3:27 ` [PATCH v11 04/15] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

Generalize the __collapse_huge_page_* functions to take an arbitrary
order to support future mTHP collapse.

mTHP collapse will not honor the khugepaged_max_ptes_shared or
khugepaged_max_ptes_swap parameters, and will fail if it encounters a
shared or swapped entry.

No functional changes in this patch.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 78 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 48 insertions(+), 30 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2dea49522755..b0ae0b63fc9b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -547,17 +547,17 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
 }
 
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
-					unsigned long address,
-					pte_t *pte,
-					struct collapse_control *cc,
-					struct list_head *compound_pagelist)
+		unsigned long address, pte_t *pte, struct collapse_control *cc,
+		unsigned int order, struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
+	int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+	const unsigned long nr_pages = 1UL << order;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none(pteval) || (pte_present(pteval) &&
@@ -565,7 +565,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= scaled_max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -593,8 +593,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			/*
+			 * TODO: Support shared pages without leading to further
+			 * mTHP collapses. Currently bringing in new pages via
+			 * shared may cause a future higher order collapse on a
+			 * rescan of the same range.
+			 */
+			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared)) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -687,18 +693,18 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 }
 
 static void __collapse_huge_page_copy_succeeded(pte_t *pte,
-						struct vm_area_struct *vma,
-						unsigned long address,
-						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+		struct vm_area_struct *vma, unsigned long address,
+		spinlock_t *ptl, unsigned int order,
+		struct list_head *compound_pagelist)
 {
-	unsigned long end = address + HPAGE_PMD_SIZE;
+	unsigned long end = address + (PAGE_SIZE << order);
 	struct folio *src, *tmp;
 	pte_t pteval;
 	pte_t *_pte;
 	unsigned int nr_ptes;
+	const unsigned long nr_pages = 1UL << order;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
 	     address += nr_ptes * PAGE_SIZE) {
 		nr_ptes = 1;
 		pteval = ptep_get(_pte);
@@ -751,13 +757,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 }
 
 static void __collapse_huge_page_copy_failed(pte_t *pte,
-					     pmd_t *pmd,
-					     pmd_t orig_pmd,
-					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
+		unsigned int order, struct list_head *compound_pagelist)
 {
 	spinlock_t *pmd_ptl;
-
+	const unsigned long nr_pages = 1UL << order;
 	/*
 	 * Re-establish the PMD to point to the original page table
 	 * entry. Restoring PMD needs to be done prior to releasing
@@ -771,7 +775,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
 }
 
 /*
@@ -791,16 +795,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
  */
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
-		unsigned long address, spinlock_t *ptl,
+		unsigned long address, spinlock_t *ptl, unsigned int order,
 		struct list_head *compound_pagelist)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
-
+	const unsigned long nr_pages = 1UL << order;
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -819,10 +823,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    order, compound_pagelist);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 order, compound_pagelist);
 
 	return result;
 }
@@ -995,13 +999,12 @@ static int check_pmd_still_valid(struct mm_struct *mm,
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
-				       struct vm_area_struct *vma,
-				       unsigned long haddr, pmd_t *pmd,
-				       int referenced)
+		struct vm_area_struct *vma, unsigned long haddr,
+		pmd_t *pmd, int referenced, unsigned int order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long address, end = haddr + (PAGE_SIZE << order);
 	int result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1032,6 +1035,19 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
 
+		/*
+		 * TODO: Support swapin without leading to further mTHP
+		 * collapses. Currently bringing in new pages via swapin may
+		 * cause a future higher order collapse on a rescan of the same
+		 * range.
+		 */
+		if (order != HPAGE_PMD_ORDER) {
+			pte_unmap(pte);
+			mmap_read_unlock(mm);
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1152,7 +1168,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+						     referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1200,6 +1216,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      HPAGE_PMD_ORDER,
 						      &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
@@ -1230,6 +1247,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
+					   HPAGE_PMD_ORDER,
 					   &compound_pagelist);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
-- 
2.51.0



* [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (4 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 05/15] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12 13:35   ` Lorenzo Stoakes
  2025-09-12  3:28 ` [PATCH v11 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

The current mechanism for determining mTHP collapse scales the
khugepaged_max_ptes_none value based on the target order. This
introduces an undesirable feedback loop, or "creep", when max_ptes_none
is set to a value greater than HPAGE_PMD_NR / 2.

With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.

To fix this issue, introduce a helper function that caps max_ptes_none
at HPAGE_PMD_NR / 2 - 1 (255 on 4K page size) for mTHP collapses. The
helper also scales max_ptes_none by shifting it right by
(PMD_ORDER - target collapse order).

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0ae0b63fc9b..4587f2def5c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -468,6 +468,26 @@ void __khugepaged_enter(struct mm_struct *mm)
 		wake_up_interruptible(&khugepaged_wait);
 }
 
+/* Returns the scaled max_ptes_none for a given order.
+ * Caps the value to HPAGE_PMD_NR/2 - 1 in the case of mTHP collapse to prevent
+ * a feedback loop. If max_ptes_none is greater than HPAGE_PMD_NR/2, the value
+ * would lead to collapses that introduces 2x more pages than the original
+ * number of pages. On subsequent scans, the max_ptes_none check would be
+ * satisfied and the collapses would continue until the largest order is reached
+ */
+static int collapse_max_ptes_none(unsigned int order)
+{
+	int max_ptes_none;
+
+	if (order != HPAGE_PMD_ORDER &&
+	    khugepaged_max_ptes_none >= HPAGE_PMD_NR/2)
+		max_ptes_none = HPAGE_PMD_NR/2 - 1;
+	else
+		max_ptes_none = khugepaged_max_ptes_none;
+	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
+
+}
+
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
@@ -554,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
-	int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+	int scaled_max_ptes_none = collapse_max_ptes_none(order);
 	const unsigned long nr_pages = 1UL << order;
 
 	for (_pte = pte; _pte < pte + nr_pages;
-- 
2.51.0



* [PATCH v11 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (5 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 08/15] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

Pass an order and offset to collapse_huge_page to support collapsing anon
memory to arbitrary orders within a PMD. order indicates what mTHP size we
are attempting to collapse to, and offset indicates where in the PMD to
start the collapse attempt.

For non-PMD collapse we must leave the anon VMA write-locked until after
we collapse the mTHP: in the PMD case all the pages are isolated, but in
the mTHP case they are not, and we must keep the lock to prevent changes
to the VMA from occurring.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 123 +++++++++++++++++++++++++++++-------------------
 1 file changed, 74 insertions(+), 49 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4587f2def5c1..248947e78a30 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1139,43 +1139,50 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 	return SCAN_SUCCEED;
 }
 
-static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		bool *mmap_locked, unsigned int order, unsigned long offset)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte = NULL, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	bool anon_vma_locked = false;
+	const unsigned long nr_pages = 1UL << order;
+	unsigned long mthp_address = pmd_address + offset * PAGE_SIZE;
 
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(pmd_address & ~HPAGE_PMD_MASK);
 
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
 	 * sync compaction, and we do not need to hold the mmap_lock during
 	 * that. We will recheck the vma after taking it again in write mode.
+	 * If collapsing mTHPs we may have already released the read_lock.
 	 */
-	mmap_read_unlock(mm);
+	if (*mmap_locked) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+	}
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1187,13 +1194,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, mthp_address, pmd,
+						     referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1203,20 +1211,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	vma_start_write(vma);
-	result = check_pmd_still_valid(mm, address, pmd);
+	result = check_pmd_still_valid(mm, pmd_address, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
 	anon_vma_lock_write(vma->anon_vma);
+	anon_vma_locked = true;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, mthp_address,
+				mthp_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
@@ -1228,24 +1236,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * Parallel GUP-fast is fine since GUP-fast will back off when
 	 * it detects PMD is changed.
 	 */
-	_pmd = pmdp_collapse_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, mthp_address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      HPAGE_PMD_ORDER,
-						      &compound_pagelist);
+		result = __collapse_huge_page_isolate(vma, mthp_address, pte, cc,
+						      order, &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1255,21 +1260,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 */
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
-		anon_vma_unlock_write(vma->anon_vma);
 		goto out_up_write;
 	}
 
 	/*
-	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
+	 * For PMD collapse all pages are isolated and locked so anon_vma
+	 * rmap can't run anymore. For mTHP collapse we must hold the lock
 	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	if (order == HPAGE_PMD_ORDER) {
+		anon_vma_unlock_write(vma->anon_vma);
+		anon_vma_locked = false;
+	}
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   HPAGE_PMD_ORDER,
-					   &compound_pagelist);
-	pte_unmap(pte);
+					   vma, mthp_address, pte_ptl,
+					   order, &compound_pagelist);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
 
@@ -1279,27 +1284,48 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
-	_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
-	spin_unlock(pmd_ptl);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
+		_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+		spin_lock(pmd_ptl);
+		WARN_ON_ONCE(!pmd_none(*pmd));
+		folio_add_new_anon_rmap(folio, vma, pmd_address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, pmd_address, pmd, _pmd);
+		update_mmu_cache_pmd(vma, pmd_address, pmd);
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	} else { /* mTHP collapse */
+		mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		WARN_ON_ONCE(!pmd_none(*pmd));
+		folio_ref_add(folio, nr_pages - 1);
+		folio_add_new_anon_rmap(folio, vma, mthp_address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		set_ptes(vma->vm_mm, mthp_address, pte, mthp_pte, nr_pages);
+		update_mmu_cache_range(NULL, vma, mthp_address, pte, nr_pages);
+
+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
 out_up_write:
+	if (anon_vma_locked)
+		anon_vma_unlock_write(vma->anon_vma);
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
 out_nolock:
+	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
@@ -1467,9 +1493,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+					    unmapped, cc, mmap_locked,
+					    HPAGE_PMD_ORDER, 0);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.51.0



* [PATCH v11 08/15] khugepaged: skip collapsing mTHP to smaller orders
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (6 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 09/15] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
if it's OK to collapse to a smaller mTHP size (as in the case of a
partially mapped folio).

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 248947e78a30..ebcc0c85a0d6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -610,6 +610,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+		/*
+		 * TODO: In some cases of partially-mapped folios, we'd actually
+		 * want to collapse.
+		 */
+		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
+
 		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-- 
2.51.0



* [PATCH v11 09/15] khugepaged: add per-order mTHP collapse failure statistics
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (7 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 08/15] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  9:35   ` Baolin Wang
  2025-09-12  3:28 ` [PATCH v11 10/15] khugepaged: improve tracepoints for mTHP orders Nico Pache
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

Add three new mTHP statistics to track collapse failures for different
orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:

- collapse_exceed_swap_pte: Incremented when mTHP collapse fails due to
	swap PTEs

- collapse_exceed_none_pte: Incremented when mTHP collapse fails due to
	exceeding the none PTE threshold for the given order

- collapse_exceed_shared_pte: Incremented when mTHP collapse fails due to
	shared PTEs

These statistics complement the existing THP_SCAN_EXCEED_* events by
providing per-order granularity for mTHP collapse attempts. The stats are
exposed via sysfs under
`/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
supported hugepage size.

As we currently don't support collapsing mTHPs that contain a swap or
shared entry, these statistics track how often mTHP collapses fail due
to those restrictions.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 23 ++++++++++++++++++++++
 include/linux/huge_mm.h                    |  3 +++
 mm/huge_memory.c                           |  7 +++++++
 mm/khugepaged.c                            | 16 ++++++++++++---
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 13269a0074d4..7c71cda8aea1 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -709,6 +709,29 @@ nr_anon_partially_mapped
        an anonymous THP as "partially mapped" and count it here, even though it
        is not actually partially mapped anymore.
 
+collapse_exceed_none_pte
+       The number of anonymous mTHP pte ranges where the number of none PTEs
+       exceeded the max_ptes_none threshold. For mTHP collapse, khugepaged
+       checks a PMD region and tracks which PTEs are present. It then tries
+       to collapse to the largest enabled mTHP size. The allowed number of empty
+       PTEs is the max_ptes_none threshold scaled by the collapse order. This
+       counter records the number of times a collapse attempt was skipped for
+       this reason, and khugepaged moved on to try the next available mTHP size.
+
+collapse_exceed_swap_pte
+       The number of anonymous mTHP pte ranges which contain at least one swap
+       PTE. Currently khugepaged does not support collapsing mTHP regions
+       that contain a swap PTE. This counter can be used to monitor the
+       number of khugepaged mTHP collapses that failed due to the presence
+       of a swap PTE.
+
+collapse_exceed_shared_pte
+       The number of anonymous mTHP pte ranges which contain at least one shared
+       PTE. Currently khugepaged does not support collapsing mTHP pte ranges
+       that contain a shared PTE. This counter can be used to monitor the
+       number of khugepaged mTHP collapses that failed due to the presence
+       of a shared PTE.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d442f45bd458..990622c96c8b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
 	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+	MTHP_STAT_COLLAPSE_EXCEED_NONE,
+	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76509e3d845b..07ea9aafd64c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -638,6 +638,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -654,6 +658,9 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
 	&nr_anon_partially_mapped_attr.attr,
+	&collapse_exceed_swap_pte_attr.attr,
+	&collapse_exceed_none_pte_attr.attr,
+	&collapse_exceed_shared_pte_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ebcc0c85a0d6..8abbe6e4317a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -589,7 +589,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				if (order == HPAGE_PMD_ORDER)
+					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out;
 			}
 		}
@@ -628,10 +630,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			 * shared may cause a future higher order collapse on a
 			 * rescan of the same range.
 			 */
-			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared)) {
+			if (order != HPAGE_PMD_ORDER) {
+				result = SCAN_EXCEED_SHARED_PTE;
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+				goto out;
+			}
+
+			if (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
 				goto out;
 			}
 		}
@@ -1071,6 +1080,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		 * range.
 		 */
 		if (order != HPAGE_PMD_ORDER) {
+			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 			pte_unmap(pte);
 			mmap_read_unlock(mm);
 			result = SCAN_EXCEED_SWAP_PTE;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 10/15] khugepaged: improve tracepoints for mTHP orders
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (8 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 09/15] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 11/15] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
give better insight into the order being operated on.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    |  9 ++++----
 2 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd94d14a2427..19d99b2549e6 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -88,40 +88,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s order=%u",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct folio *folio, int none_or_zero,
-		 int referenced, int status),
+		 int referenced, int status, unsigned int order),
 
-	TP_ARGS(folio, none_or_zero, referenced, status),
+	TP_ARGS(folio, none_or_zero, referenced, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
 		__field(int, none_or_zero)
 		__field(int, referenced)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -129,26 +133,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->none_or_zero = none_or_zero;
 		__entry->referenced = referenced;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s order=%u",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+		 unsigned int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -156,13 +164,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8abbe6e4317a..5b45ef575446 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -720,13 +720,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-						    referenced, result);
+						    referenced, result, order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-					    referenced, result);
+					    referenced, result, order);
 	return result;
 }
 
@@ -1121,7 +1121,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+					   order);
 	return result;
 }
 
@@ -1347,7 +1348,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,
 	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 11/15] khugepaged: introduce collapse_allowable_orders helper function
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (9 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 10/15] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  9:24   ` Baolin Wang
  2025-09-12  3:28 ` [PATCH v11 12/15] khugepaged: Introduce mTHP collapse support Nico Pache
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

Add collapse_allowable_orders() to generalize THP order eligibility. The
function determines which THP orders are permitted based on collapse
context (khugepaged vs madv_collapse).

This consolidates collapse configuration logic and provides a clean
interface for future mTHP collapse support where the orders may be
different.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5b45ef575446..d224fa97281a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -485,7 +485,16 @@ static int collapse_max_ptes_none(unsigned int order)
 	else
 		max_ptes_none = khugepaged_max_ptes_none;
 	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
+}
+
+/* Check what orders are allowed based on the vma and collapse type */
+static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
+			vm_flags_t vm_flags, bool is_khugepaged)
+{
+	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
+	unsigned long orders = BIT(HPAGE_PMD_ORDER);
 
+	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
 
 void khugepaged_enter_vma(struct vm_area_struct *vma,
@@ -493,7 +502,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		if (collapse_allowable_orders(vma, vm_flags, true))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -2557,7 +2566,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!collapse_allowable_orders(vma, vma->vm_flags, true)) {
 skip:
 			progress++;
 			continue;
@@ -2865,7 +2874,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!collapse_allowable_orders(vma, vma->vm_flags, false))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 12/15] khugepaged: Introduce mTHP collapse support
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (10 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 11/15] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 13/15] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

During PMD range scanning, track occupied pages in a bitmap. If mTHPs are
enabled, we remove the max_ptes_none restriction during the scan phase to
avoid missing potential mTHP candidates.

Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
and determine the best eligible order for the collapse. A stack struct is
used instead of traditional recursion. The algorithm splits the bitmap
into smaller chunks to find the best fit mTHP.  max_ptes_none is scaled by
the attempted collapse order to determine how "full" an order must be
before being considered for collapse.

Once we determine which mTHP size fits best in that PMD range, a collapse
is attempted. A minimum collapse order of 2 is used, as this is the lowest
order supported by anon memory.

mTHP collapses reject regions containing swapped out or shared pages.
This is because adding new entries can lead to new none pages, and those
may lead to constant promotion into a higher order (m)THP. A similar
issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse
introduces at least 2x the number of present pages, which on a future
scan would satisfy the promotion condition once again. This issue is
prevented via the collapse_allowable_orders() function.

Currently madv_collapse is not supported and will only attempt PMD
collapse.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/khugepaged.h |   2 +
 mm/khugepaged.c            | 123 ++++++++++++++++++++++++++++++++++---
 2 files changed, 116 insertions(+), 9 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..179ce716e769 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
+#define KHUGEPAGED_MIN_MTHP_ORDER	2
+#define MAX_MTHP_BITMAP_STACK	(1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
 
 #include <linux/mm.h>
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d224fa97281a..8455a02dc3d6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -93,6 +93,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+struct scan_bit_state {
+	u8 order;
+	u16 offset;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -101,6 +106,13 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/*
+	 * bitmap used to collapse mTHP sizes.
+	 */
+	 DECLARE_BITMAP(mthp_bitmap, HPAGE_PMD_NR);
+	 DECLARE_BITMAP(mthp_bitmap_mask, HPAGE_PMD_NR);
+	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_STACK];
 };
 
 /**
@@ -1361,6 +1373,85 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,
 	return result;
 }
 
+static void push_mthp_bitmap_stack(struct collapse_control *cc, int *top,
+				   u8 order, u16 offset)
+{
+	cc->mthp_bitmap_stack[++*top] = (struct scan_bit_state)
+		{ order, offset };
+}
+
+/*
+ * collapse_scan_bitmap() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in the bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. We start at the PMD order and
+ * check if it is eligible for collapse; if not, we add two entries to the
+ * stack at a lower order to represent the left and right halves of the region.
+ *
+ * For each region, we calculate the number of set bits and compare it
+ * against a threshold derived from collapse_max_ptes_none(). A region is
+ * eligible if the number of set bits exceeds this threshold.
+ */
+static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		bool *mmap_locked, unsigned long enabled_orders)
+{
+	u8 order, next_order;
+	u16 offset, mid_offset;
+	int num_chunks;
+	int bits_set, threshold_bits;
+	int top = -1;
+	int collapsed = 0;
+	int ret;
+	struct scan_bit_state state;
+	unsigned int max_none_ptes;
+
+	push_mthp_bitmap_stack(cc, &top, HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0);
+
+	while (top >= 0) {
+		state = cc->mthp_bitmap_stack[top--];
+		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
+		offset = state.offset;
+		num_chunks = 1UL << order;
+
+		/* Skip mTHP orders that are not enabled */
+		if (!test_bit(order, &enabled_orders))
+			goto next_order;
+
+		max_none_ptes = collapse_max_ptes_none(order);
+
+		/* Calculate weight of the range */
+		bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+		bitmap_set(cc->mthp_bitmap_mask, offset, num_chunks);
+		bits_set = bitmap_weight_and(cc->mthp_bitmap,
+					     cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+
+		threshold_bits = (1UL << order) - max_none_ptes - 1;
+
+		/* Check if the region is eligible based on the threshold */
+		if (bits_set > threshold_bits) {
+			ret = collapse_huge_page(mm, address, referenced,
+						 unmapped, cc, mmap_locked,
+						 order, offset);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += 1UL << order;
+				continue;
+			}
+		}
+
+next_order:
+		if (state.order > 0) {
+			next_order = state.order - 1;
+			mid_offset = offset + (num_chunks / 2);
+			push_mthp_bitmap_stack(cc, &top, next_order, mid_offset);
+			push_mthp_bitmap_stack(cc, &top, next_order, offset);
+		}
+	}
+	return collapsed;
+}
+
 static int collapse_scan_pmd(struct mm_struct *mm,
 			     struct vm_area_struct *vma,
 			     unsigned long address, bool *mmap_locked,
@@ -1368,30 +1459,39 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
-	int none_or_zero = 0, shared = 0;
+	int none_or_zero = 0, shared = 0, nr_collapsed = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long _address;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
-
+	bool is_pmd_only;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
+
+	is_pmd_only = enabled_orders == _BITUL(HPAGE_PMD_ORDER);
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte) {
 		result = SCAN_PMD_NULL;
 		goto out;
 	}
 
-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, _address += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		_pte = pte + i;
+		_address = address + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
@@ -1416,8 +1516,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			    (!cc->is_khugepaged || !is_pmd_only ||
+				none_or_zero <= khugepaged_max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1425,6 +1525,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 				goto out_unmap;
 			}
 		}
+		/* Set bit for occupied pages */
+		bitmap_set(cc->mthp_bitmap, i, 1);
 		if (pte_uffd_wp(pteval)) {
 			/*
 			 * Don't collapse the page if any of the small
@@ -1521,9 +1623,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc, mmap_locked,
-					    HPAGE_PMD_ORDER, 0);
+		nr_collapsed = collapse_scan_bitmap(mm, address, referenced, unmapped,
+					      cc, mmap_locked, enabled_orders);
+		if (nr_collapsed > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 13/15] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (11 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 12/15] khugepaged: Introduce mTHP collapse support Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 14/15] khugepaged: run khugepaged for all orders Nico Pache
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8455a02dc3d6..ead07ccac351 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1435,10 +1435,39 @@ static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
 			ret = collapse_huge_page(mm, address, referenced,
 						 unmapped, cc, mmap_locked,
 						 order, offset);
-			if (ret == SCAN_SUCCEED) {
+
+			/*
+			 * Analyze failure reason to determine next action:
+			 * - goto next_order: try smaller orders in same region
+			 * - continue: try other regions at same order
+			 * - break: stop all attempts (system-wide failure)
+			 */
+			switch (ret) {
+			/* Cases where we should continue to the next region */
+			case SCAN_SUCCEED:
 				collapsed += 1UL << order;
+				fallthrough;
+			case SCAN_PTE_MAPPED_HUGEPAGE:
 				continue;
+			/* Cases where lower orders might still succeed */
+			case SCAN_LACK_REFERENCED_PAGE:
+			case SCAN_EXCEED_NONE_PTE:
+			case SCAN_EXCEED_SWAP_PTE:
+			case SCAN_EXCEED_SHARED_PTE:
+			case SCAN_PAGE_LOCK:
+			case SCAN_PAGE_COUNT:
+			case SCAN_PAGE_LRU:
+			case SCAN_PAGE_NULL:
+			case SCAN_DEL_PAGE_LRU:
+			case SCAN_PTE_NON_PRESENT:
+			case SCAN_PTE_UFFD_WP:
+			case SCAN_ALLOC_HUGE_PAGE_FAIL:
+				goto next_order;
+			/* All other cases should stop collapse attempts */
+			default:
+				break;
 			}
+			break;
 		}
 
 next_order:
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 14/15] khugepaged: run khugepaged for all orders
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (12 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 13/15] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  3:28 ` [PATCH v11 15/15] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

From: Baolin Wang <baolin.wang@linux.alibaba.com>

If any (m)THP order is enabled, we should allow khugepaged to run and
attempt scanning and collapsing mTHPs. For khugepaged to operate when
only mTHP sizes are specified in sysfs, we must modify the predicate
function that determines whether it ought to run.

This function is currently called hugepage_pmd_enabled(); this patch
renames it to hugepage_enabled() and updates the logic to check whether
any valid orders exist that would justify khugepaged running.

We must also update collapse_allowable_orders() to check all orders when
the vma is anonymous and the collapse was initiated by khugepaged.

After this patch khugepaged mTHP collapse is fully enabled.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ead07ccac351..1c7f3224234e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -424,23 +424,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
 	 * hugepages, when configured in, are determined by the global control.
-	 * Anon pmd-sized hugepages are determined by the pmd-size control.
+	 * Anon hugepages are determined by its per-size mTHP control.
 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
 	 */
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 	    hugepage_global_enabled())
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+	if (READ_ONCE(huge_anon_orders_always))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+	if (READ_ONCE(huge_anon_orders_madvise))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+	if (READ_ONCE(huge_anon_orders_inherit) &&
 	    hugepage_global_enabled())
 		return true;
 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -504,7 +504,8 @@ static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
 			vm_flags_t vm_flags, bool is_khugepaged)
 {
 	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
-	unsigned long orders = BIT(HPAGE_PMD_ORDER);
+	unsigned long orders = is_khugepaged && vma_is_anonymous(vma) ?
+				THP_ORDERS_ALL_ANON : BIT(HPAGE_PMD_ORDER);
 
 	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
@@ -513,7 +514,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    hugepage_pmd_enabled()) {
+	    hugepage_enabled()) {
 		if (collapse_allowable_orders(vma, vm_flags, true))
 			__khugepaged_enter(vma->vm_mm);
 	}
@@ -2776,7 +2777,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
 
 static int khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
 }
 
 static int khugepaged_wait_event(void)
@@ -2849,7 +2850,7 @@ static void khugepaged_wait_work(void)
 		return;
 	}
 
-	if (hugepage_pmd_enabled())
+	if (hugepage_enabled())
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
 }
 
@@ -2880,7 +2881,7 @@ static void set_recommended_min_free_kbytes(void)
 	int nr_zones = 0;
 	unsigned long recommended_min;
 
-	if (!hugepage_pmd_enabled()) {
+	if (!hugepage_enabled()) {
 		calculate_min_free_kbytes();
 		goto update_wmarks;
 	}
@@ -2930,7 +2931,7 @@ int start_stop_khugepaged(void)
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled()) {
+	if (hugepage_enabled()) {
 		if (!khugepaged_thread)
 			khugepaged_thread = kthread_run(khugepaged, NULL,
 							"khugepaged");
@@ -2956,7 +2957,7 @@ int start_stop_khugepaged(void)
 void khugepaged_min_free_kbytes_update(void)
 {
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_pmd_enabled() && khugepaged_thread)
+	if (hugepage_enabled() && khugepaged_thread)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v11 15/15] Documentation: mm: update the admin guide for mTHP collapse
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (13 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 14/15] khugepaged: run khugepaged for all orders Nico Pache
@ 2025-09-12  3:28 ` Nico Pache
  2025-09-12  8:43 ` [PATCH v11 00/15] khugepaged: mTHP support Lorenzo Stoakes
  2025-09-12 12:19 ` Kiryl Shutsemau
  16 siblings, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12  3:28 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kas, aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato, Bagas Sanjaya

Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to utilize it.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 60 +++++++++++++---------
 1 file changed, 37 insertions(+), 23 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7c71cda8aea1..b3da713f7837 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,8 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages of either PMD size
+or mTHP sizes, if the system is configured to do so.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -212,17 +213,17 @@ PMD-mappable transparent hugepage::
 All THPs at fault and collapse time will be added to _deferred_list,
 and will therefore be split under memory presure if they are considered
 "underused". A THP is underused if the number of zero-filled pages in
-the THP is above max_ptes_none (see below). It is possible to disable
-this behaviour by writing 0 to shrink_underused, and enable it by writing
-1 to it::
+the THP is above max_ptes_none (see below) scaled by the THP order. It is
+possible to disable this behaviour by writing 0 to shrink_underused, and enable
+it by writing 1 to it::
 
 	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when any THP size is enabled
 (either of the per-size anon control or the top-level control are set
 to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
+all THP sizes are disabled (when both the per-size anon control and the
 top-level control are "never")
 
 process THP controls
@@ -264,11 +265,6 @@ support the following arguments::
 Khugepaged controls
 -------------------
 
-.. note::
-   khugepaged currently only searches for opportunities to collapse to
-   PMD-sized THP and no attempt is made to collapse to other THP
-   sizes.
-
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
 The khugepaged progress can be seen in the number of pages collapsed (note
 that this counter may not be an exact count of the number of pages
 collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
-being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
-be interpreted roughly as a sign of progress, and counters in /proc/vmstat
-consulted for more accurate accounting)::
+being replaced by a PMD mapping, or (2) physical pages replaced by a single
+hugepage of some size (PMD-sized or mTHP). Each may happen independently,
+or together, depending on the type of memory and the failures that occur.
+As such, this value should be interpreted roughly as a sign of progress,
+and counters in /proc/vmstat consulted for more accurate accounting)::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
 
@@ -308,16 +304,25 @@ for each pass::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
 
-``max_ptes_none`` specifies how many extra small pages (that are
-not already mapped) can be allocated when collapsing a group
-of small pages into one large page::
+``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
+when collapsing a group of small pages into one large page. This parameter
+is scaled by the page order of the attempted collapse to determine eligibility::
 
 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
 
-A higher value leads to use additional memory for programs.
-A lower value leads to gain less thp performance. Value of
-max_ptes_none can waste cpu time very little, you can
-ignore it.
+For PMD-sized THP collapse, this directly limits the number of empty pages
+allowed in the 2MB region. For mTHP collapse, the threshold is scaled by
+the order (e.g., for 64K mTHP, the threshold is max_ptes_none >> 5).
+
+To prevent "creeping" behavior, where collapses continuously promote a region
+to larger orders, max_ptes_none is capped to HPAGE_PMD_NR/2 - 1 for mTHP
+collapses whenever it is >= HPAGE_PMD_NR/2 (256 with 4K base pages).
+Otherwise, a collapse that leaves more than half of the pages non-empty would
+satisfy the eligibility check at the next larger order on a subsequent scan.
+
+A higher value allows more empty pages, potentially leading to more memory
+usage but better THP performance. A lower value is more conservative and
+may result in fewer THP collapses.
 
 ``max_ptes_swap`` specifies how many pages can be brought in from
 swap when collapsing a group of pages into a transparent huge page::
@@ -337,6 +342,15 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+.. note::
+   For mTHP collapse, khugepaged does not support collapsing regions that
+   contain shared or swapped out pages, as this could lead to continuous
+   promotion to higher orders. The collapse will fail if any shared or
+   swapped PTEs are encountered during the scan.
+
+   Currently, madvise_collapse only supports collapsing to PMD-sized THPs
+   and does not attempt mTHP collapses.
+
 Boot parameters
 ===============
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (14 preceding siblings ...)
  2025-09-12  3:28 ` [PATCH v11 15/15] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-09-12  8:43 ` Lorenzo Stoakes
  2025-09-12 12:19 ` Kiryl Shutsemau
  16 siblings, 0 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12  8:43 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kas, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato

Hi Nico,

I will take a look at this, but just a brief hint/plea to maybe relax a bit
on the respins, there was some ongoing discussion on the v10 yesterday, and
then today a new, huge v11 comes along :)

THP workload has been crazy this cycle, and this series in particular needs
a lot of attention, so it'd be good perhaps to make sure everything is
replied to and give it a day or two at least before sending out the next
revision.

I feel like we probably need to make some changes to how THP works as a
whole, perhaps mandating no series in the merge window, or something to
help even things out a little as I feel David and I have been somewhat
overwhelmed this cycle, and we do need to be sensible with how we handle
workload.

(Plus I'm going to Kernel Recipes and then on leave for a couple weeks
after that soon so I am hoping here that David isn't left with too much
workload alone for that time also :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 11/15] khugepaged: introduce collapse_allowable_orders helper function
  2025-09-12  3:28 ` [PATCH v11 11/15] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2025-09-12  9:24   ` Baolin Wang
  0 siblings, 0 replies; 79+ messages in thread
From: Baolin Wang @ 2025-09-12  9:24 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
	corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
	peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kas, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato



On 2025/9/12 11:28, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
> 
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> ---
>   mm/khugepaged.c | 15 ++++++++++++---
>   1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5b45ef575446..d224fa97281a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -485,7 +485,16 @@ static int collapse_max_ptes_none(unsigned int order)
>   	else
>   		max_ptes_none = khugepaged_max_ptes_none;
>   	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> +}
> +
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +			vm_flags_t vm_flags, bool is_khugepaged)
> +{
> +	enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);
>   
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>   }
>   
>   void khugepaged_enter_vma(struct vm_area_struct *vma,
> @@ -493,7 +502,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>   {
>   	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>   	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, true))
>   			__khugepaged_enter(vma->vm_mm);
>   	}
>   }
> @@ -2557,7 +2566,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>   			progress++;
>   			break;
>   		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, true)) {
>   skip:
>   			progress++;
>   			continue;
> @@ -2865,7 +2874,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   	BUG_ON(vma->vm_start > start);
>   	BUG_ON(vma->vm_end < end);
>   
> -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> +	if (!collapse_allowable_orders(vma, vma->vm_flags, false))
>   		return -EINVAL;
>   
>   	cc = kmalloc(sizeof(*cc), GFP_KERNEL);


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 09/15] khugepaged: add per-order mTHP collapse failure statistics
  2025-09-12  3:28 ` [PATCH v11 09/15] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2025-09-12  9:35   ` Baolin Wang
  0 siblings, 0 replies; 79+ messages in thread
From: Baolin Wang @ 2025-09-12  9:35 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
	corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
	peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kas, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato



On 2025/9/12 11:28, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> 
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> 	PTEs
> 
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>    	exceeding the none PTE threshold for the given order
> 
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
>    	PTEs
> 
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
> 
> As we currently dont support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
                   ` (15 preceding siblings ...)
  2025-09-12  8:43 ` [PATCH v11 00/15] khugepaged: mTHP support Lorenzo Stoakes
@ 2025-09-12 12:19 ` Kiryl Shutsemau
  2025-09-12 12:25   ` David Hildenbrand
  2025-09-12 13:47   ` David Hildenbrand
  16 siblings, 2 replies; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-12 12:19 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> pages that are occupied (!none/zero). After the PMD scan is done, we do
> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> range. The restriction on max_ptes_none is removed during the scan, to make
> sure we account for the whole PMD range. When no mTHP size is enabled, the
> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> by the attempted collapse order to determine how full a mTHP must be to be
> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> contains swapped out, or shared pages, we don't perform the collapse. It is
> now also possible to collapse to mTHPs without requiring the PMD THP size
> to be enabled.
> 
> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> mTHP collapses to prevent collapse "creep" behavior. This prevents
> constantly promoting mTHPs to the next available size, which would occur
> because a collapse introduces more non-zero pages that would satisfy the
> promotion condition on subsequent scans.

Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 12:19 ` Kiryl Shutsemau
@ 2025-09-12 12:25   ` David Hildenbrand
  2025-09-12 13:37     ` Johannes Weiner
  2025-09-12 23:31     ` Nico Pache
  2025-09-12 13:47   ` David Hildenbrand
  1 sibling, 2 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 12:25 UTC (permalink / raw)
  To: Kiryl Shutsemau, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 12.09.25 14:19, Kiryl Shutsemau wrote:
> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>> The following series provides khugepaged with the capability to collapse
>> anonymous memory regions to mTHPs.
>>
>> To achieve this we generalize the khugepaged functions to no longer depend
>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>> range. The restriction on max_ptes_none is removed during the scan, to make
>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>> by the attempted collapse order to determine how full a mTHP must be to be
>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>> contains swapped out, or shared pages, we don't perform the collapse. It is
>> now also possible to collapse to mTHPs without requiring the PMD THP size
>> to be enabled.
>>
>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>> constantly promoting mTHPs to the next available size, which would occur
>> because a collapse introduces more non-zero pages that would satisfy the
>> promotion condition on subsequent scans.
> 
> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> 

I am all for not adding any more ugliness on top of all the ugliness we 
added in the past.

I will soon propose deprecating that parameter in favor of something 
that makes a bit more sense.

In essence, we'll likely have an "eagerness" parameter that ranges from 
0 to 10. 10 is essentially "always collapse" and 0 "never collapse if 
not all is populated".

In between we will have more flexibility on how to set these values.

Likely 9 will be around 50% to not even motivate the user to set 
something that does not make sense (creep).

Of course, the old parameter will have to stick around in compat mode.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function
  2025-09-12  3:28 ` [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
@ 2025-09-12 13:35   ` Lorenzo Stoakes
  2025-09-12 23:26     ` Nico Pache
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 13:35 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kas, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato

On Thu, Sep 11, 2025 at 09:28:01PM -0600, Nico Pache wrote:
> The current mechanism for determining mTHP collapse scales the
> khugepaged_max_ptes_none value based on the target order. This
> introduces an undesirable feedback loop, or "creep", when max_ptes_none
> is set to a value greater than HPAGE_PMD_NR / 2.
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that caps the max_ptes_none
> to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> the max_ptes_none number by the (PMD_ORDER - target collapse order).

I would say very clearly that this is only in the mTHP case.


>
> Signed-off-by: Nico Pache <npache@redhat.com>

Hmm I thought we were going to wait for David to investigate different
approaches to this?

This is another issue with quickly going to another iteration. Though I do think
David explicitly said he'd come back with a solution?

So I'm not sure why we're seeing this solution here? Unless I'm missing
something?

> ---
>  mm/khugepaged.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b0ae0b63fc9b..4587f2def5c1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -468,6 +468,26 @@ void __khugepaged_enter(struct mm_struct *mm)
>  		wake_up_interruptible(&khugepaged_wait);
>  }
>
> +/* Returns the scaled max_ptes_none for a given order.

We don't start comments at the /*, please use a normal comment format like:

/*
 * xxxx
 */

> + * Caps the value to HPAGE_PMD_NR/2 - 1 in the case of mTHP collapse to prevent

This is super unclear.

It start with 'caps the xxx' which seems like you're talking generally.

You should say very clearly 'For PMD allocations we apply the
khugepaged_max_ptes_none parameter as normal. For mTHP ... [details about mTHP].

> + * a feedback loop. If max_ptes_none is greater than HPAGE_PMD_NR/2, the value
> + * would lead to collapses that introduces 2x more pages than the original
> + * number of pages. On subsequent scans, the max_ptes_none check would be
> + * satisfied and the collapses would continue until the largest order is reached
> + */

This is a super vague explanation. Please describe the issue with creep more
clearly.

Also aren't we saying that 511 or 0 are the sensible choices? But now somehow
that's not the case?

You're also not giving a kdoc info on what this returns.

> +static int collapse_max_ptes_none(unsigned int order)

It's a problem that existed already, but khugepaged_max_ptes_none is an unsigned
int and this returns int.

Maybe we should fix this while we're at it...

> +{
> +	int max_ptes_none;
> +
> +	if (order != HPAGE_PMD_ORDER &&
> +	    khugepaged_max_ptes_none >= HPAGE_PMD_NR/2)
> +		max_ptes_none = HPAGE_PMD_NR/2 - 1;
> +	else
> +		max_ptes_none = khugepaged_max_ptes_none;
> +	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> +
> +}
> +

I really don't like this formulation, you're making it unnecessarily unclear and
now, for the super common case of PMD size, you have to figure out 'oh it's this
second branch and we're subtracting HPAGE_PMD_ORDER from HPAGE_PMD_ORDER so just
return khugepaged_max_ptes_none'. When we could... just return it no?

So something like:

#define MAX_PTES_NONE_MTHP_CAP (HPAGE_PMD_NR / 2 - 1)

static unsigned int collapse_max_ptes_none(unsigned int order)
{
	unsigned int max_ptes_none_pmd;

	/* PMD-sized THPs behave precisely the same as before. */
	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/*
	* Bizarrely, this is expressed in terms of PTEs were this PMD-sized.
	* For the reasons stated above, we cap this value in the case of mTHP.
	*/
	max_ptes_none_pmd = MIN(MAX_PTES_NONE_MTHP_CAP,
		khugepaged_max_ptes_none);

	/* Apply PMD -> mTHP scaling. */
	return max_ptes_none_pmd >> (HPAGE_PMD_ORDER - order);
}

>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
> @@ -554,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	struct folio *folio = NULL;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> -	int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> +	int scaled_max_ptes_none = collapse_max_ptes_none(order);
>  	const unsigned long nr_pages = 1UL << order;
>
>  	for (_pte = pte; _pte < pte + nr_pages;
> --
> 2.51.0
>

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 12:25   ` David Hildenbrand
@ 2025-09-12 13:37     ` Johannes Weiner
  2025-09-12 13:46       ` David Hildenbrand
  2025-09-12 23:31     ` Nico Pache
  1 sibling, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2025-09-12 13:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh, pfalcato

On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >> The following series provides khugepaged with the capability to collapse
> >> anonymous memory regions to mTHPs.
> >>
> >> To achieve this we generalize the khugepaged functions to no longer depend
> >> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >> range. The restriction on max_ptes_none is removed during the scan, to make
> >> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >> by the attempted collapse order to determine how full a mTHP must be to be
> >> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >> contains swapped out, or shared pages, we don't perform the collapse. It is
> >> now also possible to collapse to mTHPs without requiring the PMD THP size
> >> to be enabled.
> >>
> >> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >> constantly promoting mTHPs to the next available size, which would occur
> >> because a collapse introduces more non-zero pages that would satisfy the
> >> promotion condition on subsequent scans.
> > 
> > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > 
> 
> I am all for not adding any more ugliness on top of all the ugliness we 
> added in the past.
> 
> I will soon propose deprecating that parameter in favor of something 
> that makes a bit more sense.
> 
> In essence, we'll likely have an "eagerness" parameter that ranges from 
> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if 
> not all is populated".
> 
> In between we will have more flexibility on how to set these values.
> 
> Likely 9 will be around 50% to not even motivate the user to set 
> something that does not make sense (creep).

One observation we've had from production experiments is that the
optimal number here isn't static. If you have plenty of memory, then
even very sparse THPs are beneficial.

An extreme example: if all your THPs have 2/512 pages populated,
that's still cutting TLB pressure in half!

So in the absence of memory pressure, allocating and collapsing should
optimally be aggressive even on very sparse regions.

On the flipside, if there is memory pressure, TLB benefits are very
quickly drowned out by faults and paging events. And I mean real
memory pressure. If all that's happening is that somebody is streaming
through filesystem data, the optimal behavior is still to be greedy.

Another consideration is that if we need to break large folios, we
should start with colder ones that provide less benefit, and defer the
splitting of hotter ones as long as possible.

Maybe a good direction would be to move splitting out of the shrinker
and tie it to the (refault-aware) anon reclaim. And then instead of a
fixed population threshold, collapse on a pressure gradient that
starts with "no pressure/thrashing and at least two base pages in THP
a region" and ends with "reclaim is splitting everything, back off".

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 13:37     ` Johannes Weiner
@ 2025-09-12 13:46       ` David Hildenbrand
  2025-09-12 14:01         ` Lorenzo Stoakes
                           ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 13:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh, pfalcato

On 12.09.25 15:37, Johannes Weiner wrote:
> On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>> The following series provides khugepaged with the capability to collapse
>>>> anonymous memory regions to mTHPs.
>>>>
>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>>> range. The restriction on max_ptes_none is removed during the scan, to make
>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>>>> by the attempted collapse order to determine how full a mTHP must be to be
>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
>>>> to be enabled.
>>>>
>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>> constantly promoting mTHPs to the next available size, which would occur
>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>> promotion condition on subsequent scans.
>>>
>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>>>
>>
>> I am all for not adding any more ugliness on top of all the ugliness we
>> added in the past.
>>
>> I will soon propose deprecating that parameter in favor of something
>> that makes a bit more sense.
>>
>> In essence, we'll likely have an "eagerness" parameter that ranges from
>> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
>> not all is populated".
>>
>> In between we will have more flexibility on how to set these values.
>>
>> Likely 9 will be around 50% to not even motivate the user to set
>> something that does not make sense (creep).
> 
> One observation we've had from production experiments is that the
> optimal number here isn't static. If you have plenty of memory, then
> even very sparse THPs are beneficial.

Exactly.

And willy suggested something like "eagerness" similar to "swappiness" 
that gives us more flexibility when implementing it, including 
dynamically adjusting the values in the future.

> 
> An extreme example: if all your THPs have 2/512 pages populated,
> that's still cutting TLB pressure in half!

IIRC, you create more pressure on the huge entries, where you might have 
fewer TLB entries :) But yes, there can be cases where it is beneficial, 
if there is absolutely no memory pressure.

> 
> So in the absence of memory pressure, allocating and collapsing should
> optimally be aggressive even on very sparse regions.

Yes, we discussed that as well in the THP cabal.

It's very similar to max_ptes_swap: that parameter should not 
exist. If there is no memory pressure we can just swap it in. If there 
is memory pressure we probably would not want to swap in much.

> 
> On the flipside, if there is memory pressure, TLB benefits are very
> quickly drowned out by faults and paging events. And I mean real
> memory pressure. If all that's happening is that somebody is streaming
> through filesystem data, the optimal behavior is still to be greedy.
> 
> Another consideration is that if we need to break large folios, we
> should start with colder ones that provide less benefit, and defer the
> splitting of hotter ones as long as possible.

Yes, we discussed that as well: there is no QoS right now, which is 
rather suboptimal.

> 
> Maybe a good direction would be to move splitting out of the shrinker
> and tie it to the (refault-aware) anon reclaim. And then instead of a
> fixed population threshold, collapse on a pressure gradient that
> starts with "no pressure/thrashing and at least two base pages in a
> THP region" and ends with "reclaim is splitting everything, back off".

I agree, but have to think further about how that could work in practice.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 12:19 ` Kiryl Shutsemau
  2025-09-12 12:25   ` David Hildenbrand
@ 2025-09-12 13:47   ` David Hildenbrand
  2025-09-12 14:28     ` David Hildenbrand
  2025-09-12 23:35     ` Nico Pache
  1 sibling, 2 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 13:47 UTC (permalink / raw)
  To: Kiryl Shutsemau, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 12.09.25 14:19, Kiryl Shutsemau wrote:
> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>> <snip>
>>
>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>> constantly promoting mTHPs to the next available size, which would occur
>> because a collapse introduces more non-zero pages that would satisfy the
>> promotion condition on subsequent scans.
> 
> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.

BTW, I thought further about this and I agree: if we count zero-filled 
pages towards none_or_zero, we can avoid the "creep" problem.

The scanning-for-zero part is rather nasty, though.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 13:46       ` David Hildenbrand
@ 2025-09-12 14:01         ` Lorenzo Stoakes
  2025-09-12 15:35           ` Pedro Falcato
  2025-09-12 15:15         ` Pedro Falcato
  2025-09-15 13:43         ` Johannes Weiner
  2 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 14:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc,
	linux-kernel, linux-trace-kernel, ziy, baolin.wang, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh, pfalcato

On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
> On 12.09.25 15:37, Johannes Weiner wrote:
> > On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > > <snip>
> > > > >
> > > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > > constantly promoting mTHPs to the next available size, which would occur
> > > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > > promotion condition on subsequent scans.
> > > >
> > > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > > >
> > >
> > > I am all for not adding any more ugliness on top of all the ugliness we
> > > added in the past.
> > >
> > > I will soon propose deprecating that parameter in favor of something
> > > that makes a bit more sense.
> > >
> > > In essence, we'll likely have an "eagerness" parameter that ranges from
> > > 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> > > not all is populated".
> > >
> > > In between we will have more flexibility on how to set these values.
> > >
> > > Likely 9 will be around 50% to not even motivate the user to set
> > > something that does not make sense (creep).
> >
> > One observation we've had from production experiments is that the
> > optimal number here isn't static. If you have plenty of memory, then
> > even very sparse THPs are beneficial.
>
> Exactly.
>
> And willy suggested something like "eagerness" similar to "swappiness" that
> gives us more flexibility when implementing it, including dynamically
> adjusting the values in the future.

I like the idea of abstracting it like this, and - in a rare case of kernel
developer agreement (esp. around naming :) - Matthew, David and I rather
loved referring to this as 'eagerness' here :)

The great benefit in relation to dynamic state is that we can simply treat this
as an _abstract_ thing. I.e. 'how eager are we to establish THPs, trading off
against memory pressure and higher order folio resource consumption'.

And then we can decide how precisely that is implemented in practice - and a
sensible approach would indeed be to differentiate between scenarios where we
might be more willing to chomp up memory vs. those we are not.

This also aligns nicely with the 'grand glorious future' we all dream of (don't
we??) in THP where things are automated as much as possible and the _kernel
decides_ what's best as far as is possible.

As with swappiness, it is essentially a 'hint' to us in abstract terms rather
than simply exposing an internal kernel parameter.

(Credit to Matthew for making this abstraction suggestion in the THP cabal
meeting by the way!)

>
> >
> > An extreme example: if all your THPs have 2/512 pages populated,
> > that's still cutting TLB pressure in half!
>
> IIRC, you create more pressure on the huge entries, where you might have
> less TLB entries :) But yes, there can be cases where it is beneficial, if
> there is absolutely no memory pressure.
>
> >
> > So in the absence of memory pressure, allocating and collapsing should
> > optimally be aggressive even on very sparse regions.
>
> Yes, we discussed that as well in the THP cabal.
>
> It's very similar to max_ptes_swap: that parameter should not exist.
> If there is no memory pressure we can just swap it in. If there is memory
> pressure we probably would not want to swap in much.

Yes, but at least an eagerness parameter gets us closer to this ideal.

Of course, I agree that max_ptes_none should simply never have been exposed like
this. It is emblematic of a 'just shove a parameter into a tunable/sysfs and let
the user decide' approach you see in the kernel sometimes.

This is problematic as users have no earthly idea how to set the parameter (most
likely never touch it), and only start fiddling should issues arise and it looks
like a viable solution of some kind.

The problem is users usually lack a great deal of context the kernel has, and
may make incorrect decisions that work in one situation but not another.

TL;DR - this kind of interface is just lazy and we have to assess these kinds of
tunables based on the actual RoI + understanding from the user's perspective.

>
> >
> > On the flipside, if there is memory pressure, TLB benefits are very
> > quickly drowned out by faults and paging events. And I mean real
> > memory pressure. If all that's happening is that somebody is streaming
> > through filesystem data, the optimal behavior is still to be greedy.
> >
> > Another consideration is that if we need to break large folios, we
> > should start with colder ones that provide less benefit, and defer the
> > splitting of hotter ones as long as possible.
>
> Yes, we discussed that as well: there is no QoS right now, which is rather
> suboptimal.

It's also kinda funny that the max_ptes_none default is 511 right now so pretty
damn eager. Which might be part of the reason people often observe THP chomping
through resources...

>
> >
> > Maybe a good direction would be to move splitting out of the shrinker
> > and tie it to the (refault-aware) anon reclaim. And then instead of a
> > fixed population threshold, collapse on a pressure gradient that
> > starts with "no pressure/thrashing and at least two base pages in a
> > THP region" and ends with "reclaim is splitting everything, back off".
>
> I agree, but have to think further about how that could work in practice.

That'd be lovely actually!

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 13:47   ` David Hildenbrand
@ 2025-09-12 14:28     ` David Hildenbrand
  2025-09-12 14:35       ` Kiryl Shutsemau
  2025-09-12 23:35     ` Nico Pache
  1 sibling, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 14:28 UTC (permalink / raw)
  To: Kiryl Shutsemau, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 12.09.25 15:47, David Hildenbrand wrote:
> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>> <snip>
>>>
>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>> constantly promoting mTHPs to the next available size, which would occur
>>> because a collapse introduces more non-zero pages that would satisfy the
>>> promotion condition on subsequent scans.
>>
>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> 
> BTW, I thought further about this and I agree: if we count zero-filled
> pages towards none_or_zero, we can avoid the "creep" problem.
> 
> The scanning-for-zero part is rather nasty, though.

Aaand, thinking again from the other direction, this would mean that, 
just because pages became zero after some time, we would no longer 
collapse because none_or_zero would then be higher. Hm ....

How I hate all of this so very very much :)

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 14:28     ` David Hildenbrand
@ 2025-09-12 14:35       ` Kiryl Shutsemau
  2025-09-12 14:56         ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-12 14:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
> On 12.09.25 15:47, David Hildenbrand wrote:
> > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > <snip>
> > > > 
> > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > constantly promoting mTHPs to the next available size, which would occur
> > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > promotion condition on subsequent scans.
> > > 
> > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > 
> > BTW, I thought further about this and I agree: if we count zero-filled
> > pages towards none_or_zero, we can avoid the "creep" problem.
> > 
> > The scanning-for-zero part is rather nasty, though.
> 
> Aaand, thinking again from the other direction, this would mean that,
> just because pages became zero after some time, we would no longer
> collapse because none_or_zero would then be higher. Hm ....
> 
> How I hate all of this so very very much :)

This is not new. The shrinker has the same problem: it cannot distinguish
a hot 4k that happened to be zero from a 4k that is there just
because we faulted in 2M at a time.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 14:35       ` Kiryl Shutsemau
@ 2025-09-12 14:56         ` David Hildenbrand
  2025-09-12 15:41           ` Kiryl Shutsemau
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 14:56 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 12.09.25 16:35, Kiryl Shutsemau wrote:
> On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
>> On 12.09.25 15:47, David Hildenbrand wrote:
>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>>> <snip>
>>>>>
>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>>> constantly promoting mTHPs to the next available size, which would occur
>>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>>> promotion condition on subsequent scans.
>>>>
>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>>>
>>> BTW, I thought further about this and I agree: if we count zero-filled
>>> pages towards none_or_zero, we can avoid the "creep" problem.
>>>
>>> The scanning-for-zero part is rather nasty, though.
>>
>> Aaand, thinking again from the other direction, this would mean that,
>> just because pages became zero after some time, we would no longer
>> collapse because none_or_zero would then be higher. Hm ....
>>
>> How I hate all of this so very very much :)
> 
> This is not new. The shrinker has the same problem: it cannot distinguish
> a hot 4k that happened to be zero from a 4k that is there just
> because we faulted in 2M at a time.

Right. And so far that problem is isolated to the shrinker.

To me so far "none_or_zero" really meant "will I consume more memory 
when collapsing". That's not true for zero-filled pages, obviously.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 13:46       ` David Hildenbrand
  2025-09-12 14:01         ` Lorenzo Stoakes
@ 2025-09-12 15:15         ` Pedro Falcato
  2025-09-12 15:38           ` Kiryl Shutsemau
  2025-09-15 13:43         ` Johannes Weiner
  2 siblings, 1 reply; 79+ messages in thread
From: Pedro Falcato @ 2025-09-12 15:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc,
	linux-kernel, linux-trace-kernel, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh

On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
> On 12.09.25 15:37, Johannes Weiner wrote:
> > On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > > <snip>
> > > > > 
> > > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > > constantly promoting mTHPs to the next available size, which would occur
> > > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > > promotion condition on subsequent scans.
> > > > 
> > > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > > > 
> > > 
> > > I am all for not adding any more ugliness on top of all the ugliness we
> > > added in the past.
> > > 
> > > I will soon propose deprecating that parameter in favor of something
> > > that makes a bit more sense.
> > > 
> > > In essence, we'll likely have an "eagerness" parameter that ranges from
> > > 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> > > not all is populated".
> > > 
> > > In between we will have more flexibility on how to set these values.
> > > 
> > > Likely 9 will be around 50% to not even motivate the user to set
> > > something that does not make sense (creep).
> > 
> > One observation we've had from production experiments is that the
> > optimal number here isn't static. If you have plenty of memory, then
> > even very sparse THPs are beneficial.
> 
> Exactly.
> 
> And willy suggested something like "eagerness" similar to "swappiness" that
> gives us more flexibility when implementing it, including dynamically
> adjusting the values in the future.
>

Ideally we would be able to also apply this to the page faulting paths.
In many cases, there's no good reason to create a THP on the first fault...

> > 
> > An extreme example: if all your THPs have 2/512 pages populated,
> > that's still cutting TLB pressure in half!
> 
> IIRC, you create more pressure on the huge entries, where you might have
> less TLB entries :) But yes, there can be cases where it is beneficial, if
> there is absolutely no memory pressure.
>

Correct, but it depends on the microarchitecture. For modern x86_64 AMD, it
happens that the L1 TLB entries are shared between 4K/2M/1G. This was not
(is not?) the case for Intel, where e.g. back on Kaby Lake, you had separate
entries for 4K/2MB/1GB.

Maybe in the Great Glorious Future (how many of those do we have?!) it would
be a good idea to take these kinds of things into account. Just because we can
map a THP doesn't mean we should.

Shower thought: it might be in these cases especially where the FreeBSD
reservation system comes in handy - best effort allocating a THP, but not
actually mapping it as such until you really _know_ it is hot - and until
then, memory reclaim can just break your THP down if it really needs to.

-- 
Pedro

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 14:01         ` Lorenzo Stoakes
@ 2025-09-12 15:35           ` Pedro Falcato
  2025-09-12 15:45             ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: Pedro Falcato @ 2025-09-12 15:35 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Johannes Weiner, Kiryl Shutsemau, Nico Pache,
	linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh

On Fri, Sep 12, 2025 at 03:01:02PM +0100, Lorenzo Stoakes wrote:
> On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
> > <snip>
> > Exactly.
> >
> > And willy suggested something like "eagerness" similar to "swappiness" that
> > gives us more flexibility when implementing it, including dynamically
> > adjusting the values in the future.
> 
> I like the idea of abstracting it like this, and - in a rare case of kernel
> developer agreement (esp. around naming :) - both Matthew, David and I rather
> loved referring to this as 'eagerness' here :)
> 
> The great benefit in relation to dynamic state is that we can simply treat this
> as an _abstract_ thing. I.e. 'how eager are we to establish THPs, trading off
> against memory pressure and higher order folio resource consumption'.
> 
> And then we can decide how precisely that is implemented in practice - and a
> sensible approach would indeed be to differentiate between scenarios where we
> might be more willing to chomp up memory vs. those we are not.
> 
> This also aligns nicely with the 'grand glorious future' we all dream of (don't
> we??) in THP where things are automated as much as possible and the _kernel
> decides_ what's best as far as is possible.
> 
> As with swappiness, it is essentially a 'hint' to us in abstract terms rather
> than simply exposing an internal kernel parameter.
> 
> (Credit to Matthew for making this abstraction suggestion in the THP cabal
> meeting by the way!)
> 
> >
> > >
> > > An extreme example: if all your THPs have 2/512 pages populated,
> > > that's still cutting TLB pressure in half!
> >
> > IIRC, you create more pressure on the huge entries, where you might have
> > fewer TLB entries :) But yes, there can be cases where it is beneficial, if
> > there is absolutely no memory pressure.
> >
> > >
> > > So in the absence of memory pressure, allocating and collapsing should
> > > optimally be aggressive even on very sparse regions.
> >
> > Yes, we discussed that as well in the THP cabal.
> >
> > It's very similar to the max_ptes_swapped: that parameter should not exist.
> > If there is no memory pressure we can just swap it in. If there is memory
> > pressure we probably would not want to swap in much.
> 
> Yes, but at least an eagerness parameter gets us closer to this ideal.
> 
> Of course, I agree that max_ptes_none should simply never have been exposed like
> this. It is emblematic of a 'just shove a parameter into a tunable/sysfs and let
> the user decide' approach you see in the kernel sometimes.
> 
> This is problematic as users have no earthly idea how to set the parameter (most
> likely never touch it), and only start fiddling should issues arise and it looks
> like a viable solution of some kind.
> 
> The problem is users usually lack a great deal of context the kernel has, and
> may make incorrect decisions that work in one situation but not another.

Note that in this case we really don't have much context. We can trivially do
"check what number of ptes are mapped", but not anything much fancier. You can
also attempt to look at A bits (and/or check PG_referenced or PG_active). But
currently there's really nothing set up to collect this information on a timely
basis, and for anon memory (AFAIK) you only gauge this on reclaim, _if_ you
find the page itself.

The good news is that there are 3 or 4 separate movements for getting page
"temperature" information with their own special infra and daemons, for their
own special little features.

> 
> TL;DR - this kind of interface is just lazy and we have to assess these kinds of
> tunables based on the actual RoI + understanding from the user's perspective.

Fully agreed.

-- 
Pedro

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:15         ` Pedro Falcato
@ 2025-09-12 15:38           ` Kiryl Shutsemau
  2025-09-12 15:43             ` David Hildenbrand
  2025-09-12 15:44             ` Kiryl Shutsemau
  0 siblings, 2 replies; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-12 15:38 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand, Johannes Weiner, Nico Pache, linux-mm,
	linux-doc, linux-kernel, linux-trace-kernel, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh

On Fri, Sep 12, 2025 at 04:15:23PM +0100, Pedro Falcato wrote:
> On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
> > On 12.09.25 15:37, Johannes Weiner wrote:
> > > On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
> > > > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > > > The following series provides khugepaged with the capability to collapse
> > > > > > anonymous memory regions to mTHPs.
> > > > > > 
> > > > > > To achieve this we generalize the khugepaged functions to no longer depend
> > > > > > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> > > > > > pages that are occupied (!none/zero). After the PMD scan is done, we do
> > > > > > binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> > > > > > range. The restriction on max_ptes_none is removed during the scan, to make
> > > > > > sure we account for the whole PMD range. When no mTHP size is enabled, the
> > > > > > legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> > > > > > by the attempted collapse order to determine how full a mTHP must be to be
> > > > > > eligible for the collapse to occur. If a mTHP collapse is attempted, but
> > > > > > contains swapped out, or shared pages, we don't perform the collapse. It is
> > > > > > now also possible to collapse to mTHPs without requiring the PMD THP size
> > > > > > to be enabled.
> > > > > > 
> > > > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > > > constantly promoting mTHPs to the next available size, which would occur
> > > > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > > > promotion condition on subsequent scans.
> > > > > 
> > > > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > > > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > > > > 
> > > > 
> > > > I am all for not adding any more ugliness on top of all the ugliness we
> > > > added in the past.
> > > > 
> > > > I will soon propose deprecating that parameter in favor of something
> > > > that makes a bit more sense.
> > > > 
> > > > In essence, we'll likely have an "eagerness" parameter that ranges from
> > > > 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> > > > not all is populated".
> > > > 
> > > > In between we will have more flexibility on how to set these values.
> > > > 
> > > > Likely 9 will be around 50% to not even motivate the user to set
> > > > something that does not make sense (creep).
> > > 
> > > One observation we've had from production experiments is that the
> > > optimal number here isn't static. If you have plenty of memory, then
> > > even very sparse THPs are beneficial.
> > 
> > Exactly.
> > 
> > And willy suggested something like "eagerness" similar to "swappiness" that
> > gives us more flexibility when implementing it, including dynamically
> > adjusting the values in the future.
> >
> 
> Ideally we would be able to also apply this to the page faulting paths.
> In many cases, there's no good reason to create a THP on the first fault...
> 
> > > 
> > > An extreme example: if all your THPs have 2/512 pages populated,
> > > that's still cutting TLB pressure in half!
> > 
> > IIRC, you create more pressure on the huge entries, where you might have
> > fewer TLB entries :) But yes, there can be cases where it is beneficial, if
> > there is absolutely no memory pressure.
> >
> 
> Correct, but it depends on the microarchitecture. For modern x86_64 AMD, it
> happens that the L1 TLB entries are shared between 4K/2M/1G. This was not
> (is not?) the case for Intel, where e.g. back on Kaby Lake, you had separate
> entries for 4K/2MB/1GB.

On Intel secondary TLB is shared between 4k and 2M. L2 TLB for 1G is
separate.

> Maybe in the Great Glorious Future (how many of those do we have?!) it would
> be a good idea to take these kinds of things into account. Just because we can
> map a THP, doesn't mean we should.
> 
> Shower thought: it might be in these cases especially where the FreeBSD
> reservation system comes in handy - best effort allocating a THP, but not
> actually mapping it as such until you really _know_ it is hot - and until
> then, memory reclaim can just break your THP down if it really needs to.

This is just silly. All downsides without benefit until maybe later. And
for short-lived processes the "later" never comes.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 14:56         ` David Hildenbrand
@ 2025-09-12 15:41           ` Kiryl Shutsemau
  2025-09-12 15:45             ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-12 15:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Fri, Sep 12, 2025 at 04:56:47PM +0200, David Hildenbrand wrote:
> On 12.09.25 16:35, Kiryl Shutsemau wrote:
> > On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 15:47, David Hildenbrand wrote:
> > > > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > > > The following series provides khugepaged with the capability to collapse
> > > > > > anonymous memory regions to mTHPs.
> > > > > > 
> > > > > > To achieve this we generalize the khugepaged functions to no longer depend
> > > > > > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> > > > > > pages that are occupied (!none/zero). After the PMD scan is done, we do
> > > > > > binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> > > > > > range. The restriction on max_ptes_none is removed during the scan, to make
> > > > > > sure we account for the whole PMD range. When no mTHP size is enabled, the
> > > > > > legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> > > > > > by the attempted collapse order to determine how full a mTHP must be to be
> > > > > > eligible for the collapse to occur. If a mTHP collapse is attempted, but
> > > > > > contains swapped out, or shared pages, we don't perform the collapse. It is
> > > > > > now also possible to collapse to mTHPs without requiring the PMD THP size
> > > > > > to be enabled.
> > > > > > 
> > > > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > > > constantly promoting mTHPs to the next available size, which would occur
> > > > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > > > promotion condition on subsequent scans.
> > > > > 
> > > > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > > > all-zero 4k pages as none_or_zero? It mirrors the logic of the shrinker.
> > > > 
> > > > BTW, I thought further about this and I agree: if we count zero-filled
> > > > pages towards none_or_zero we can avoid the "creep" problem.
> > > > 
> > > > The scanning-for-zero part is rather nasty, though.
> > > 
> > > Aaand, thinking again from the other direction, this would mean that just
> > > because pages became zero after some time we would no longer collapse
> > > because none_or_zero would then be higher. Hm ....
> > > 
> > > How I hate all of this so very very much :)
> > 
> > This is not new. The shrinker has the same problem: it cannot distinguish
> > a hot 4k page that happens to be zero from a 4k page that is there just
> > because we faulted in 2M at a time.
> 
> Right. And so far that problem is isolated to the shrinker.
> 
> To me so far "none_or_zero" really meant "will I consume more memory when
> collapsing". That's not true for zero-filled pages, obviously.

Well, KSM can reclaim this zero-filled memory until we collapse it.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:38           ` Kiryl Shutsemau
@ 2025-09-12 15:43             ` David Hildenbrand
  2025-09-12 15:44             ` Kiryl Shutsemau
  1 sibling, 0 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 15:43 UTC (permalink / raw)
  To: Kiryl Shutsemau, Pedro Falcato
  Cc: Johannes Weiner, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh

>> Maybe in the Great Glorious Future (how many of those do we have?!) it would
>> be a good idea to take these kinds of things into account. Just because we can
>> map a THP, doesn't mean we should.
>>
>> Shower thought: it might be in these cases especially where the FreeBSD
>> reservation system comes in handy - best effort allocating a THP, but not
>> actually mapping it as such until you really _know_ it is hot - and until
>> then, memory reclaim can just break your THP down if it really needs to.
> 
> This is just silly. All downsides without benefit until maybe later. And
> for short-lived processes the "later" never comes.

Right, that's why I've also been arguing against that (we discussed it 
recently in the THP cabal).

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:38           ` Kiryl Shutsemau
  2025-09-12 15:43             ` David Hildenbrand
@ 2025-09-12 15:44             ` Kiryl Shutsemau
  2025-09-12 15:51               ` David Hildenbrand
  1 sibling, 1 reply; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-12 15:44 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand, Johannes Weiner, Nico Pache, linux-mm,
	linux-doc, linux-kernel, linux-trace-kernel, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh

On Fri, Sep 12, 2025 at 04:39:02PM +0100, Kiryl Shutsemau wrote:
> > Shower thought: it might be in these cases especially where the FreeBSD
> > reservation system comes in handy - best effort allocating a THP, but not
> > actually mapping it as such until you really _know_ it is hot - and until
> > then, memory reclaim can just break your THP down if it really needs to.
> 
> This is just silly. All downsides without benefit until maybe later. And
> for short-lived processes the "later" never comes.

The right way out is to get better info on access patterns from hardware.
For instance, if we move the access bit out of the page table entry and make it
independent of the actual mapping size, that would give us a much better
view of what is actually going on.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:41           ` Kiryl Shutsemau
@ 2025-09-12 15:45             ` David Hildenbrand
  2025-09-12 15:51               ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 15:45 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 12.09.25 17:41, Kiryl Shutsemau wrote:
> On Fri, Sep 12, 2025 at 04:56:47PM +0200, David Hildenbrand wrote:
>> On 12.09.25 16:35, Kiryl Shutsemau wrote:
>>> On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
>>>> On 12.09.25 15:47, David Hildenbrand wrote:
>>>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>>>>> The following series provides khugepaged with the capability to collapse
>>>>>>> anonymous memory regions to mTHPs.
>>>>>>>
>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>>>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>>>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
>>>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>>>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>>>>>>> by the attempted collapse order to determine how full a mTHP must be to be
>>>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>>>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
>>>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
>>>>>>> to be enabled.
>>>>>>>
>>>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>>>>> constantly promoting mTHPs to the next available size, which would occur
>>>>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>>>>> promotion condition on subsequent scans.
>>>>>>
>>>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>>>>> all-zero 4k pages as none_or_zero? It mirrors the logic of the shrinker.
>>>>>
>>>>> BTW, I thought further about this and I agree: if we count zero-filled
>>>>> pages towards none_or_zero we can avoid the "creep" problem.
>>>>>
>>>>> The scanning-for-zero part is rather nasty, though.
>>>>
>>>> Aaand, thinking again from the other direction, this would mean that just
>>>> because pages became zero after some time we would no longer collapse
>>>> because none_or_zero would then be higher. Hm ....
>>>>
>>>> How I hate all of this so very very much :)
>>>
>>> This is not new. The shrinker has the same problem: it cannot distinguish
>>> a hot 4k page that happens to be zero from a 4k page that is there just
>>> because we faulted in 2M at a time.
>>
>> Right. And so far that problem is isolated to the shrinker.
>>
>> To me so far "none_or_zero" really meant "will I consume more memory when
>> collapsing". That's not true for zero-filled pages, obviously.
> 
> Well, KSM can reclaim this zero-filled memory until we collapse it.

KSM is used so rarely (for good reasons) that I would never ever build 
an argument based on its existence :P

But yes: during the very first shrinker discussion I raised that KSM can 
do the same thing. Obviously that was not good enough.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:35           ` Pedro Falcato
@ 2025-09-12 15:45             ` Lorenzo Stoakes
  0 siblings, 0 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 15:45 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand, Johannes Weiner, Kiryl Shutsemau, Nico Pache,
	linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh

On Fri, Sep 12, 2025 at 04:35:44PM +0100, Pedro Falcato wrote:
> On Fri, Sep 12, 2025 at 03:01:02PM +0100, Lorenzo Stoakes wrote:
> > Yes, but at least an eagerness parameter gets us closer to this ideal.
> >
> > Of course, I agree that max_ptes_none should simply never have been exposed like
> > this. It is emblematic of a 'just shove a parameter into a tunable/sysfs and let
> > the user decide' approach you see in the kernel sometimes.
> >
> > This is problematic as users have no earthly idea how to set the parameter (most
> > likely never touch it), and only start fiddling should issues arise and it looks
> > like a viable solution of some kind.
> >
> > The problem is users usually lack a great deal of context the kernel has, and
> > may make incorrect decisions that work in one situation but not another.
>
> Note that in this case we really don't have much context. We can trivially do
> "check what number of ptes are mapped", but not anything much fancier. You can

I mean we could in theory change where we determine things, for instance doing
things in reclaim as Kiryl alluded to.

We _potentially_ have more to work with.

>
> The good news is that there are 3 or 4 separate movements for getting page
> "temperature" information with their own special infra and daemons, for their
> own special little features.

Right.

>
> >
> > TL;DR - this kind of interface is just lazy and we have to assess these kinds of
> > tunables based on the actual RoI + understanding from the user's perspective.
>
> Fully agreed.
>
> --
> Pedro

My overall point, FWIW, is that a synthetic heuristic tunable works better here
than one that maps onto an internal value that we then have no control over.

Or 'I agree with David' IOW :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:45             ` David Hildenbrand
@ 2025-09-12 15:51               ` Lorenzo Stoakes
  2025-09-12 17:53                 ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 15:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Fri, Sep 12, 2025 at 05:45:26PM +0200, David Hildenbrand wrote:
> On 12.09.25 17:41, Kiryl Shutsemau wrote:
> > On Fri, Sep 12, 2025 at 04:56:47PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 16:35, Kiryl Shutsemau wrote:
> > > > On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
> > > > > On 12.09.25 15:47, David Hildenbrand wrote:
> > > > > > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > > > > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > > > > > The following series provides khugepaged with the capability to collapse
> > > > > > > > anonymous memory regions to mTHPs.
> > > > > > > >
> > > > > > > > To achieve this we generalize the khugepaged functions to no longer depend
> > > > > > > > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> > > > > > > > pages that are occupied (!none/zero). After the PMD scan is done, we do
> > > > > > > > binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> > > > > > > > range. The restriction on max_ptes_none is removed during the scan, to make
> > > > > > > > sure we account for the whole PMD range. When no mTHP size is enabled, the
> > > > > > > > legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> > > > > > > > by the attempted collapse order to determine how full a mTHP must be to be
> > > > > > > > eligible for the collapse to occur. If a mTHP collapse is attempted, but
> > > > > > > > contains swapped out, or shared pages, we don't perform the collapse. It is
> > > > > > > > now also possible to collapse to mTHPs without requiring the PMD THP size
> > > > > > > > to be enabled.
> > > > > > > >
> > > > > > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > > > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > > > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > > > > > constantly promoting mTHPs to the next available size, which would occur
> > > > > > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > > > > > promotion condition on subsequent scans.
> > > > > > >
> > > > > > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > > > > > all-zero 4k pages as none_or_zero? It mirrors the logic of the shrinker.
> > > > > >
> > > > > > BTW, I thought further about this and I agree: if we count zero-filled
> > > > > > pages towards none_or_zero we can avoid the "creep" problem.
> > > > > >
> > > > > > The scanning-for-zero part is rather nasty, though.
> > > > >
> > > > > Aaand, thinking again from the other direction, this would mean that just
> > > > > because pages became zero after some time we would no longer collapse
> > > > > because none_or_zero would then be higher. Hm ....
> > > > >
> > > > > How I hate all of this so very very much :)
> > > >
> > > > This is not new. The shrinker has the same problem: it cannot distinguish
> > > > a hot 4k page that happens to be zero from a 4k page that is there just
> > > > because we faulted in 2M at a time.
> > >
> > > Right. And so far that problem is isolated to the shrinker.
> > >
> > > To me so far "none_or_zero" really meant "will I consume more memory when
> > > collapsing". That's not true for zero-filled pages, obviously.
> >
> > Well, KSM can reclaim this zero-filled memory until we collapse it.
>
> KSM is used so rarely (for good reasons) that I would never ever build an
> argument based on its existence :P
>
> But yes: during the very first shrinker discussion I raised that KSM can do
> the same thing. Obviously that was not good enough.
>
> --
> Cheers
>
> David / dhildenb
>

With all this stuff said, do we have an actual plan for what we intend to do
_now_?

As Nico has implemented a basic solution here that we all seem to agree is not
what we want.

Without needing special new hardware or major reworks, what would this parameter
look like?

What would the heuristics be? What about the eagerness scales?

I'm but a simple kernel developer, and interested in simple pragmatic stuff :)
do you have a plan right now David?

Maybe we can start with something simple like a rough percentage per eagerness
entry that then gets scaled based on utilisation?

I'd like us to ideally have something to suggest to Nico before the next respin.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:44             ` Kiryl Shutsemau
@ 2025-09-12 15:51               ` David Hildenbrand
  0 siblings, 0 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 15:51 UTC (permalink / raw)
  To: Kiryl Shutsemau, Pedro Falcato
  Cc: Johannes Weiner, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh

On 12.09.25 17:44, Kiryl Shutsemau wrote:
> On Fri, Sep 12, 2025 at 04:39:02PM +0100, Kiryl Shutsemau wrote:
>>> Shower thought: it might be in these cases especially where the FreeBSD
>>> reservation system comes in handy - best effort allocating a THP, but not
>>> actually mapping it as such until you really _know_ it is hot - and until
>>> then, memory reclaim can just break your THP down if it really needs to.
>>
>> This is just silly. All downsides without benefit until maybe later. And
>> for short-lived processes the "later" never comes.
> 
> The right way out is to get better info on access patterns from hardware.
> For instance, if we move the access bit out of the page table entry and make it
> independent of the actual mapping size, that would give us a much better
> view of what is actually going on.

We discussed this a couple of times in the past; the problem is that it
does not really help anybody if only a handful of pieces of hardware
provide such a feature.

Long long long term I agree, short term we cannot really build core 
infrastructure around any of that.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 15:51               ` Lorenzo Stoakes
@ 2025-09-12 17:53                 ` David Hildenbrand
  2025-09-12 18:21                   ` Lorenzo Stoakes
  2025-09-13  0:18                   ` Nico Pache
  0 siblings, 2 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-12 17:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 12.09.25 17:51, Lorenzo Stoakes wrote:
> On Fri, Sep 12, 2025 at 05:45:26PM +0200, David Hildenbrand wrote:
>> On 12.09.25 17:41, Kiryl Shutsemau wrote:
>>> On Fri, Sep 12, 2025 at 04:56:47PM +0200, David Hildenbrand wrote:
>>>> On 12.09.25 16:35, Kiryl Shutsemau wrote:
>>>>> On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
>>>>>> On 12.09.25 15:47, David Hildenbrand wrote:
>>>>>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>>>>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>>>>>>> The following series provides khugepaged with the capability to collapse
>>>>>>>>> anonymous memory regions to mTHPs.
>>>>>>>>>
>>>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>>>>>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>>>>>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>>>>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
>>>>>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>>>>>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>>>>>>>>> by the attempted collapse order to determine how full a mTHP must be to be
>>>>>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>>>>>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
>>>>>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
>>>>>>>>> to be enabled.
>>>>>>>>>
>>>>>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>>>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>>>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>>>>>>> constantly promoting mTHPs to the next available size, which would occur
>>>>>>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>>>>>>> promotion condition on subsequent scans.
>>>>>>>>
>>>>>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>>>>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>>>>>>>
>>>>>>> BTW, I thought further about this and I agree: if we count zero-filled
>>>>>>> pages towards none_or_zero then we can avoid the "creep" problem.
>>>>>>>
>>>>>>> The scanning-for-zero part is rather nasty, though.
>>>>>>
>>>>>> Aaand, thinking again from the other direction, this would mean that just
>>>>>> because pages became zero after some time that we would no longer collapse
>>>>>> because none_or_zero would then be higher. Hm ....
>>>>>>
>>>>>> How I hate all of this so very very much :)
>>>>>
>>>>> This is not new. Shrinker has the same problem: it cannot distinguish
>>>>> between hot 4k that happened to be zero from the 4k that is there just
>>>>> because we faulted in 2M at a time.
>>>>
>>>> Right. And so far that problem is isolated to the shrinker.
>>>>
>>>> To me so far "none_or_zero" really meant "will I consume more memory when
>>>> collapsing". That's not true for zero-filled pages, obviously.
>>>
>>> Well, KSM can reclaim these zero-filled memory until we collapse it.
>>
>> KSM is used so rarely (for good reasons) that I would never ever build an
>> argument based on its existence :P
>>
>> But yes: during the very first shrinker discussion I raised that KSM can do
>> the same thing. Obviously that was not good enough.
>>
>> --
>> Cheers
>>
>> David / dhildenb
>>
> 
> With all this stuff said, do we have an actual plan for what we intend to do
> _now_?

Oh no, no I have to use my brain and it's Friday evening.

> 
> As Nico has implemented a basic solution here that we all seem to agree is not
> what we want.
> 
> Without needing special new hardware or major reworks, what would this parameter
> look like?
> 
> What would the heuristics be? What about the eagerness scales?
> 
> I'm but a simple kernel developer, 

:)

and interested in simple pragmatic stuff :)
> do you have a plan right now David?

Ehm, if you ask me that way ...

> 
> Maybe we can start with something simple like a rough percentage per eagerness
> entry that then gets scaled based on utilisation?

... I think we should probably:

1) Start with something very simple for mTHP that doesn't lock us into any particular direction.

2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well

3) Improve that "eagerness" algorithm using a dynamic scale or #whatever

4) Solve world peace and world hunger

5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever


I maintain my initial position that just using

max_ptes_none == 511 -> collapse mTHP always
max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero

As a starting point is probably simple and best, and likely leaves room for any
changes later.


Of course, we could do what Nico is proposing here, as 1) and change it all later.

It's just when it comes to documenting all that stuff in patch #15 that I feel like
"alright, we shouldn't be doing it longterm like that, so let's not make anybody
depend on any weird behavior here by over-documenting it".

I mean

"
+To prevent "creeping" behavior where collapses continuously promote to larger
+orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
+capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
+that introducing more than half of the pages to be non-zero it will always
+satisfy the eligibility check on the next scan and the region will be collapse.
"

Is just way, way too detailed.

I would just say "The kernel might decide to use a more conservative approach
when collapsing smaller THPs" etc.


Thoughts?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 17:53                 ` David Hildenbrand
@ 2025-09-12 18:21                   ` Lorenzo Stoakes
  2025-09-13  0:28                     ` Nico Pache
  2025-09-15 10:25                     ` David Hildenbrand
  2025-09-13  0:18                   ` Nico Pache
  1 sibling, 2 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 18:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > With all this stuff said, do we have an actual plan for what we intend to do
> > _now_?
>
> Oh no, no I have to use my brain and it's Friday evening.

I apologise :)

>
> >
> > As Nico has implemented a basic solution here that we all seem to agree is not
> > what we want.
> >
> > Without needing special new hardware or major reworks, what would this parameter
> > look like?
> >
> > What would the heuristics be? What about the eagerness scales?
> >
> > I'm but a simple kernel developer,
>
> :)
>
> and interested in simple pragmatic stuff :)
> > do you have a plan right now David?
>
> Ehm, if you ask me that way ...
>
> >
> > Maybe we can start with something simple like a rough percentage per eagerness
> > entry that then gets scaled based on utilisation?
>
> ... I think we should probably:
>
> 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.

Yes.

>
> 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well

Yes I think we're all pretty onboard with that it seems!

>
> 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever

Right, I feel like we could start with some very simple linear thing here and
later maybe refine it?

>
> 4) Solve world peace and world hunger

Yes! That would be pretty great ;)

>
> 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever

I think these are TODOs :)

>
>
> I maintain my initial position that just using
>
> max_ptes_none == 511 -> collapse mTHP always
> max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
>
> As a starting point is probably simple and best, and likely leaves room for any
> changes later.

Yes.

>
>
> Of course, we could do what Nico is proposing here, as 1) and change it all later.

Right.

But that does mean for mTHP we're limited to 256 (or 255, was it?), but I guess
given the 'creep' issue that's sensible.

>
> It's just when it comes to documenting all that stuff in patch #15 that I feel like
> "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> depend on any weird behavior here by over-documenting it".
>
> I mean
>
> "
> +To prevent "creeping" behavior where collapses continuously promote to larger
> +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> +that introducing more than half of the pages to be non-zero it will always
> +satisfy the eligibility check on the next scan and the region will be collapse.
> "
>
> Is just way, way too detailed.
>
> I would just say "The kernel might decide to use a more conservative approach
> when collapsing smaller THPs" etc.
>
>
> Thoughts?

Well I've sort of reviewed oppositely there :) well at least that it needs to be
a hell of a lot clearer (I find that comment really compressed and I just don't
really understand it).

I guess I didn't think about people reading that and relying on it, so maybe we
could alternatively make that succinct.

But I think it'd be better to say something like "mTHP collapse cannot currently
correctly function with half or more of the PTE entries empty, so we cap at just
below this level" in this case.

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function
  2025-09-12 13:35   ` Lorenzo Stoakes
@ 2025-09-12 23:26     ` Nico Pache
  2025-09-15 10:30       ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-12 23:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kas, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 7:36 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Sep 11, 2025 at 09:28:01PM -0600, Nico Pache wrote:
> > The current mechanism for determining mTHP collapse scales the
> > khugepaged_max_ptes_none value based on the target order. This
> > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > is set to a value greater than HPAGE_PMD_NR / 2.
> >
> > With this configuration, a successful collapse to order N will populate
> > enough pages to satisfy the collapse condition on order N+1 on the next
> > scan. This leads to unnecessary work and memory churn.
> >
> > To fix this issue introduce a helper function that caps the max_ptes_none
> > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > the max_ptes_none number by the (PMD_ORDER - target collapse order).
>
> I would say very clearly that this is only in the mTHP case.

ack, I stole most of the verbiage here from other notes I've
previously written, but it can be improved.

>
>
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> Hmm I thought we were going to wait for David to investigate different
> approaches to this?
>
> This is another issue with quickly going to another iteration. Though I do think
> David explicitly said he'd come back with a solution?

Sorry, I thought that was being done in lockstep. The last version was
about a month ago and I had a lot of changes queued up. Now that we
have collapse_max_ptes_none(), David has a much easier entry point to
work off of :)

I think he will still need this groundwork for the solution he is
working on with "eagerness": if 10 -> 511, 9 -> 255, ..., 0 -> 0, it
will still have to do the scaling. Although I believe 0-10 should be
more like 0-5, mapping to 0,32,64,128,255,511.

>
> So I'm not sure why we're seeing this solution here? Unless I'm missing
> something?
>
> > ---
> >  mm/khugepaged.c | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b0ae0b63fc9b..4587f2def5c1 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -468,6 +468,26 @@ void __khugepaged_enter(struct mm_struct *mm)
> >               wake_up_interruptible(&khugepaged_wait);
> >  }
> >
> > +/* Returns the scaled max_ptes_none for a given order.
>
> We don't start comments at the /*, please use a normal comment format like:
ack
>
> /*
>  * xxxx
>  */
>
> > + * Caps the value to HPAGE_PMD_NR/2 - 1 in the case of mTHP collapse to prevent
>
> This is super unclear.
>
> It starts with 'caps the xxx', which seems like you're talking generally.
>
> You should say very clearly 'For PMD allocations we apply the
> khugepaged_max_ptes_none parameter as normal. For mTHP ... [details about mTHP].
ack I will clean this up.
>
> > + * a feedback loop. If max_ptes_none is greater than HPAGE_PMD_NR/2, the value
> > + * would lead to collapses that introduces 2x more pages than the original
> > + * number of pages. On subsequent scans, the max_ptes_none check would be
> > + * satisfied and the collapses would continue until the largest order is reached
> > + */
>
> This is a super vague explanation. Please describe the issue with creep more
> clearly.
ok I will try to come up with something clearer.
>
> Also aren't we saying that 511 or 0 are the sensible choices? But now somehow
> that's not the case?
Oh I stated I wanted to propose this, and although there was some
pushback I still thought it deserved another attempt. This still
allows for some configurability, and with David's eagerness toggle
this still seems to fit nicely.
>
> You're also not giving a kdoc info on what this returns.
Ok, I'll add a kdoc here. Why this function in particular, though? I'm
trying to understand why we don't add kdocs on other functions.
>
> > +static int collapse_max_ptes_none(unsigned int order)
>
> It's a problem that existed already, but khugepaged_max_ptes_none is an unsigned
> int and this returns int.
>
> Maybe we should fix this while we're at it...
ack
>
> > +{
> > +     int max_ptes_none;
> > +
> > +     if (order != HPAGE_PMD_ORDER &&
> > +         khugepaged_max_ptes_none >= HPAGE_PMD_NR/2)
> > +             max_ptes_none = HPAGE_PMD_NR/2 - 1;
> > +     else
> > +             max_ptes_none = khugepaged_max_ptes_none;
> > +     return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > +
> > +}
> > +
>
> I really don't like this formulation, you're making it unnecessarily unclear and
> now, for the super common case of PMD size, you have to figure out 'oh it's this
> second branch and we're subtracting HPAGE_PMD_ORDER from HPAGE_PMD_ORDER so just
> return khugepaged_max_ptes_none'. When we could... just return it no?
>
> So something like:
>
> #define MAX_PTES_NONE_MTHP_CAP (HPAGE_PMD_NR / 2 - 1)
>
> static unsigned int collapse_max_ptes_none(unsigned int order)
> {
>         unsigned int max_ptes_none_pmd;
>
>         /* PMD-sized THPs behave precisely the same as before. */
>         if (order == HPAGE_PMD_ORDER)
>                 return khugepaged_max_ptes_none;
>
>         /*
>          * Bizarrely, this is expressed in terms of PTEs were this PMD-sized.
>          * For the reasons stated above, we cap this value in the case of mTHP.
>          */
>         max_ptes_none_pmd = MIN(MAX_PTES_NONE_MTHP_CAP,
>                 khugepaged_max_ptes_none);
>
>         /* Apply PMD -> mTHP scaling. */
>         return max_ptes_none_pmd >> (HPAGE_PMD_ORDER - order);
> }
yeah that's much cleaner thanks!
>
> >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> >                         vm_flags_t vm_flags)
> >  {
> > @@ -554,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >       struct folio *folio = NULL;
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> > -     int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > +     int scaled_max_ptes_none = collapse_max_ptes_none(order);
> >       const unsigned long nr_pages = 1UL << order;
> >
> >       for (_pte = pte; _pte < pte + nr_pages;
> > --
> > 2.51.0
> >
>
> Thanks, Lorenzo
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 12:25   ` David Hildenbrand
  2025-09-12 13:37     ` Johannes Weiner
@ 2025-09-12 23:31     ` Nico Pache
  2025-09-15  9:22       ` Kiryl Shutsemau
  1 sibling, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-12 23:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >> The following series provides khugepaged with the capability to collapse
> >> anonymous memory regions to mTHPs.
> >>
> >> To achieve this we generalize the khugepaged functions to no longer depend
> >> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >> range. The restriction on max_ptes_none is removed during the scan, to make
> >> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >> by the attempted collapse order to determine how full a mTHP must be to be
> >> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >> contains swapped out, or shared pages, we don't perform the collapse. It is
> >> now also possible to collapse to mTHPs without requiring the PMD THP size
> >> to be enabled.
> >>
> >> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >> constantly promoting mTHPs to the next available size, which would occur
> >> because a collapse introduces more non-zero pages that would satisfy the
> >> promotion condition on subsequent scans.
> >
> > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> >
>
> I am all for not adding any more ugliness on top of all the ugliness we
> added in the past.
>
> I will soon propose deprecating that parameter in favor of something
> that makes a bit more sense.
>
> In essence, we'll likely have an "eagerness" parameter that ranges from
> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> not all is populated".
Hi David,

Do you have any reason for 0-10? I'm guessing these will map to
different max_ptes_none values.
I suggest 0-5, mapping to 0,32,64,128,255,511.

You can take my collapse_max_ptes_none() function in this series and
rework it for the larger sysctl work you are doing.

Cheers,
-- Nico
>
> In between we will have more flexibility on how to set these values.
>
> Likely 9 will be around 50% to not even motivate the user to set
> something that does not make sense (creep).
>
> Of course, the old parameter will have to stick around in compat mode.
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 13:47   ` David Hildenbrand
  2025-09-12 14:28     ` David Hildenbrand
@ 2025-09-12 23:35     ` Nico Pache
  1 sibling, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-12 23:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 7:48 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >> The following series provides khugepaged with the capability to collapse
> >> anonymous memory regions to mTHPs.
> >>
> >> To achieve this we generalize the khugepaged functions to no longer depend
> >> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >> range. The restriction on max_ptes_none is removed during the scan, to make
> >> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >> by the attempted collapse order to determine how full a mTHP must be to be
> >> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >> contains swapped out, or shared pages, we don't perform the collapse. It is
> >> now also possible to collapse to mTHPs without requiring the PMD THP size
> >> to be enabled.
> >>
> >> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >> constantly promoting mTHPs to the next available size, which would occur
> >> because a collapse introduces more non-zero pages that would satisfy the
> >> promotion condition on subsequent scans.
> >
> > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>
> BTW, I thought further about this and I agree: if we count zero-filled
> pages towards none_or_zero then we can avoid the "creep" problem.
>
> The scanning-for-zero part is rather nasty, though.
IIRC David and I discussed this in the past and decided to avoid
this approach for now because it would be complicated and
"nasty".
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 17:53                 ` David Hildenbrand
  2025-09-12 18:21                   ` Lorenzo Stoakes
@ 2025-09-13  0:18                   ` Nico Pache
  1 sibling, 0 replies; 79+ messages in thread
From: Nico Pache @ 2025-09-13  0:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Kiryl Shutsemau, linux-mm, linux-doc,
	linux-kernel, linux-trace-kernel, ziy, baolin.wang, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > On Fri, Sep 12, 2025 at 05:45:26PM +0200, David Hildenbrand wrote:
> >> On 12.09.25 17:41, Kiryl Shutsemau wrote:
> >>> On Fri, Sep 12, 2025 at 04:56:47PM +0200, David Hildenbrand wrote:
> >>>> On 12.09.25 16:35, Kiryl Shutsemau wrote:
> >>>>> On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
> >>>>>> On 12.09.25 15:47, David Hildenbrand wrote:
> >>>>>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> >>>>>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >>>>>>>>> The following series provides khugepaged with the capability to collapse
> >>>>>>>>> anonymous memory regions to mTHPs.
> >>>>>>>>>
> >>>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >>>>>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >>>>>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >>>>>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
> >>>>>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >>>>>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >>>>>>>>> by the attempted collapse order to determine how full a mTHP must be to be
> >>>>>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >>>>>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
> >>>>>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
> >>>>>>>>> to be enabled.
> >>>>>>>>>
> >>>>>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >>>>>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >>>>>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >>>>>>>>> constantly promoting mTHPs to the next available size, which would occur
> >>>>>>>>> because a collapse introduces more non-zero pages that would satisfy the
> >>>>>>>>> promotion condition on subsequent scans.
> >>>>>>>>
> >>>>>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> >>>>>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> >>>>>>>
> >>>>>>> BTW, I thought further about this and I agree: if we count zero-filled
> >>>>>>> pages towards none_or_zero then we can avoid the "creep" problem.
> >>>>>>>
> >>>>>>> The scanning-for-zero part is rather nasty, though.
> >>>>>>
> >>>>>> Aaand, thinking again from the other direction, this would mean that just
> >>>>>> because pages became zero after some time that we would no longer collapse
> >>>>>> because none_or_zero would then be higher. Hm ....
> >>>>>>
> >>>>>> How I hate all of this so very very much :)
> >>>>>
> >>>>> This is not new. Shrinker has the same problem: it cannot distinguish
> >>>>> between hot 4k that happened to be zero from the 4k that is there just
> >>>>> because we faulted in 2M at a time.
> >>>>
> >>>> Right. And so far that problem is isolated to the shrinker.
> >>>>
> >>>> To me so far "none_or_zero" really meant "will I consume more memory when
> >>>> collapsing". That's not true for zero-filled pages, obviously.
> >>>
> >>> Well, KSM can reclaim these zero-filled memory until we collapse it.
> >>
> >> KSM is used so rarely (for good reasons) that I would never ever build an
> >> argument based on its existence :P
> >>
> >> But yes: during the very first shrinker discussion I raised that KSM can do
> >> the same thing. Obviously that was not good enough.
> >>
> >> --
> >> Cheers
> >>
> >> David / dhildenb
> >>
> >
> > With all this stuff said, do we have an actual plan for what we intend to do
> > _now_?
>
> Oh no, no I have to use my brain and it's Friday evening.
>
> >
> > As Nico has implemented a basic solution here that we all seem to agree is not
> > what we want.
> >
> > Without needing special new hardware or major reworks, what would this parameter
> > look like?
> >
> > What would the heuristics be? What about the eagerness scales?
> >
> > I'm but a simple kernel developer,
>
> :)
>
> and interested in simple pragmatic stuff :)
> > do you have a plan right now David?
>
> Ehm, if you ask me that way ...
>
> >
> > Maybe we can start with something simple like a rough percentage per eagerness
> > entry that then gets scaled based on utilisation?
>
> ... I think we should probably:
>
> 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
>
> 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
I think the best design is to map eagerness levels to different
max_ptes_none values, 0-5: 0,32,64,128,255,511.
>
> 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
>
> 4) Solve world peace and world hunger
>
> 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
>
>
> I maintain my initial position that just using
>
> max_ptes_none == 511 -> collapse mTHP always
> max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
I think we should implement the eagerness toggle and map it to
different max_ptes_none values like I described above. This fits nicely
into the current collapse_max_ptes_none() function.
If we go with just 0/511 without the eagerness changes, we will be
removing configurability only to reintroduce it again later, when we
could keep that configurability from the start.
>
> As a starting point is probably simple and best, and likely leaves room for any
> changes later.
>
>
> Of course, we could do what Nico is proposing here, as 1) and change it all later.
I don't think this is much different from the eagerness approach; it
just compresses the max_ptes_none range from 0-511 to 0-5/10.

I will wait for your RFC for the next version.

Does your implementation/thoughts align with what I describe above?
>
> It's just when it comes to documenting all that stuff in patch #15 that I feel like
> "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> depend on any weird behavior here by over-documenting it".
>
> I mean
>
> "
> +To prevent "creeping" behavior where collapses continuously promote to larger
> +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> +that introducing more than half of the pages to be non-zero it will always
> +satisfy the eligibility check on the next scan and the region will be collapse.
> "
>
> Is just way, way too detailed.
>
> I would just say "The kernel might decide to use a more conservative approach
> when collapsing smaller THPs" etc.

Sounds good, I can make it more ambiguous!

Cheers.
-- Nico
>
>
> Thoughts?
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 18:21                   ` Lorenzo Stoakes
@ 2025-09-13  0:28                     ` Nico Pache
  2025-09-15 10:44                       ` Lorenzo Stoakes
  2025-09-15 10:25                     ` David Hildenbrand
  1 sibling, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-13  0:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Kiryl Shutsemau, linux-mm, linux-doc,
	linux-kernel, linux-trace-kernel, ziy, baolin.wang, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > With all this stuff said, do we have an actual plan for what we intend to do
> > > _now_?
> >
> > Oh no, no I have to use my brain and it's Friday evening.
>
> I apologise :)
>
> >
> > >
> > > As Nico has implemented a basic solution here that we all seem to agree is not
> > > what we want.
> > >
> > > Without needing special new hardware or major reworks, what would this parameter
> > > look like?
> > >
> > > What would the heuristics be? What about the eagerness scales?
> > >
> > > I'm but a simple kernel developer,
> >
> > :)
> >
> > and interested in simple pragmatic stuff :)
> > > do you have a plan right now David?
> >
> > Ehm, if you ask me that way ...
> >
> > >
> > > Maybe we can start with something simple like a rough percentage per eagerness
> > > entry that then gets scaled based on utilisation?
> >
> > ... I think we should probably:
> >
> > 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
>
> Yes.
>
> >
> > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
>
> Yes I think we're all pretty onboard with that it seems!
>
> >
> > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
>
> Right, I feel like we could start with some very simple linear thing here and
> later maybe refine it?

I agree, something like 0,32,64,128,255,511 seems to map well, and is
not too different from what I'm doing with the scaling by
(HPAGE_PMD_ORDER - order).
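The (HPAGE_PMD_ORDER - order) scaling amounts to a right shift of the PMD-level tunable; a quick sketch (4K pages assumed, so HPAGE_PMD_ORDER is 9; not the exact kernel code):

```python
HPAGE_PMD_ORDER = 9  # 4K pages: a PMD maps 2**9 = 512 PTEs

def scaled_max_ptes_none(max_ptes_none, order):
    # Scale the PMD-level tunable down to a smaller collapse order.
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)

# e.g. max_ptes_none == 255 allows 255 >> 5 == 7 none PTEs at order 4 (64K)
```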

>
> >
> > 4) Solve world peace and world hunger
>
> Yes! That would be pretty great ;)
This should probably be a higher priority.
>
> >
> > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
>
> I think these are TODOs :)
>
> >
> >
> > I maintain my initial position that just using
> >
> > max_ptes_none == 511 -> collapse mTHP always
> > max_ptes_none != 511 -> collapse mTHP only if we all PTEs are non-none/zero
> >
> > As a starting point is probably simple and best, and likely leaves room for any
> > changes later.
>
> Yes.
>
> >
> >
> > Of course, we could do what Nico is proposing here, as 1) and change it all later.
>
> Right.
>
> But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
> given the 'creep' issue that's sensible.

I don't think that's much different from what David is trying to
propose, given eagerness=9 would be 50%.
At 10 or 511, no matter what, you will only ever collapse to the
largest enabled order.
The difference in my approach is that technically, with PMD disabled,
and 511, you would still need 50% utilization to collapse, which is
not ideal if you always want to collapse to some mTHP size even with
one page occupied. With David's solution this is solved by never
allowing anything in between 255 and 511.
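The "creep" being avoided here can be sketched with a toy model (4K pages assumed; the eligibility check is simplified and is not the kernel's actual code):

```python
# Toy model: with max_ptes_none >= HPAGE_PMD_NR/2, a successful collapse
# at order N fills the mTHP with non-none pages, which then satisfies the
# scaled eligibility check at order N+1 on the next scan, and so on.
HPAGE_PMD_ORDER = 9  # 512 PTEs per PMD on 4K pages

def eligible(occupied, order, max_ptes_none):
    nr_pages = 1 << order
    scaled = max_ptes_none >> (HPAGE_PMD_ORDER - order)
    return (nr_pages - occupied) <= scaled

occupied = 16         # 16 populated PTEs in the PMD range
max_ptes_none = 256   # >= HPAGE_PMD_NR/2: the problematic configuration
order = 4             # start at the smallest fully-occupied order
while order < HPAGE_PMD_ORDER and eligible(occupied, order + 1, max_ptes_none):
    order += 1
    occupied = 1 << order  # the collapse populates the whole region

# order creeps all the way to HPAGE_PMD_ORDER (9); with max_ptes_none
# capped at 255 instead, eligible(16, 5, 255) is already False.
```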

>
> >
> > It's just when it comes to documenting all that stuff in patch #15 that I feel like
> > "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> > depend on any weird behavior here by over-documenting it".
> >
> > I mean
> >
> > "
> > +To prevent "creeping" behavior where collapses continuously promote to larger
> > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > +that introducing more than half of the pages to be non-zero it will always
> > +satisfy the eligibility check on the next scan and the region will be collapse.
> > "
> >
> > Is just way, way too detailed.
> >
> > I would just say "The kernel might decide to use a more conservative approach
> > when collapsing smaller THPs" etc.
> >
> >
> > Thoughts?
>
> Well I've sort of reviewed oppositely there :) well at least that it needs to be
> a hell of a lot clearer (I find that comment really compressed and I just don't
> really understand it).

I think your review is still valid for improving the internal code
comment. I think David is suggesting not being so specific in the
actual admin-guide docs as we move towards a more opaque tunable.

>
> I guess I didn't think about people reading that and relying on it, so maybe we
> could alternatively make that succinct.
>
> But I think it'd be better to say something like "mTHP collapse cannot currently
> correctly function with half or more of the PTE entries empty, so we cap at just
> below this level" in this case.

Some middle ground might be the best answer: not too specific, but
also alluding to the inner workings a little.

Cheers,
-- Nico
>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Cheers, Lorenzo
>



* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 23:31     ` Nico Pache
@ 2025-09-15  9:22       ` Kiryl Shutsemau
  2025-09-15 10:22         ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-15  9:22 UTC (permalink / raw)
  To: Nico Pache
  Cc: David Hildenbrand, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 05:31:51PM -0600, Nico Pache wrote:
> On Fri, Sep 12, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > >> The following series provides khugepaged with the capability to collapse
> > >> anonymous memory regions to mTHPs.
> > >>
> > >> To achieve this we generalize the khugepaged functions to no longer depend
> > >> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> > >> pages that are occupied (!none/zero). After the PMD scan is done, we do
> > >> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> > >> range. The restriction on max_ptes_none is removed during the scan, to make
> > >> sure we account for the whole PMD range. When no mTHP size is enabled, the
> > >> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> > >> by the attempted collapse order to determine how full a mTHP must be to be
> > >> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> > >> contains swapped out, or shared pages, we don't perform the collapse. It is
> > >> now also possible to collapse to mTHPs without requiring the PMD THP size
> > >> to be enabled.
> > >>
> > >> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > >> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > >> mTHP collapses to prevent collapse "creep" behavior. This prevents
> > >> constantly promoting mTHPs to the next available size, which would occur
> > >> because a collapse introduces more non-zero pages that would satisfy the
> > >> promotion condition on subsequent scans.
> > >
> > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > >
> >
> > I am all for not adding any more ugliness on top of all the ugliness we
> > added in the past.
> >
> > I will soon propose deprecating that parameter in favor of something
> > that makes a bit more sense.
> >
> > In essence, we'll likely have an "eagerness" parameter that ranges from
> > 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> > not all is populated".
> Hi David,
> 
> Do you have any reason for 0-10, I'm guessing these will map to
> different max_ptes_none values.
> I suggest 0-5, mapping to 0,32,64,128,255,511

That's too x86-64 specific.

And the whole idea is not to map it directly, but to give the kernel
wiggle room to play.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15  9:22       ` Kiryl Shutsemau
@ 2025-09-15 10:22         ` David Hildenbrand
  2025-09-15 10:35           ` Lorenzo Stoakes
                             ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:22 UTC (permalink / raw)
  To: Kiryl Shutsemau, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 11:22, Kiryl Shutsemau wrote:
> On Fri, Sep 12, 2025 at 05:31:51PM -0600, Nico Pache wrote:
>> On Fri, Sep 12, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>>> The following series provides khugepaged with the capability to collapse
>>>>> anonymous memory regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>>>>> by the attempted collapse order to determine how full a mTHP must be to be
>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
>>>>> to be enabled.
>>>>>
>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>>> constantly promoting mTHPs to the next available size, which would occur
>>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>>> promotion condition on subsequent scans.
>>>>
>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>>>>
>>>
>>> I am all for not adding any more ugliness on top of all the ugliness we
>>> added in the past.
>>>
>>> I will soon propose deprecating that parameter in favor of something
>>> that makes a bit more sense.
>>>
>>> In essence, we'll likely have an "eagerness" parameter that ranges from
>>> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
>>> not all is populated".
>> Hi David,
>>
>> Do you have any reason for 0-10, I'm guessing these will map to
>> different max_ptes_none values.
>> I suggest 0-5, mapping to 0,32,64,128,255,511
> 
> That's too x86-64 specific.
> 
> And the whole idea is not to map to directly, but give kernel wiggle
> room to play.

Initially we will start out simple and map it directly. But yeah, the 
idea is to give us some more room later.

I had something logarithmic in mind, which would roughly be (ignoring 
the weird -1 for simplicity and expressing it as "used" instead of 
none-or-zero)

0 -> ~100% used (~0% none)
1 -> ~50% used (~50% none)
2 -> ~25% used (~75% none)
3 -> ~12.5% used (~87.5% none)
4 -> ~6.25% used (~93.75% none)
...
10 -> ~0% used (~100% none)

Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
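For illustration, that halving scheme could be expressed roughly like this (endpoints pinned, 512 PTEs per PMD assumed; a sketch, not a proposal):

```python
def none_fraction(eagerness, levels=10):
    # Each step halves the required "used" fraction; pin both endpoints.
    if eagerness == 0:
        return 0.0   # ~100% used required
    if eagerness >= levels:
        return 1.0   # always collapse, even at ~0% used
    return 1.0 - 0.5 ** eagerness

def max_ptes_none_for(eagerness, nr_ptes=512):
    # Map the fraction back onto the legacy tunable's units (0..nr_ptes-1).
    return int(none_fraction(eagerness) * (nr_ptes - 1))
```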

-- 
Cheers

David / dhildenb



* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 18:21                   ` Lorenzo Stoakes
  2025-09-13  0:28                     ` Nico Pache
@ 2025-09-15 10:25                     ` David Hildenbrand
  2025-09-15 10:32                       ` Lorenzo Stoakes
  1 sibling, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

>>
>> I would just say "The kernel might decide to use a more conservative approach
>> when collapsing smaller THPs" etc.
>>
>>
>> Thoughts?
> 
> Well I've sort of reviewed oppositely there :) well at least that it needs to be
> a hell of a lot clearer (I find that comment really compressed and I just don't
> really understand it).

Right. I think these are just details we should hide from the user. And 
in particular, not over-document it so we can more easily change 
semantics later.

> 
> I guess I didn't think about people reading that and relying on it, so maybe we
> could alternatively make that succinct.
> 
> But I think it'd be better to say something like "mTHP collapse cannot currently
> correctly function with half or more of the PTE entries empty, so we cap at just
> below this level" in this case.

IMHO just saying that the value might be reduced for internal 
purposes, and that this behavior might change in the future, would 
likely be good enough.

-- 
Cheers

David / dhildenb



* Re: [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function
  2025-09-12 23:26     ` Nico Pache
@ 2025-09-15 10:30       ` Lorenzo Stoakes
  0 siblings, 0 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:30 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
	baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
	rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
	wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kas, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 05:26:03PM -0600, Nico Pache wrote:
> On Fri, Sep 12, 2025 at 7:36 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Sep 11, 2025 at 09:28:01PM -0600, Nico Pache wrote:
> > > The current mechanism for determining mTHP collapse scales the
> > > khugepaged_max_ptes_none value based on the target order. This
> > > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > > is set to a value greater than HPAGE_PMD_NR / 2.
> > >
> > > With this configuration, a successful collapse to order N will populate
> > > enough pages to satisfy the collapse condition on order N+1 on the next
> > > scan. This leads to unnecessary work and memory churn.
> > >
> > > To fix this issue introduce a helper function that caps the max_ptes_none
> > > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > > the max_ptes_none number by the (PMD_ORDER - target collapse order).
> >
> > I would say very clearly that this is only in the mTHP case.
>
> ack, I stole most of the verbiage here from other notes I've
> previously written, but it can be improved.

Thanks.

>
> >
> >
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> >
> > Hmm I thought we were going to wait for David to investigate different
> > approaches to this?
> >
> > This is another issue with quickly going to another iteration. Though I do think
> > David explicitly said he'd come back with a solution?
>
> Sorry I thought that was being done in lockstep. The last version was
> about a month ago and I had a lot of changes queued up. Now that we
> have collapse_max_pte_none() David has a much easier entry point to
> work off :)

It'd be less problematic if this was an RFC but better to ensure all discussion
is really complete before next revision for everybody's sanity.

The ideal solution here would be to just ask David if he minded you respinning
before that was resolved, I think.

But I do think generally, as I said in [0] that it'd make our lives easier if
you left perhaps a day or two after the conversation has settled just to be sure
that's that, and obviously directly ask about things like this.

I can only politely ask for this, obviously you're free to do whatever... :)

[0]: https://lore.kernel.org/all/2d5270e4-0de3-4ea3-87a4-96254eb6d446@lucifer.local/

>
> I think he will still need this groundwork for the solution he is
> working on with "eagerness". If 10 -> 511, and 9 -> 255, ..., 0 -> 0,
> it will still have to do the scaling. Although I believe 0-10 should
> be more like 0-5 mapping to 0,32,64,128,255,511.

Yeah, let's leave that discussion to the subthread on that.

>
> >
> > So I'm not sure why we're seeing this solution here? Unless I'm missing
> > something?
> >
> > > ---
> > >  mm/khugepaged.c | 22 +++++++++++++++++++++-
> > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index b0ae0b63fc9b..4587f2def5c1 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -468,6 +468,26 @@ void __khugepaged_enter(struct mm_struct *mm)
> > >               wake_up_interruptible(&khugepaged_wait);
> > >  }
> > >
> > > +/* Returns the scaled max_ptes_none for a given order.
> >
> > We don't start comments at the /*, please use a normal comment format like:
> ack

Thanks

> >
> > /*
> >  * xxxx
> >  */
> >
> > > + * Caps the value to HPAGE_PMD_NR/2 - 1 in the case of mTHP collapse to prevent
> >
> > This is super unclear.
> >
> > It start with 'caps the xxx' which seems like you're talking generally.
> >
> > You should say very clearly 'For PMD allocations we apply the
> > khugepaged_max_ptes_none parameter as normal. For mTHP ... [details about mTHP].

> ack I will clean this up.

Thanks.

> >
> > > + * a feedback loop. If max_ptes_none is greater than HPAGE_PMD_NR/2, the value
> > > + * would lead to collapses that introduces 2x more pages than the original
> > > + * number of pages. On subsequent scans, the max_ptes_none check would be
> > > + * satisfied and the collapses would continue until the largest order is reached
> > > + */
> >
> > This is a super vauge explanation. Please describe the issue with creep more
> > clearly.

> ok I will try to come up with something clearer.

Thanks.

> >
> > Also aren't we saying that 511 or 0 are the sensible choices? But now somehow
> > that's not the case?
> Oh I stated I wanted to propose this, and although there was some
> pushback I still thought it deserved another attempt. This still
> allows for some configurability, and with David's eagerness toggle
> this still seems to fit nicely.
> >
> > You're also not giving a kdoc info on what this returns.
> Ok, I'll add a kdoc here. Why this function in particular? I'm trying
> to understand why we don't add kdocs on other functions.
> >
> > > +static int collapse_max_ptes_none(unsigned int order)
> >
> > It's a problem that existed already, but khugepaged_max_ptes_none is an unsigned
> > int and this returns int.
> >
> > Maybe we should fix this while we're at it...

> ack

Thanks

> >
> > > +{
> > > +     int max_ptes_none;
> > > +
> > > +     if (order != HPAGE_PMD_ORDER &&
> > > +         khugepaged_max_ptes_none >= HPAGE_PMD_NR/2)
> > > +             max_ptes_none = HPAGE_PMD_NR/2 - 1;
> > > +     else
> > > +             max_ptes_none = khugepaged_max_ptes_none;
> > > +     return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > > +
> > > +}
> > > +
> >
> > I really don't like this formulation, you're making it unnecessarily unclear and
> > now, for the super common case of PMD size, you have to figure out 'oh it's this
> > second branch and we're subtracting HPAGE_PMD_ORDER from HPAGE_PMD_ORDER so just
> > return khugepaged_max_ptes_none'. When we could... just return it no?
> >
> > So something like:
> >
> > #define MAX_PTES_NONE_MTHP_CAP (HPAGE_PMD_NR / 2 - 1)
> >
> > static unsigned int collapse_max_ptes_none(unsigned int order)
> > {
> >         unsigned int max_ptes_none_pmd;
> >
> >         /* PMD-sized THPs behave precisely the same as before. */
> >         if (order == HPAGE_PMD_ORDER)
> >                 return khugepaged_max_ptes_none;
> >
> >         /*
> >         * Bizarrely, this is expressed in terms of PTEs were this PMD-sized.
> >         * For the reasons stated above, we cap this value in the case of mTHP.
> >         */
> >         max_ptes_none_pmd = MIN(MAX_PTES_NONE_MTHP_CAP,
> >                 khugepaged_max_ptes_none);
> >
> >         /* Apply PMD -> mTHP scaling. */
> >         return max_ptes_none_pmd >> (HPAGE_PMD_ORDER - order);
> > }

> yeah that's much cleaner thanks!

:) Cool thanks!

> >
> > >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >                         vm_flags_t vm_flags)
> > >  {
> > > @@ -554,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >       struct folio *folio = NULL;
> > >       pte_t *_pte;
> > >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> > > -     int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > > +     int scaled_max_ptes_none = collapse_max_ptes_none(order);
> > >       const unsigned long nr_pages = 1UL << order;
> > >
> > >       for (_pte = pte; _pte < pte + nr_pages;
> > > --
> > > 2.51.0
> > >
> >
> > Thanks, Lorenzo
> >
>

Cheers, Lorenzo


* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:25                     ` David Hildenbrand
@ 2025-09-15 10:32                       ` Lorenzo Stoakes
  2025-09-15 10:37                         ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:25:54PM +0200, David Hildenbrand wrote:
> > >
> > > I would just say "The kernel might decide to use a more conservative approach
> > > when collapsing smaller THPs" etc.
> > >
> > >
> > > Thoughts?
> >
> > Well I've sort of reviewed oppositely there :) well at least that it needs to be
> > a hell of a lot clearer (I find that comment really compressed and I just don't
> > really understand it).
>
> Right. I think these are just details we should hide from the user. And in
> particular, not over-document it so we can more easily change semantics
> later.

And when we change semantics we can't change comments?

I mean maybe we're talking across purposes here, I'm talking about code
comments, not the documentation.

I agree the documentation should not mention any of this.

>
> >
> > I guess I didn't think about people reading that and relying on it, so maybe we
> > could alternatively make that succinct.
> >
> > But I think it'd be better to say something like "mTHP collapse cannot currently
> > correctly function with half or more of the PTE entries empty, so we cap at just
> > below this level" in this case.
>
> IMHO we should just say that the value might be reduced for internal
> purposes and that this behavior might change in the future would likely be
> good enough.

Again, I assume you mean documentation rather than comments?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo


* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:22         ` David Hildenbrand
@ 2025-09-15 10:35           ` Lorenzo Stoakes
  2025-09-15 10:39             ` David Hildenbrand
  2025-09-15 10:40             ` Lorenzo Stoakes
  2025-09-15 10:43           ` Lorenzo Stoakes
  2025-09-15 11:41           ` Nico Pache
  2 siblings, 2 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> On 15.09.25 11:22, Kiryl Shutsemau wrote:
> > On Fri, Sep 12, 2025 at 05:31:51PM -0600, Nico Pache wrote:
> > > On Fri, Sep 12, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 12.09.25 14:19, Kiryl Shutsemau wrote:
> > > > > On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> > > > > > The following series provides khugepaged with the capability to collapse
> > > > > > anonymous memory regions to mTHPs.
> > > > > >
> > > > > > To achieve this we generalize the khugepaged functions to no longer depend
> > > > > > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> > > > > > pages that are occupied (!none/zero). After the PMD scan is done, we do
> > > > > > binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> > > > > > range. The restriction on max_ptes_none is removed during the scan, to make
> > > > > > sure we account for the whole PMD range. When no mTHP size is enabled, the
> > > > > > legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> > > > > > by the attempted collapse order to determine how full a mTHP must be to be
> > > > > > eligible for the collapse to occur. If a mTHP collapse is attempted, but
> > > > > > contains swapped out, or shared pages, we don't perform the collapse. It is
> > > > > > now also possible to collapse to mTHPs without requiring the PMD THP size
> > > > > > to be enabled.
> > > > > >
> > > > > > When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> > > > > > 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> > > > > > mTHP collapses to prevent collapse "creep" behavior. This prevents
> > > > > > constantly promoting mTHPs to the next available size, which would occur
> > > > > > because a collapse introduces more non-zero pages that would satisfy the
> > > > > > promotion condition on subsequent scans.
> > > > >
> > > > > Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> > > > > all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> > > > >
> > > >
> > > > I am all for not adding any more ugliness on top of all the ugliness we
> > > > added in the past.
> > > >
> > > > I will soon propose deprecating that parameter in favor of something
> > > > that makes a bit more sense.
> > > >
> > > > In essence, we'll likely have an "eagerness" parameter that ranges from
> > > > 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> > > > not all is populated".
> > > Hi David,
> > >
> > > Do you have any reason for 0-10, I'm guessing these will map to
> > > different max_ptes_none values.
> > > I suggest 0-5, mapping to 0,32,64,128,255,511
> >
> > That's too x86-64 specific.
> >
> > And the whole idea is not to map to directly, but give kernel wiggle
> > room to play.
>
> Initially we will start out simple and map it directly. But yeah, the idea
> is to give us some more room later.

I think it's less 'wiggle room' and more us being able to _abstract_ what this
measurement means while reserving the right to adjust this.

But maybe we are saying the same thing in different ways.

>
> I had something logarithmic in mind which would roughly be (ignoring the
> weird -1 for simplicity and expressing it as "used" instead of none-or-zero)
>
> 0 -> ~100% used (~0% none)

So equivalent to 511 today?

> 1 -> ~50% used (~50% none)
> 2 -> ~25% used (~75% none)
> 3 -> ~12.5% used (~87.5% none)
> 4 -> ~6.25% used (~93.75% none)
> ...
> 10 -> ~0% used (~100% none)

So equivalent to 0 today?

And with a logarithmic weighting towards values closer to "0% used"?

This seems sensible given the only reports we've had of non-0/511 uses here are
in that range...

But ofc this interpretation should be something we determine + treated as an
implementation detail that we can modify later.

>
> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.

And at different mTHP levels too right?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo


* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:32                       ` Lorenzo Stoakes
@ 2025-09-15 10:37                         ` David Hildenbrand
  2025-09-15 10:46                           ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 12:32, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:25:54PM +0200, David Hildenbrand wrote:
>>>>
>>>> I would just say "The kernel might decide to use a more conservative approach
>>>> when collapsing smaller THPs" etc.
>>>>
>>>>
>>>> Thoughts?
>>>
>>> Well I've sort of reviewed oppositely there :) well at least that it needs to be
>>> a hell of a lot clearer (I find that comment really compressed and I just don't
>>> really understand it).
>>
>> Right. I think these are just details we should hide from the user. And in
>> particular, not over-document it so we can more easily change semantics
>> later.
> 
> And when we change semantics we can't change comments?
> 
> I mean maybe we're talking across purposes here, I'm talking about code
> comments, not the documentation.
> 
> I agree the documentation should not mention any of this.

Yes, I was talking about patch #15 ("It's just when it comes to 
documenting all that stuff in patch #15").

Comments we can adjust as we please of course :)

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:35           ` Lorenzo Stoakes
@ 2025-09-15 10:39             ` David Hildenbrand
  2025-09-15 10:40             ` Lorenzo Stoakes
  1 sibling, 0 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 12:35, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
>> On 15.09.25 11:22, Kiryl Shutsemau wrote:
>>> On Fri, Sep 12, 2025 at 05:31:51PM -0600, Nico Pache wrote:
>>>> On Fri, Sep 12, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>>>>> The following series provides khugepaged with the capability to collapse
>>>>>>> anonymous memory regions to mTHPs.
>>>>>>>
>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>>>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>>>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
>>>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>>>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>>>>>>> by the attempted collapse order to determine how full a mTHP must be to be
>>>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>>>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
>>>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
>>>>>>> to be enabled.
>>>>>>>
>>>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>>>>> constantly promoting mTHPs to the next available size, which would occur
>>>>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>>>>> promotion condition on subsequent scans.
>>>>>>
>>>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>>>>>>
>>>>>
>>>>> I am all for not adding any more ugliness on top of all the ugliness we
>>>>> added in the past.
>>>>>
>>>>> I will soon propose deprecating that parameter in favor of something
>>>>> that makes a bit more sense.
>>>>>
>>>>> In essence, we'll likely have an "eagerness" parameter that ranges from
>>>>> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
>>>>> not all is populated".
>>>> Hi David,
>>>>
>>>> Do you have any reason for 0-10, I'm guessing these will map to
>>>> different max_ptes_none values.
>>>> I suggest 0-5, mapping to 0,32,64,128,255,511
>>>
>>> That's too x86-64 specific.
>>>
>>> And the whole idea is not to map to directly, but give kernel wiggle
>>> room to play.
>>
>> Initially we will start out simple and map it directly. But yeah, the idea
>> is to give us some more room later.
> 
> I think it's less 'wiggle room' and more us being able to _abstract_ what this
> measurement means while reserving the right to adjust this.
> 
> But maybe we are saying the same thing in different ways.
> 
>>
>> I had something logarithmic in mind which would roughly be (ignoring the
>> weird -1 for simplicity and expressing it as "used" instead of none-or-zero)
>>
>> 0 -> ~100% used (~0% none)
> 
> So equivalent to 511 today?
> 
>> 1 -> ~50% used (~50% none)
>> 2 -> ~25% used (~75% none)
>> 3 -> ~12.5% used (~87.5% none)
>> 4 -> ~6.25% used (~93.75% none)
>> ...
>> 10 -> ~0% used (~100% none)
> 
> So equivalent to 0 today?

Yes.

> 
> And with a logarithmic weighting towards values closer to "0% used"?
> 
> This seems sensible given the only reports we've had of non-0/511 uses here are
> in that range...
> 
> But ofc this interpretation should be something we determine + treat as an
> implementation detail that we can modify later.
> 
>>
>> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
> 
> And at different mTHP levels too right?

Yes exactly.
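For concreteness, the quoted logarithmic eagerness scale could be sketched roughly as below. This is purely illustrative: the function names, the 0..10 endpoint semantics, and the rounding are assumptions made for the sketch, not anything the series implements.

```python
HPAGE_PMD_NR = 512  # PTEs per PMD with 4K pages on x86-64

def max_none_fraction(eagerness):
    """Illustrative logarithmic scale: level 0 tolerates no none/zero
    PTEs, each level up to 9 halves the required utilisation (so the
    tolerated 'none' fraction is 1 - 0.5**level), and level 10
    tolerates an almost entirely empty range."""
    if eagerness <= 0:
        return 0.0
    if eagerness >= 10:
        return 1.0
    return 1.0 - 0.5 ** eagerness

def eagerness_to_max_ptes_none(eagerness):
    # Express the fraction as a whole max_ptes_none count out of 511.
    return int(max_none_fraction(eagerness) * (HPAGE_PMD_NR - 1))
```

Under this sketch level 1 maps to 255 and level 2 to 383; which end of the scale should mean "always collapse" is exactly the inversion question raised elsewhere in the thread.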

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:35           ` Lorenzo Stoakes
  2025-09-15 10:39             ` David Hildenbrand
@ 2025-09-15 10:40             ` Lorenzo Stoakes
  2025-09-15 10:44               ` David Hildenbrand
  1 sibling, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 11:35:53AM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> > Initially we will start out simple and map it directly. But yeah, the idea
> > is to give us some more room later.
>
> I think it's less 'wiggle room' and more us being able to _abstract_ what this
> measurement means while reserving the right to adjust this.
>
> But maybe we are saying the same thing in different ways.
>
> >
> > I had something logarithmic in mind which would roughly be (ignoring the
> > weird -1 for simplicity and expressing it as "used" instead of none-or-zero)
> >
> > 0 -> ~100% used (~0% none)
>
> So equivalent to 511 today?
>
> > 1 -> ~50% used (~50% none)
> > 2 -> ~25% used (~75% none)
> > 3 -> ~12.5% used (~87.5% none)
> > 4 -> ~6.25% used (~93.75% none)
> > ...
> > 10 -> ~0% used (~100% none)
>
> So equivalent to 0 today?
>
> And with a logarithmic weighting towards values closer to "0% used"?
>
> This seems sensible given the only reports we've had of non-0/511 uses here are
> in that range...
>
> But ofc this interpretation should be something we determine + treat as an
> implementation detail that we can modify later.
>
> >
> > Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
>
> And at different mTHP levels too right?
>

Another point here, since we have to keep:

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

Around, and users will try to set values there, presumably we will now add:

/sys/kernel/mm/transparent_hugepage/khugepaged/eagerness

How will we map <-> the two tunables?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:22         ` David Hildenbrand
  2025-09-15 10:35           ` Lorenzo Stoakes
@ 2025-09-15 10:43           ` Lorenzo Stoakes
  2025-09-15 10:52             ` David Hildenbrand
  2025-09-15 11:41           ` Nico Pache
  2 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
>
> 0 -> ~100% used (~0% none)
> 1 -> ~50% used (~50% none)
> 2 -> ~25% used (~75% none)
> 3 -> ~12.5% used (~87.5% none)
> 4 -> ~6.25% used (~93.75% none)
> ...
> 10 -> ~0% used (~100% none)

Oh and shouldn't this be inverted?

0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
10 eagerness = we aren't eager to eat up none PTE entries at all?

Or am I being dumb here?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:40             ` Lorenzo Stoakes
@ 2025-09-15 10:44               ` David Hildenbrand
  2025-09-15 10:48                 ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

>>> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
>>
>> And at different mTHP levels too right?
>>
> 
> Another point here, since we have to keep:
> 
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> 
> Around, and users will try to set values there, presumably we will now add:
> 
> /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
> 
> How will we map <-> the two tunables?

Well, the easy case is if someone updates eagerness, then we simply set it 
to whatever magic value we compute and document.

The other direction is more problematic, likely we'll simply warn and do 
something reasonable (map it to whatever eagerness scale is closest or 
simply indicate it as "-1" -- user intervened or sth like that)
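One hypothetical way to implement the "map it to whatever eagerness scale is closest" direction, assuming an illustrative logarithmic 0..10 scale (the helper name and the exact numbers are assumptions for the sketch, not from the series):

```python
def closest_eagerness(max_ptes_none, hpage_pmd_nr=512):
    """Snap a legacy max_ptes_none value (0..511) to the nearest level
    of a hypothetical 0..10 eagerness scale, where level k tolerates a
    none/zero fraction of 1 - 0.5**k (level 0: none, level 10: all)."""
    levels = [0.0] + [1.0 - 0.5 ** k for k in range(1, 10)] + [1.0]
    frac = max_ptes_none / (hpage_pmd_nr - 1)
    return min(range(len(levels)), key=lambda lvl: abs(levels[lvl] - frac))
```

A write of, say, 255 would then snap to level 1; whether to warn instead, or report an "override" state, is the separate policy question discussed here.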

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-13  0:28                     ` Nico Pache
@ 2025-09-15 10:44                       ` Lorenzo Stoakes
  0 siblings, 0 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:44 UTC (permalink / raw)
  To: Nico Pache
  Cc: David Hildenbrand, Kiryl Shutsemau, linux-mm, linux-doc,
	linux-kernel, linux-trace-kernel, ziy, baolin.wang, Liam.Howlett,
	ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Fri, Sep 12, 2025 at 06:28:55PM -0600, Nico Pache wrote:
> On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > > With all this stuff said, do we have an actual plan for what we intend to do
> > > > _now_?
> > >
> > > Oh no, no I have to use my brain and it's Friday evening.
> >
> > I apologise :)
> >
> > >
> > > >
> > > > As Nico has implemented a basic solution here that we all seem to agree is not
> > > > what we want.
> > > >
> > > > Without needing special new hardware or major reworks, what would this parameter
> > > > look like?
> > > >
> > > > What would the heuristics be? What about the eagerness scales?
> > > >
> > > > I'm but a simple kernel developer,
> > >
> > > :)
> > >
> > > and interested in simple pragmatic stuff :)
> > > > do you have a plan right now David?
> > >
> > > Ehm, if you ask me that way ...
> > >
> > > >
> > > > Maybe we can start with something simple like a rough percentage per eagerness
> > > > entry that then gets scaled based on utilisation?
> > >
> > > ... I think we should probably:
> > >
> > > 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
> >
> > Yes.
> >
> > >
> > > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
> >
> > Yes I think we're all pretty onboard with that it seems!
> >
> > >
> > > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
> >
> > Right, I feel like we could start with some very simple linear thing here and
> > later maybe refine it?
>
> I agree, something like 0,32,64,128,255,511 seems to map well, and is
> not too different from what I'm doing with the scaling by
> (HPAGE_PMD_ORDER - order).

Actually, I suspect something like what David suggests in [0] is probably the
better way, but as I said there I think it should be an internal implementation
detail as to what this ultimately ends up being.

The idea is we provide an abstract thing a user can set, and the kernel figures
out how best to interpret that.

[0]: https://lore.kernel.org/linux-mm/cd8e7f1c-a563-4ae9-a0fb-b0d04a4c35b4@redhat.com/
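The "(HPAGE_PMD_ORDER - order)" scaling Nico mentions can be sketched as below. This is an illustrative reading only; the helper name is hypothetical and the series' actual computation may round or clamp differently.

```python
HPAGE_PMD_ORDER = 9  # 2^9 = 512 4K PTEs per PMD

def scaled_max_ptes_none(max_ptes_none, order):
    """Scale the PMD-level max_ptes_none budget down for a smaller
    collapse order: an order-4 (64K) attempt gets 1/32 of the budget
    an order-9 (2M) attempt would get."""
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)
```

With the default max_ptes_none of 511 this gives 15 for an order-4 collapse, i.e. roughly the same "how full must the range be" threshold, applied proportionally per order.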

>
> >
> > >
> > > 4) Solve world peace and world hunger
> >
> > Yes! That would be pretty great ;)
> This should probably be a larger priority

:)))

> >
> > >
> > > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
> >
> > I think these are TODOs :)
> >
> > >
> > >
> > > I maintain my initial position that just using
> > >
> > > max_ptes_none == 511 -> collapse mTHP always
> > > max_ptes_none != 511 -> collapse mTHP only if we all PTEs are non-none/zero
> > >
> > > As a starting point is probably simple and best, and likely leaves room for any
> > > changes later.
> >
> > Yes.
> >
> > >
> > >
> > > Of course, we could do what Nico is proposing here, as 1) and change it all later.
> >
> > Right.
> >
> > But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
> > given the 'creep' issue that's sensible.
>
> I don't think that's much different to what David is trying to propose,
> given eagerness=9 would be 50%.

> at 10 or 511, no matter what, you will only ever collapse to the
> largest enabled order.
> The difference in my approach is that technically, with PMD disabled,
> and 511, you would still need 50% utilization to collapse, which is
> not ideal if you always want to collapse to some mTHP size even with 1
> page occupied. With David's solution this is solved by never allowing
> anything in between 255-511.

Right. Except we default to max eagerness (or min, I asked David about the
values there :P)

So aren't we, by default, broken on mTHP? Maybe we can change the default though...

>
> >
> > >
> > > It's just when it comes to documenting all that stuff in patch #15 that I feel like
> > > "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> > > depend on any weird behavior here by over-documenting it".
> > >
> > > I mean
> > >
> > > "
> > > +To prevent "creeping" behavior where collapses continuously promote to larger
> > > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > > +that introducing more than half of the pages to be non-zero it will always
> > > +satisfy the eligibility check on the next scan and the region will be collapse.
> > > "
> > >
> > > Is just way, way too detailed.
> > >
> > > I would just say "The kernel might decide to use a more conservative approach
> > > when collapsing smaller THPs" etc.
> > >
> > >
> > > Thoughts?
> >
> > Well I've sort of reviewed oppositely there :) well at least that it needs to be
> > a hell of a lot clearer (I find that comment really compressed and I just don't
> > really understand it).
>
> I think your review is still valid to improve the internal code
> comment. I think David is suggesting to not be so specific in the
> actual admin-guide docs as we move towards a more opaque tunable.

Yeah thanks for pointing that out! We were talking across purposes.

>
> >
> > I guess I didn't think about people reading that and relying on it, so maybe we
> > could alternatively make that succinct.
> >
> > But I think it'd be better to say something like "mTHP collapse cannot currently
> > correctly function with half or more of the PTE entries empty, so we cap at just
> > below this level" in this case.
>
> Some middle ground might be the best answer, not too specific, but
> also allude to the interworking a little.

Yeah actually I agree with David re: documentation, my comments were wrt
err... comments :P only.

>
> Cheers,
> -- Nico

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:37                         ` David Hildenbrand
@ 2025-09-15 10:46                           ` Lorenzo Stoakes
  0 siblings, 0 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:37:28PM +0200, David Hildenbrand wrote:
> On 15.09.25 12:32, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 12:25:54PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > I would just say "The kernel might decide to use a more conservative approach
> > > > > when collapsing smaller THPs" etc.
> > > > >
> > > > >
> > > > > Thoughts?
> > > >
> > > > Well I've sort of reviewed oppositely there :) well at least that it needs to be
> > > > a hell of a lot clearer (I find that comment really compressed and I just don't
> > > > really understand it).
> > >
> > > Right. I think these are just details we should hide from the user. And in
> > > particular, not over-document it so we can more easily change semantics
> > > later.
> >
> > And when we change semantics we can't change comments?
> >
> > I mean maybe we're talking across purposes here, I'm talking about code
> > comments, not the documentation.
> >
> > I agree the documentation should not mention any of this.
>
> Yes, I was talking about patch #15 ("It's just when it comes to documenting
> all that stuff in patch #15").
>
> Comments we can adjust as we please of course :)

Yeah, ok this was just a misunderstanding then! :) We are in agreement.

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:44               ` David Hildenbrand
@ 2025-09-15 10:48                 ` Lorenzo Stoakes
  2025-09-15 10:52                   ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:44:34PM +0200, David Hildenbrand wrote:
> > > > Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
> > >
> > > And at different mTHP levels too right?
> > >
> >
> > Another point here, since we have to keep:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> >
> > Around, and users will try to set values there, presumably we will now add:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
> >
> > How will we map <-> the two tunables?
>
> Well, the easy case is if someone updates eagerness, then we simply set it to
> whatever magic value we compute and document.
>
> The other direction is more problematic, likely we'll simply warn and do
> something reasonable (map it to whatever eagerness scale is closest or
> simply indicate it as "-1" -- user intervened or sth like that)

I don't love the idea of a -1 situation, as that's going to create some
confusion.

I'd really rather we just say out and out 'the kernel decides this based on
eagerness'.

So either warn or have some method to reverse-engineer what the closest value
might be.

Or perhaps just accept 0/511 there and map to eagerness min/max + if non-0/511
warn?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:43           ` Lorenzo Stoakes
@ 2025-09-15 10:52             ` David Hildenbrand
  2025-09-15 11:02               ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 12:43, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
>>
>> 0 -> ~100% used (~0% none)
>> 1 -> ~50% used (~50% none)
>> 2 -> ~25% used (~75% none)
>> 3 -> ~12.5% used (~87.5% none)
>> 4 -> ~6.25% used (~93.75% none)
>> ...
>> 10 -> ~0% used (~100% none)
> 
> Oh and shouldn't this be inverted?
> 
> 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
> 10 eagerness = we aren't eager to eat up none PTE entries at all?
> 
> Or am I being dumb here?

Good question.

For swappiness it's: 0 -> no swap (conservative)

So intuitively I assumed: 0 -> no pte_none (conservative)

You're the native speaker, so you tell me :)

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:48                 ` Lorenzo Stoakes
@ 2025-09-15 10:52                   ` David Hildenbrand
  2025-09-15 10:59                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 12:48, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:44:34PM +0200, David Hildenbrand wrote:
>>>>> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
>>>>
>>>> And at different mTHP levels too right?
>>>>
>>>
>>> Another point here, since we have to keep:
>>>
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>
>>> Around, and users will try to set values there, presumably we will now add:
>>>
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
>>>
>>> How will we map <-> the two tunables?
>>
>> Well, the easy case is if someone updates eagerness, then we simply set it to
>> whatever magic value we compute and document.
>>
>> The other direction is more problematic, likely we'll simply warn and do
>> something reasonable (map it to whatever eagerness scale is closest or
>> simply indicate it as "-1" -- user intervened or sth like that)
> 
> I don't love the idea of a -1 situation, as that's going to create some
> confusion.

swappiness also has a "max" parameter, so we could just say "override" / 
"disabled" / whatever?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:52                   ` David Hildenbrand
@ 2025-09-15 10:59                     ` Lorenzo Stoakes
  2025-09-15 11:10                       ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:52:53PM +0200, David Hildenbrand wrote:
> On 15.09.25 12:48, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 12:44:34PM +0200, David Hildenbrand wrote:
> > > > > > Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
> > > > >
> > > > > And at different mTHP levels too right?
> > > > >
> > > >
> > > > Another point here, since we have to keep:
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> > > >
> > > > Around, and users will try to set values there, presumably we will now add:
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
> > > >
> > > > How will we map <-> the two tunables?
> > >
> > > Well, the easy case is if someone updates eagerness, then we simply set it to
> > > whatever magic value we compute and document.
> > >
> > > The other direction is more problematic, likely we'll simply warn and do
> > > something reasonable (map it to whatever eagerness scale is closest or
> > > simply indicate it as "-1" -- user intervened or sth like that)
> >
> > I don't love the idea of a -1 situation, as that's going to create some
> > confusion.
>
> swappiness also has a "max" parameter, so we could just say "override" /
> "disabled" / whatever?

I don't love the user being able to override this though, let's just nuke their
ability to set this pleeeease.

Because if they can override it, then we have to do some deeply nasty scaling
for mTHP again.

Would really prefer us to only accept 0/511 + warn on anything else.

We could put the warning in a cycle before we land the change also + just take
Nico's current version for now.

So that way people are aware it's coming...

(Could also put in docs ofc)

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:52             ` David Hildenbrand
@ 2025-09-15 11:02               ` Lorenzo Stoakes
  2025-09-15 11:14                 ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 11:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
> On 15.09.25 12:43, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> > >
> > > 0 -> ~100% used (~0% none)
> > > 1 -> ~50% used (~50% none)
> > > 2 -> ~25% used (~75% none)
> > > 3 -> ~12.5% used (~87.5% none)
> > > 4 -> ~6.25% used (~93.75% none)
> > > ...
> > > 10 -> ~0% used (~100% none)
> >
> > Oh and shouldn't this be inverted?
> >
> > 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
> > 10 eagerness = we aren't eager to eat up none PTE entries at all?
> >
> > Or am I being dumb here?
>
> Good question.
>
> For swappiness it's: 0 -> no swap (conservative)
>
> So intuitively I assumed: 0 -> no pte_none (conservative)
>
> You're the native speaker, so you tell me :)

To me this is about 'eagerness to consume empty PTE entries' so 10 is more
eager, 0 is not eager at all, i.e. inversion of what you suggest :)

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:59                     ` Lorenzo Stoakes
@ 2025-09-15 11:10                       ` David Hildenbrand
  2025-09-15 11:13                         ` Lorenzo Stoakes
  2025-09-15 12:16                         ` Usama Arif
  0 siblings, 2 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 11:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 12:59, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:52:53PM +0200, David Hildenbrand wrote:
>> On 15.09.25 12:48, Lorenzo Stoakes wrote:
>>> On Mon, Sep 15, 2025 at 12:44:34PM +0200, David Hildenbrand wrote:
>>>>>>> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
>>>>>>
>>>>>> And at different mTHP levels too right?
>>>>>>
>>>>>
>>>>> Another point here, since we have to keep:
>>>>>
>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>>
>>>>> Around, and users will try to set values there, presumably we will now add:
>>>>>
>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
>>>>>
>>>>> How will we map <-> the two tunables?
>>>>
>>>> Well, the easy case is if someone updates eagerness: then we simply set it
>>>> to whatever magic value we compute and document.
>>>>
>>>> The other direction is more problematic, likely we'll simply warn and do
>>>> something reasonable (map it to whatever eagerness scale is closest or
>>>> simply indicate it as "-1" -- user intervened or sth like that)
>>>
>>> I don't love the idea of a -1 situation, as that's going to create some
>>> confusion.
>>
>> swappiness also has a "max" parameter, so we could just say "override" /
>> "disabled" / whatever?
> 
> I don't love the user being able to override this though, let's just nuke their
> ability to set this pleeeease.
> 
> Because if they can override it, then we have to do some deeply nasty scaling
> for mTHP again.

There are ways to have it working internally, just using a different 
"scale" instead of the 100 -> 50 -> 25 etc.

I am afraid we cannot change the parameter to ignore other values 
because of the interaction with the shrinker that easily .... we might 
be able to detracted at wait a bunch of kernel releases probably.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:10                       ` David Hildenbrand
@ 2025-09-15 11:13                         ` Lorenzo Stoakes
  2025-09-15 11:16                           ` David Hildenbrand
  2025-09-15 12:16                         ` Usama Arif
  1 sibling, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 11:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 01:10:22PM +0200, David Hildenbrand wrote:
> On 15.09.25 12:59, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 12:52:53PM +0200, David Hildenbrand wrote:
> > > On 15.09.25 12:48, Lorenzo Stoakes wrote:
> > > > On Mon, Sep 15, 2025 at 12:44:34PM +0200, David Hildenbrand wrote:
> > > > > > > > Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
> > > > > > >
> > > > > > > And at different mTHP levels too right?
> > > > > > >
> > > > > >
> > > > > > Another point here, since we have to keep:
> > > > > >
> > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> > > > > >
> > > > > > Around, and users will try to set values there, presumably we will now add:
> > > > > >
> > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
> > > > > >
> > > > > > How will we map <-> the two tunables?
> > > > >
> > > > > Well, the easy case is if someone updates eagerness: then we simply set
> > > > > it to whatever magic value we compute and document.
> > > > >
> > > > > The other direction is more problematic, likely we'll simply warn and do
> > > > > something reasonable (map it to whatever eagerness scale is closest or
> > > > > simply indicate it as "-1" -- user intervened or sth like that)
> > > >
> > > > I don't love the idea of a -1 situation, as that's going to create some
> > > > confusion.
> > >
> > > swappiness also has a "max" parameter, so we could just say "override" /
> > > "disabled" / whatever?
> >
> > I don't love the user being able to override this though, let's just nuke their
> > ability to set this pleeeease.
> >
> > Because if they can override it, then we have to do some deeply nasty scaling
> > for mTHP again.
>
> There are ways to have it working internally, just using a different "scale"
> instead of the 100 -> 50 -> 25 etc.

Right. I mean with the exponential scale we could just algorithmically figure
out what the eagerness should be.

>
> I am afraid we cannot change the parameter to ignore other values because of
> the interaction with the shrinker that easily .... we might be able to
> detracted at wait a bunch of kernel releases probably.

:(

BTW 'Detracted at wait'? :P You mean we might be able to remove after a few
releases?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:02               ` Lorenzo Stoakes
@ 2025-09-15 11:14                 ` David Hildenbrand
  2025-09-15 11:23                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 11:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 13:02, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
>> On 15.09.25 12:43, Lorenzo Stoakes wrote:
>>> On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
>>>>
>>>> 0 -> ~100% used (~0% none)
>>>> 1 -> ~50% used (~50% none)
>>>> 2 -> ~25% used (~75% none)
>>>> 3 -> ~12.5% used (~87.5% none)
>>>> 4 -> ~11.25% used (~88,75% none)
>>>> ...
>>>> 10 -> ~0% used (~100% none)
>>>
>>> Oh and shouldn't this be inverted?
>>>
>>> 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
>>> 10 eagerness = we aren't eager to eat up none PTE entries at all?
>>>
>>> Or am I being dumb here?
>>
>> Good question.
>>
>> For swappiness it's: 0 -> no swap (conservative)
>>
>> So intuitively I assumed: 0 -> no pte_none (conservative)
>>
>> You're the native speaker, so you tell me :)
> 
> To me this is about 'eagerness to consume empty PTE entries' so 10 is more
> eager, 0 is not eager at all, i.e. inversion of what you suggest :)

Just so we are on the same page: it is about "eagerness to collapse", right?

Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am 
very careful and bail out on any pte_none" vs. 10 meaning "I am very 
eager, I will collapse no matter what I find in the page table, waste as 
much memory as I want"?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:13                         ` Lorenzo Stoakes
@ 2025-09-15 11:16                           ` David Hildenbrand
  0 siblings, 0 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 11:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

>> I am afraid we cannot change the parameter to ignore other values because of
>> the interaction with the shrinker that easily .... we might be able to
>> detracted at wait a bunch of kernel releases probably.
> 
> :(
> 
> BTW 'Detracted at wait'? :P You mean we might be able to remove after a few
> releases?


"deprecate and wait" :)


-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:14                 ` David Hildenbrand
@ 2025-09-15 11:23                   ` Lorenzo Stoakes
  2025-09-15 11:29                     ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 11:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 01:14:32PM +0200, David Hildenbrand wrote:
> On 15.09.25 13:02, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
> > > On 15.09.25 12:43, Lorenzo Stoakes wrote:
> > > > On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > 0 -> ~100% used (~0% none)
> > > > > 1 -> ~50% used (~50% none)
> > > > > 2 -> ~25% used (~75% none)
> > > > > 3 -> ~12.5% used (~87.5% none)
> > > > > 4 -> ~11.25% used (~88,75% none)
> > > > > ...
> > > > > 10 -> ~0% used (~100% none)
> > > >
> > > > Oh and shouldn't this be inverted?
> > > >
> > > > 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
> > > > 10 eagerness = we aren't eager to eat up none PTE entries at all?
> > > >
> > > > Or am I being dumb here?
> > >
> > > Good question.
> > >
> > > For swappiness it's: 0 -> no swap (conservative)
> > >
> > > So intuitively I assumed: 0 -> no pte_none (conservative)
> > >
> > > You're the native speaker, so you tell me :)
> >
> > To me this is about 'eagerness to consume empty PTE entries' so 10 is more
> > eager, 0 is not eager at all, i.e. inversion of what you suggest :)
>
> Just so we are on the same page: it is about "eagerness to collapse", right?
>
> Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am very
> careful and bail out on any pte_none" vs. 10 meaning "I am very eager, I
> will collapse no matter what I find in the page table, waste as much memory
> as I want"?

Yeah, this is my understanding of your scale, or is my understanding also
inverted? :)

Right now it's:

eagerness max_ptes_none

0 -> 511
...
10 -> 0

Right?

So we're saying, currently, 0 means 'I will tolerate up to 511 pte_none, and eat
them all I am very very eager', and 10 means 'I will not tolerate any pte_none'
right?

Correct me if I'm wrong here! :>)

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:23                   ` Lorenzo Stoakes
@ 2025-09-15 11:29                     ` David Hildenbrand
  2025-09-15 11:35                       ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 11:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 13:23, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 01:14:32PM +0200, David Hildenbrand wrote:
>> On 15.09.25 13:02, Lorenzo Stoakes wrote:
>>> On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
>>>> On 15.09.25 12:43, Lorenzo Stoakes wrote:
>>>>> On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> 0 -> ~100% used (~0% none)
>>>>>> 1 -> ~50% used (~50% none)
>>>>>> 2 -> ~25% used (~75% none)
>>>>>> 3 -> ~12.5% used (~87.5% none)
>>>>>> 4 -> ~11.25% used (~88,75% none)
>>>>>> ...
>>>>>> 10 -> ~0% used (~100% none)
>>>>>
>>>>> Oh and shouldn't this be inverted?
>>>>>
>>>>> 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
>>>>> 10 eagerness = we aren't eager to eat up none PTE entries at all?
>>>>>
>>>>> Or am I being dumb here?
>>>>
>>>> Good question.
>>>>
>>>> For swappiness it's: 0 -> no swap (conservative)
>>>>
>>>> So intuitively I assumed: 0 -> no pte_none (conservative)
>>>>
>>>> You're the native speaker, so you tell me :)
>>>
>>> To me this is about 'eagerness to consume empty PTE entries' so 10 is more
>>> eager, 0 is not eager at all, i.e. inversion of what you suggest :)
>>
>> Just so we are on the same page: it is about "eagerness to collapse", right?
>>
>> Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am very
>> careful and bail out on any pte_none" vs. 10 meaning "I am very eager, I
>> will collapse no matter what I find in the page table, waste as much memory
>> as I want"?
> 
> Yeah, this is my understanding of your scale, or is my understanding also
> inverted? :)
> 
> Right now it's:
> 
> eagerness max_ptes_none
> 
> 0 -> 511
> ...
> 10 -> 0
> 
> Right?

Just so we are on the same page, this is what I had:

0 -> ~100% used (~0% none)

So "0" -> 0 pte_none or 512 used.

(note the used vs. none)

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:29                     ` David Hildenbrand
@ 2025-09-15 11:35                       ` Lorenzo Stoakes
  2025-09-15 11:45                         ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 11:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 01:29:22PM +0200, David Hildenbrand wrote:
> On 15.09.25 13:23, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 01:14:32PM +0200, David Hildenbrand wrote:
> > > On 15.09.25 13:02, Lorenzo Stoakes wrote:
> > > > On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
> > > > > On 15.09.25 12:43, Lorenzo Stoakes wrote:
> > > > > > On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> > > > > > >
> > > > > > > 0 -> ~100% used (~0% none)
> > > > > > > 1 -> ~50% used (~50% none)
> > > > > > > 2 -> ~25% used (~75% none)
> > > > > > > 3 -> ~12.5% used (~87.5% none)
> > > > > > > 4 -> ~11.25% used (~88,75% none)
> > > > > > > ...
> > > > > > > 10 -> ~0% used (~100% none)
> > > > > >
> > > > > > Oh and shouldn't this be inverted?
> > > > > >
> > > > > > 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
> > > > > > 10 eagerness = we aren't eager to eat up none PTE entries at all?
> > > > > >
> > > > > > Or am I being dumb here?
> > > > >
> > > > > Good question.
> > > > >
> > > > > For swappiness it's: 0 -> no swap (conservative)
> > > > >
> > > > > So intuitively I assumed: 0 -> no pte_none (conservative)
> > > > >
> > > > > You're the native speaker, so you tell me :)
> > > >
> > > > To me this is about 'eagerness to consume empty PTE entries' so 10 is more
> > > > eager, 0 is not eager at all, i.e. inversion of what you suggest :)
> > >
> > > Just so we are on the same page: it is about "eagerness to collapse", right?
> > >
> > > Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am very
> > > careful and bail out on any pte_none" vs. 10 meaning "I am very eager, I
> > > will collapse no matter what I find in the page table, waste as much memory
> > > as I want"?
> >
> > Yeah, this is my understanding of your scale, or is my understanding also
> > inverted? :)
> >
> > Right now it's:
> >
> > eagerness max_ptes_none
> >
> > 0 -> 511
> > ...
> > 10 -> 0
> >
> > Right?
>
> Just so we are on the same page, this is what I had:
>
> 0 -> ~100% used (~0% none)
>
> So "0" -> 0 pte_none or 512 used.
>
> (note the used vs. none)

OK right so we're talking about the same thing, I guess?

I was confused partly because of the scale, because weren't people setting
this parameter to low values in practice?

And now we make it so we have equivalent of:

0 -> 0
1 -> 256
2 -> 384

With the logarithmic values more tightly bunched at the 'eager' end?

Weren't people setting max_ptes_none to like 20 or 30 or something? So we
should 'bunch' at the other end?

And also aren't we saying that anything over 256 is broken for mTHP? So
weren't we trying to avoid that?

I think it should be something like:

(eagerness -> max_pte_none)

0 -> 0
1 -> ~small %
2 -> ~slightly larger %
etc.
9 -> 50%
10 -> 100%

Right?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 10:22         ` David Hildenbrand
  2025-09-15 10:35           ` Lorenzo Stoakes
  2025-09-15 10:43           ` Lorenzo Stoakes
@ 2025-09-15 11:41           ` Nico Pache
  2025-09-15 12:59             ` David Hildenbrand
  2 siblings, 1 reply; 79+ messages in thread
From: Nico Pache @ 2025-09-15 11:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato

On Mon, Sep 15, 2025 at 4:22 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.09.25 11:22, Kiryl Shutsemau wrote:
> > On Fri, Sep 12, 2025 at 05:31:51PM -0600, Nico Pache wrote:
> >> On Fri, Sep 12, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> >>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >>>>> The following series provides khugepaged with the capability to collapse
> >>>>> anonymous memory regions to mTHPs.
> >>>>>
> >>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >>>>> range. The restriction on max_ptes_none is removed during the scan, to make
> >>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >>>>> by the attempted collapse order to determine how full a mTHP must be to be
> >>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
> >>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
> >>>>> to be enabled.
> >>>>>
> >>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >>>>> constantly promoting mTHPs to the next available size, which would occur
> >>>>> because a collapse introduces more non-zero pages that would satisfy the
> >>>>> promotion condition on subsequent scans.
> >>>>
> >>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> >>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> >>>>
> >>>
> >>> I am all for not adding any more ugliness on top of all the ugliness we
> >>> added in the past.
> >>>
> >>> I will soon propose deprecating that parameter in favor of something
> >>> that makes a bit more sense.
> >>>
> >>> In essence, we'll likely have an "eagerness" parameter that ranges from
> >>> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> >>> not all is populated".
> >> Hi David,
> >>
> >> Do you have any reason for 0-10, I'm guessing these will map to
> >> different max_ptes_none values.
> >> I suggest 0-5, mapping to 0,32,64,128,255,511
> >
> > That's too x86-64 specific.
It's technically formulated as:

X = (HPAGE_PMD_NR >> (5 - n)) - 1

where n is the value of eagerness and X is the number of none_ptes we allow,
so 5 == (512 >> 0) - 1 = 511
   4 == (512 >> 1) - 1 = 255
   3 == (512 >> 2) - 1 = 127
...

Any scale we use will suffer from inaccuracy. Currently this fits well into
the bitmap algorithm because the lower you go in the bitmap (smaller
orders), the less precise the effect of max_ptes_none (or any scale, for
that matter) becomes. For example: a 16kB mTHP is 4 pages, so you really
only have 4 options for the number of none_ptes you will allow, and any
scale will be rounded heavily towards the lower orders:

128 (max_ptes_none) >> (9 (pmd_order) - 2 (collapse order)) = 1 none pte allowed
255 >> 7 = 1 none_pte allowed

No value in between these has any effect, whereas

127 >> 7 = 0

So I think using a consistent scale relative to the number of PTEs in a
given mTHP is the most straightforward approach.


> >
> > And the whole idea is not to map to directly, but give kernel wiggle
> > room to play.
>
> Initially we will start out simple and map it directly. But yeah, the
> idea is to give us some more room later.
>
> I had something logarithmic in mind which would roughly be (ignoring the
> the weird -1 for simplicity and expressing it as "used" instead of
> none-or-zero)
>
> 0 -> ~100% used (~0% none)
> 1 -> ~50% used (~50% none)
> 2 -> ~25% used (~75% none)
> 3 -> ~12.5% used (~87.5% none)
> 4 -> ~11.25% used (~88,75% none)
> ...
> 10 -> ~0% used (~100% none)
I think this scale is too specific; it would be easier to map to the one
above, for the reasons stated there. There would be little to no benefit
to having such small adjustments between 4 and 10.

Let me know what you think
-- Nico
>
> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:35                       ` Lorenzo Stoakes
@ 2025-09-15 11:45                         ` David Hildenbrand
  2025-09-15 12:01                           ` Kiryl Shutsemau
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 11:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On 15.09.25 13:35, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 01:29:22PM +0200, David Hildenbrand wrote:
>> On 15.09.25 13:23, Lorenzo Stoakes wrote:
>>> On Mon, Sep 15, 2025 at 01:14:32PM +0200, David Hildenbrand wrote:
>>>> On 15.09.25 13:02, Lorenzo Stoakes wrote:
>>>>> On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
>>>>>> On 15.09.25 12:43, Lorenzo Stoakes wrote:
>>>>>>> On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> 0 -> ~100% used (~0% none)
>>>>>>>> 1 -> ~50% used (~50% none)
>>>>>>>> 2 -> ~25% used (~75% none)
>>>>>>>> 3 -> ~12.5% used (~87.5% none)
>>>>>>>> 4 -> ~11.25% used (~88,75% none)
>>>>>>>> ...
>>>>>>>> 10 -> ~0% used (~100% none)
>>>>>>>
>>>>>>> Oh and shouldn't this be inverted?
>>>>>>>
>>>>>>> 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
>>>>>>> 10 eagerness = we aren't eager to eat up none PTE entries at all?
>>>>>>>
>>>>>>> Or am I being dumb here?
>>>>>>
>>>>>> Good question.
>>>>>>
>>>>>> For swappiness it's: 0 -> no swap (conservative)
>>>>>>
>>>>>> So intuitively I assumed: 0 -> no pte_none (conservative)
>>>>>>
>>>>>> You're the native speaker, so you tell me :)
>>>>>
>>>>> To me this is about 'eagerness to consume empty PTE entries' so 10 is more
>>>>> eager, 0 is not eager at all, i.e. inversion of what you suggest :)
>>>>
>>>> Just so we are on the same page: it is about "eagerness to collapse", right?
>>>>
>>>> Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am very
>>>> careful and bail out on any pte_none" vs. 10 meaning "I am very eager, I
>>>> will collapse no matter what I find in the page table, waste as much memory
>>>> as I want"?
>>>
>>> Yeah, this is my understanding of your scale, or is my understanding also
>>> inverted? :)
>>>
>>> Right now it's:
>>>
>>> eagerness max_ptes_none
>>>
>>> 0 -> 511
>>> ...
>>> 10 -> 0
>>>
>>> Right?
>>
>> Just so we are on the same page, this is what I had:
>>
>> 0 -> ~100% used (~0% none)
>>
>> So "0" -> 0 pte_none or 512 used.
>>
>> (note the used vs. none)
> 
> OK right so we're talking about the same thing, I guess?
> 
> I was confused partly becuase of the scale, becuase weren't people setting
> this parameter to low values in practice?
> 
> And now we make it so we have equivalent of:
> 
> 0 -> 0
> 1 -> 256
> 2 -> 384

Ah, there is the problem, that's not what I had in mind.

0 -> ~100% used (~0% none)
...
7 -> ~87.5% used (~12.5% none)
8 -> ~75% used (~25% none)
9 -> ~50% used (~50% none)
10 -> ~0% used (~100% none)

Hopefully I didn't mess it up again.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:45                         ` David Hildenbrand
@ 2025-09-15 12:01                           ` Kiryl Shutsemau
  2025-09-15 12:09                             ` Lorenzo Stoakes
  0 siblings, 1 reply; 79+ messages in thread
From: Kiryl Shutsemau @ 2025-09-15 12:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 01:45:39PM +0200, David Hildenbrand wrote:
> On 15.09.25 13:35, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 01:29:22PM +0200, David Hildenbrand wrote:
> > > On 15.09.25 13:23, Lorenzo Stoakes wrote:
> > > > On Mon, Sep 15, 2025 at 01:14:32PM +0200, David Hildenbrand wrote:
> > > > > On 15.09.25 13:02, Lorenzo Stoakes wrote:
> > > > > > On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
> > > > > > > On 15.09.25 12:43, Lorenzo Stoakes wrote:
> > > > > > > > On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> > > > > > > > > 
> > > > > > > > > 0 -> ~100% used (~0% none)
> > > > > > > > > 1 -> ~50% used (~50% none)
> > > > > > > > > 2 -> ~25% used (~75% none)
> > > > > > > > > 3 -> ~12.5% used (~87.5% none)
> > > > > > > > > 4 -> ~11.25% used (~88,75% none)
> > > > > > > > > ...
> > > > > > > > > 10 -> ~0% used (~100% none)
> > > > > > > > 
> > > > > > > > Oh and shouldn't this be inverted?
> > > > > > > > 
> > > > > > > > 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
> > > > > > > > 10 eagerness = we aren't eager to eat up none PTE entries at all?
> > > > > > > > 
> > > > > > > > Or am I being dumb here?
> > > > > > > 
> > > > > > > Good question.
> > > > > > > 
> > > > > > > For swappiness it's: 0 -> no swap (conservative)
> > > > > > > 
> > > > > > > So intuitively I assumed: 0 -> no pte_none (conservative)
> > > > > > > 
> > > > > > > You're the native speaker, so you tell me :)
> > > > > > 
> > > > > > To me this is about 'eagerness to consume empty PTE entries' so 10 is more
> > > > > > eager, 0 is not eager at all, i.e. inversion of what you suggest :)
> > > > > 
> > > > > Just so we are on the same page: it is about "eagerness to collapse", right?
> > > > > 
> > > > > Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am very
> > > > > careful and bail out on any pte_none" vs. 10 meaning "I am very eager, I
> > > > > will collapse no matter what I find in the page table, waste as much memory
> > > > > as I want"?
> > > > 
> > > > Yeah, this is my understanding of your scale, or is my understanding also
> > > > inverted? :)
> > > > 
> > > > Right now it's:
> > > > 
> > > > eagerness max_ptes_none
> > > > 
> > > > 0 -> 511
> > > > ...
> > > > 10 -> 0
> > > > 
> > > > Right?
> > > 
> > > Just so we are on the same page, this is what I had:
> > > 
> > > 0 -> ~100% used (~0% none)
> > > 
> > > So "0" -> 0 pte_none or 512 used.
> > > 
> > > (note the used vs. none)
> > 
> > OK right so we're talking about the same thing, I guess?
> > 
> > I was confused partly becuase of the scale, becuase weren't people setting
> > this parameter to low values in practice?
> > 
> > And now we make it so we have equivalent of:
> > 
> > 0 -> 0
> > 1 -> 256
> > 2 -> 384
> 
> Ah, there is the problem, that's not what I had in mind.
> 
> 0 -> ~100% used (~0% none)
> ...
> 7 -> ~87.5% used (~12.5% none)
> 8 -> ~75% used (~25% none)
> 9 -> ~50% used (~50% none)
> 10 -> ~0% used (~100% none)
> 
> Hopefully I didn't mess it up again.

I think this kind of table is fine for an initial implementation of the
knob, but we don't want to document it to userspace like this.
I think we want to be strategically ambiguous about what the knob does
exactly, so the kernel can evolve the meaning of the knob over time.

We don't want to repeat the problem we have with max_ptes_none, which is too
prescriptive and got additional meaning with the introduction of the shrinker.

As the kernel evolves, we want the ability to adjust the meaning and keep
the knob useful.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 12:01                           ` Kiryl Shutsemau
@ 2025-09-15 12:09                             ` Lorenzo Stoakes
  0 siblings, 0 replies; 79+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 12:09 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: David Hildenbrand, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt, jannh,
	pfalcato

On Mon, Sep 15, 2025 at 01:01:26PM +0100, Kiryl Shutsemau wrote:
> On Mon, Sep 15, 2025 at 01:45:39PM +0200, David Hildenbrand wrote:
> > On 15.09.25 13:35, Lorenzo Stoakes wrote:
> > > On Mon, Sep 15, 2025 at 01:29:22PM +0200, David Hildenbrand wrote:
> > > > On 15.09.25 13:23, Lorenzo Stoakes wrote:
> > > > > On Mon, Sep 15, 2025 at 01:14:32PM +0200, David Hildenbrand wrote:
> > > > > > On 15.09.25 13:02, Lorenzo Stoakes wrote:
> > > > > > > On Mon, Sep 15, 2025 at 12:52:03PM +0200, David Hildenbrand wrote:
> > > > > > > > On 15.09.25 12:43, Lorenzo Stoakes wrote:
> > > > > > > > > On Mon, Sep 15, 2025 at 12:22:07PM +0200, David Hildenbrand wrote:
> > > > > > > > > >
> > > > > > > > > > 0 -> ~100% used (~0% none)
> > > > > > > > > > 1 -> ~50% used (~50% none)
> > > > > > > > > > 2 -> ~25% used (~75% none)
> > > > > > > > > > 3 -> ~12.5% used (~87.5% none)
> > > > > > > > > > 4 -> ~11.25% used (~88.75% none)
> > > > > > > > > > ...
> > > > > > > > > > 10 -> ~0% used (~100% none)
> > > > > > > > >
> > > > > > > > > Oh and shouldn't this be inverted?
> > > > > > > > >
> > > > > > > > > 0 eagerness = we eat up all none PTE entries? Isn't that pretty eager? :P
> > > > > > > > > 10 eagerness = we aren't eager to eat up none PTE entries at all?
> > > > > > > > >
> > > > > > > > > Or am I being dumb here?
> > > > > > > >
> > > > > > > > Good question.
> > > > > > > >
> > > > > > > > For swappiness it's: 0 -> no swap (conservative)
> > > > > > > >
> > > > > > > > So intuitively I assumed: 0 -> no pte_none (conservative)
> > > > > > > >
> > > > > > > > You're the native speaker, so you tell me :)
> > > > > > >
> > > > > > > To me this is about 'eagerness to consume empty PTE entries' so 10 is more
> > > > > > > eager, 0 is not eager at all, i.e. inversion of what you suggest :)
> > > > > >
> > > > > > Just so we are on the same page: it is about "eagerness to collapse", right?
> > > > > >
> > > > > > Wouldn't a 0 mean "I am not eager, I will not waste any memory, I am very
> > > > > > careful and bail out on any pte_none" vs. 10 meaning "I am very eager, I
> > > > > > will collapse no matter what I find in the page table, waste as much memory
> > > > > > as I want"?
> > > > >
> > > > > Yeah, this is my understanding of your scale, or is my understanding also
> > > > > inverted? :)
> > > > >
> > > > > Right now it's:
> > > > >
> > > > > eagerness max_ptes_none
> > > > >
> > > > > 0 -> 511
> > > > > ...
> > > > > 10 -> 0
> > > > >
> > > > > Right?
> > > >
> > > > Just so we are on the same page, this is what I had:
> > > >
> > > > 0 -> ~100% used (~0% none)
> > > >
> > > > So "0" -> 0 pte_none or 512 used.
> > > >
> > > > (note the used vs. none)
> > >
> > > OK right so we're talking about the same thing, I guess?
> > >
> > > I was confused partly because of the scale, because weren't people setting
> > > this parameter to low values in practice?
> > >
> > > And now we make it so we have equivalent of:
> > >
> > > 0 -> 0
> > > 1 -> 256
> > > 2 -> 384
> >
> > Ah, there is the problem, that's not what I had in mind.
> >
> > 0 -> ~100% used (~0% none)
> > ...
> > 8 -> ~87.5% used (~12.5% none)
> > 9 -> ~75% used (~25% none)
> > 9 -> ~50% used (~50% none)
> > 10 -> ~0% used (~100% none)
> >
> > Hopefully I didn't mess it up again.
>
> I think this kind of table is fine for initial implementation of the
> knob, but we don't want to document it to userspace like this.
> I think we want to be strategically ambiguous on what the knob does
> exactly, so kernel could evolve the meaning of the knob over time.
>
> We don't want to repeat the problem we have with max_ptes_none, which is
> too prescriptive and got additional meaning with the introduction of the
> shrinker.
>
> As the kernel evolves, we want the ability to adjust the meaning and keep
> the knob useful.

I mean, having said this exact thing several times in the thread obviously
I agree... FWIW...

To repeat, I think it should be an abstraction that we entirely control and
whose meaning we can vary over time.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:10                       ` David Hildenbrand
  2025-09-15 11:13                         ` Lorenzo Stoakes
@ 2025-09-15 12:16                         ` Usama Arif
  1 sibling, 0 replies; 79+ messages in thread
From: Usama Arif @ 2025-09-15 12:16 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
	dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
	baohua, willy, peterx, wangkefeng.wang, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, aarcange, raquini, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
	richard.weiyang, lance.yang, vbabka, rppt, jannh, pfalcato



On 15/09/2025 12:10, David Hildenbrand wrote:
> On 15.09.25 12:59, Lorenzo Stoakes wrote:
>> On Mon, Sep 15, 2025 at 12:52:53PM +0200, David Hildenbrand wrote:
>>> On 15.09.25 12:48, Lorenzo Stoakes wrote:
>>>> On Mon, Sep 15, 2025 at 12:44:34PM +0200, David Hildenbrand wrote:
>>>>>>>> Mapping that to actual THP sizes (#pages in a thp) on an arch will be easy.
>>>>>>>
>>>>>>> And at different mTHP levels too right?
>>>>>>>
>>>>>>
>>>>>> Another point here, since we have to keep:
>>>>>>
>>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>>>
>>>>>> Around, and users will try to set values there, presumably we will now add:
>>>>>>
>>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/eagerness
>>>>>>
>>>>>> How will we map <-> the two tunables?
>>>>>
>>>>> Well, the easy case is if someone updates eagerness: then we simply set it
>>>>> to whatever magic value we compute and document.
>>>>>
>>>>> The other direction is more problematic, likely we'll simply warn and do
>>>>> something reasonable (map it to whatever eagerness scale is closest or
>>>>> simply indicate it as "-1" -- user intervened or sth like that)
>>>>
>>>> I don't love the idea of a -1 situation, as that's going to create some
>>>> confusion.
>>>
>>> swappiness also has a "max" parameter, so we could just say "override" /"
>>> disabled" / whatever?
>>
>> I don't love the user being able to override this though, let's just nuke their
>> ability to set this pleeeease.

Do you mean stop people from changing max_ptes_none? I am not sure that's a good idea.
It has existed for a very long time, and even a few releases' worth of warnings might
not be enough for sysadmins who don't have a kernel team to notice the change.

If the eagerness solution is just a logarithmic mapping of max_ptes_none at the start, I do
think we need to keep max_ptes_none completely supported, as eagerness isn't really doing
anything new. Only once eagerness diverges from simply setting max_ptes_none should we
start thinking about deprecating it.


>>
>> Because if they can override it, then we have to do some deeply nasty scaling
>> for mTHP again.
> 
> There are ways to have it working internally, just using a different "scale" instead of the 100 -> 50 -> 25 etc.
> 
> I am afraid we cannot easily change the parameter to ignore other values because of the interaction with the shrinker ... we might be able to deprecate it and wait a bunch of kernel releases, probably.
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread
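The "other direction" problem raised above (a user writing max_ptes_none directly after eagerness exists) can be sketched as follows. The table and the -1 "override" sentinel are assumptions for illustration, not a proposed interface:

```python
# Illustrative sketch: map a user-written max_ptes_none back to an
# eagerness level, or flag it as a manual override. Not a proposed ABI.
HPAGE_PMD_NR = 512  # assumption: 4K pages

# Hypothetical eagerness -> max_ptes_none table (real values unspecified).
EAGERNESS_TO_MAX_PTES_NONE = {0: 0, 8: 64, 9: 256, 10: 511}
OVERRIDE = -1  # sentinel: the user wrote a value no eagerness level produces

def eagerness_for(max_ptes_none, closest=False):
    levels = EAGERNESS_TO_MAX_PTES_NONE
    for level, value in levels.items():
        if value == max_ptes_none:
            return level
    if closest:
        # Alternative policy: snap to the nearest eagerness level.
        return min(levels, key=lambda lvl: abs(levels[lvl] - max_ptes_none))
    return OVERRIDE

print(eagerness_for(256))                # 9
print(eagerness_for(300))                # -1 (override)
print(eagerness_for(300, closest=True))  # 9
```

Both policies floated in the thread (report an override sentinel vs. snap to the closest level) are shown; which one, if either, gets used is left open above.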

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 11:41           ` Nico Pache
@ 2025-09-15 12:59             ` David Hildenbrand
  0 siblings, 0 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 12:59 UTC (permalink / raw)
  To: Nico Pache
  Cc: Kiryl Shutsemau, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, hughd, richard.weiyang, lance.yang,
	vbabka, rppt, jannh, pfalcato


>> ...
>> 10 -> ~0% used (~100% none)
> I think this scale is too specific, I think it would be easier to map
> to the one above for the reasons stated there. There would be little
> to no benefit to having such small adjustments between 4-10

It's probably best to discuss that once I have something more concrete 
to share :)

I still have to think about some cases I have in mind, and once I've figured 
them out (and had time to do so ...) I'll post something.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-12 13:46       ` David Hildenbrand
  2025-09-12 14:01         ` Lorenzo Stoakes
  2025-09-12 15:15         ` Pedro Falcato
@ 2025-09-15 13:43         ` Johannes Weiner
  2025-09-15 14:45           ` David Hildenbrand
  2 siblings, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2025-09-15 13:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh, pfalcato

On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
> On 12.09.25 15:37, Johannes Weiner wrote:
> > On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
> >> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> >>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >>>> The following series provides khugepaged with the capability to collapse
> >>>> anonymous memory regions to mTHPs.
> >>>>
> >>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >>>> range. The restriction on max_ptes_none is removed during the scan, to make
> >>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >>>> by the attempted collapse order to determine how full a mTHP must be to be
> >>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >>>> contains swapped out, or shared pages, we don't perform the collapse. It is
> >>>> now also possible to collapse to mTHPs without requiring the PMD THP size
> >>>> to be enabled.
> >>>>
> >>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >>>> constantly promoting mTHPs to the next available size, which would occur
> >>>> because a collapse introduces more non-zero pages that would satisfy the
> >>>> promotion condition on subsequent scans.
> >>>
> >>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> >>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> >>>
> >>
> >> I am all for not adding any more ugliness on top of all the ugliness we
> >> added in the past.
> >>
> >> I will soon propose deprecating that parameter in favor of something
> >> that makes a bit more sense.
> >>
> >> In essence, we'll likely have an "eagerness" parameter that ranges from
> >> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
> >> not all is populated".
> >>
> >> In between we will have more flexibility on how to set these values.
> >>
> >> Likely 9 will be around 50% to not even motivate the user to set
> >> something that does not make sense (creep).
> > 
> > One observation we've had from production experiments is that the
> > optimal number here isn't static. If you have plenty of memory, then
> > even very sparse THPs are beneficial.
> 
> Exactly.
> 
> And willy suggested something like "eagerness" similar to "swappiness" 
> that gives us more flexibility when implementing it, including 
> dynamically adjusting the values in the future.

I think we talked past each other a bit here. The point I was trying
to make is that the optimal behavior depends on the pressure situation
inside the kernel; it's fundamentally not something userspace can make
informed choices about.

So for max_ptes_none, the approach is basically: try a few settings
and see which one performs best. Okay, not great. But wouldn't that be
the same for an eagerness setting? What would be the mental model for
the user when configuring this? If it's the same empirical approach,
then the new knob would seem like a lateral move.

It would also be difficult to change the implementation without
risking regressions once production systems are tuned to the old
behavior.

> > An extreme example: if all your THPs have 2/512 pages populated,
> > that's still cutting TLB pressure in half!
> 
> IIRC, you create more pressure on the huge entries, where you might have 
> less TLB entries :) But yes, there can be cases where it is beneficial, 
> if there is absolutely no memory pressure.

Ha, the TLB topology is a whole other can of worms.

We've tried deploying THP on older systems with separate TLB entries
for different page sizes and gave up. It's a nightmare to configure
and very easy to do worse than base pages.

The kernel itself is using a mix of page sizes for the identity
mapping. You basically have to complement the userspace page size
distribution in such a way that you don't compete over the wrong
entries at runtime. It's just stupid. I'm honestly not sure this is
realistically solvable.

So we're deploying THP only on newer AMD machines where TLB entries
are shared.

For split TLBs, we're sticking with hugetlb and trial-and-error.

Please don't build CPUs this way.

^ permalink raw reply	[flat|nested] 79+ messages in thread
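The order-scaling and capping behaviour from the cover letter quoted above can be sketched as follows. The constants assume 4K pages, and the helper is a simplified stand-in inspired by the series' collapse_max_ptes_none() helper (patch 06), not its actual code:

```python
# Simplified sketch of scaling max_ptes_none by the attempted collapse
# order, with the HPAGE_PMD_NR/2 - 1 cap against collapse "creep".
HPAGE_PMD_ORDER = 9                   # assumption: 4K pages, 2MiB PMD
HPAGE_PMD_NR = 1 << HPAGE_PMD_ORDER   # 512

def scaled_max_ptes_none(max_ptes_none, order):
    """Scale the PMD-order max_ptes_none down to a smaller mTHP order."""
    if order != HPAGE_PMD_ORDER and max_ptes_none >= HPAGE_PMD_NR // 2:
        # Cap for mTHP collapses: otherwise each collapse fills in enough
        # pages to satisfy the next size up on the following scan.
        max_ptes_none = HPAGE_PMD_NR // 2 - 1
    # A 2^order collapse covers 1/2^(PMD_ORDER - order) of the PMD range.
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)

print(scaled_max_ptes_none(511, 9))  # 511: PMD collapse, cap not applied
print(scaled_max_ptes_none(511, 4))  # 7:   capped to 255, then >> 5
print(scaled_max_ptes_none(64, 4))   # 2:   below the cap, just scaled
```

This is why a >= 50% max_ptes_none is what enables the "creep" behaviour: each promotion leaves the range more than half populated, re-qualifying it for the next order.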

* Re: [PATCH v11 00/15] khugepaged: mTHP support
  2025-09-15 13:43         ` Johannes Weiner
@ 2025-09-15 14:45           ` David Hildenbrand
  0 siblings, 0 replies; 79+ messages in thread
From: David Hildenbrand @ 2025-09-15 14:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kiryl Shutsemau, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
	mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, rientjes,
	mhocko, rdunlap, hughd, richard.weiyang, lance.yang, vbabka, rppt,
	jannh, pfalcato

On 15.09.25 15:43, Johannes Weiner wrote:
> On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
>> On 12.09.25 15:37, Johannes Weiner wrote:
>>> On Fri, Sep 12, 2025 at 02:25:31PM +0200, David Hildenbrand wrote:
>>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
>>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
>>>>>> The following series provides khugepaged with the capability to collapse
>>>>>> anonymous memory regions to mTHPs.
>>>>>>
>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
>>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
>>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
>>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
>>>>>> by the attempted collapse order to determine how full a mTHP must be to be
>>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
>>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
>>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
>>>>>> to be enabled.
>>>>>>
>>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
>>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
>>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
>>>>>> constantly promoting mTHPs to the next available size, which would occur
>>>>>> because a collapse introduces more non-zero pages that would satisfy the
>>>>>> promotion condition on subsequent scans.
>>>>>
>>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
>>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
>>>>>
>>>>
>>>> I am all for not adding any more ugliness on top of all the ugliness we
>>>> added in the past.
>>>>
>>>> I will soon propose deprecating that parameter in favor of something
>>>> that makes a bit more sense.
>>>>
>>>> In essence, we'll likely have an "eagerness" parameter that ranges from
>>>> 0 to 10. 10 is essentially "always collapse" and 0 "never collapse if
>>>> not all is populated".
>>>>
>>>> In between we will have more flexibility on how to set these values.
>>>>
>>>> Likely 9 will be around 50% to not even motivate the user to set
>>>> something that does not make sense (creep).
>>>
>>> One observation we've had from production experiments is that the
>>> optimal number here isn't static. If you have plenty of memory, then
>>> even very sparse THPs are beneficial.
>>
>> Exactly.
>>
>> And willy suggested something like "eagerness" similar to "swappiness"
>> that gives us more flexibility when implementing it, including
>> dynamically adjusting the values in the future.
> 
> I think we talked past each other a bit here. The point I was trying
> to make is that the optimal behavior depends on the pressure situation
> inside the kernel; it's fundamentally not something userspace can make
> informed choices about.

I don't think the "no tunable at all" approach solely based on pressure 
will be workable in the foreseeable future.

Collapsing 2 pages to a 2 MiB THP all over the system just to split it 
immediately again is not particularly helpful.

So long term I assume the eagerness will work together with memory 
pressure and probably some other inputs.

> 
> So for max_ptes_none, the approach is basically: try a few settings
> and see which one performs best. Okay, not great. But wouldn't that be
> the same for an eagerness setting? What would be the mental model for
> the user when configuring this? If it's the same empirical approach,
> then the new knob would seem like a lateral move.

Consider it a replacement for something that is oddly PMD-specific and 
requires you to punch in magical values (e.g., 511 on x86, 2047 on arm64 
64k).

Initially I thought about just using a percentage/scale of (m)THP but 
Willy argued that something more abstract gives us more wiggle room.

Yes, for some workloads you will likely still have to fine-tune 
parameters (honestly, I don't think many companies besides Meta are 
doing that), but the idea is to evolve it over time to something that is 
smarter than punching in magic values into an obscure interface.

> 
> It would also be difficult to change the implementation without
> risking regressions once production systems are tuned to the old
> behavior.

Companies like Meta that do such a level of fine-tuning probably use the 
old nasty interface because they know exactly what they are doing.

That is a corner case, though.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 79+ messages in thread
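The bitmap binary recursion described in the cover letter quoted above can be sketched roughly like this. It is a heavily simplified illustration, not the series' actual algorithm: enabled-order checks, swapped/shared-page bailouts, and the per-order max_ptes_none scaling are all omitted, and MIN_ORDER is an assumption:

```python
# Simplified sketch: after the PMD scan records which PTEs are occupied,
# recurse over halves of the range, collapsing the largest sub-range
# whose occupancy meets the (here fractional) none threshold.
MIN_ORDER = 2  # assumption: smallest mTHP order worth collapsing

def find_collapses(occupied, lo, hi, max_none_frac, out):
    """Recursively pick sub-ranges of [lo, hi) dense enough to collapse."""
    npages = hi - lo
    nr_none = npages - sum(occupied[lo:hi])
    if nr_none <= max_none_frac * npages:
        out.append((lo, hi))   # dense enough: collapse this whole range
        return
    if npages <= (1 << MIN_ORDER):
        return                 # too sparse and too small: give up
    mid = lo + npages // 2     # binary recursion into both halves
    find_collapses(occupied, lo, mid, max_none_frac, out)
    find_collapses(occupied, mid, hi, max_none_frac, out)

# Toy 16-PTE range: first half fully populated, second half empty.
occupied = [1] * 8 + [0] * 8
ranges = []
find_collapses(occupied, 0, 16, 0.25, ranges)
print(ranges)  # [(0, 8)]
```

The recursion finds that the full range is too sparse, collapses the dense first half as one smaller mTHP, and leaves the empty second half alone.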

end of thread, other threads:[~2025-09-15 14:45 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-12  3:27 [PATCH v11 00/15] khugepaged: mTHP support Nico Pache
2025-09-12  3:27 ` [PATCH v11 01/15] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2025-09-12  3:27 ` [PATCH v11 02/15] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-09-12  3:27 ` [PATCH v11 03/15] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-09-12  3:27 ` [PATCH v11 04/15] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-09-12  3:28 ` [PATCH v11 05/15] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-09-12  3:28 ` [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
2025-09-12 13:35   ` Lorenzo Stoakes
2025-09-12 23:26     ` Nico Pache
2025-09-15 10:30       ` Lorenzo Stoakes
2025-09-12  3:28 ` [PATCH v11 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2025-09-12  3:28 ` [PATCH v11 08/15] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-09-12  3:28 ` [PATCH v11 09/15] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2025-09-12  9:35   ` Baolin Wang
2025-09-12  3:28 ` [PATCH v11 10/15] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-09-12  3:28 ` [PATCH v11 11/15] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
2025-09-12  9:24   ` Baolin Wang
2025-09-12  3:28 ` [PATCH v11 12/15] khugepaged: Introduce mTHP collapse support Nico Pache
2025-09-12  3:28 ` [PATCH v11 13/15] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-09-12  3:28 ` [PATCH v11 14/15] khugepaged: run khugepaged for all orders Nico Pache
2025-09-12  3:28 ` [PATCH v11 15/15] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-09-12  8:43 ` [PATCH v11 00/15] khugepaged: mTHP support Lorenzo Stoakes
2025-09-12 12:19 ` Kiryl Shutsemau
2025-09-12 12:25   ` David Hildenbrand
2025-09-12 13:37     ` Johannes Weiner
2025-09-12 13:46       ` David Hildenbrand
2025-09-12 14:01         ` Lorenzo Stoakes
2025-09-12 15:35           ` Pedro Falcato
2025-09-12 15:45             ` Lorenzo Stoakes
2025-09-12 15:15         ` Pedro Falcato
2025-09-12 15:38           ` Kiryl Shutsemau
2025-09-12 15:43             ` David Hildenbrand
2025-09-12 15:44             ` Kiryl Shutsemau
2025-09-12 15:51               ` David Hildenbrand
2025-09-15 13:43         ` Johannes Weiner
2025-09-15 14:45           ` David Hildenbrand
2025-09-12 23:31     ` Nico Pache
2025-09-15  9:22       ` Kiryl Shutsemau
2025-09-15 10:22         ` David Hildenbrand
2025-09-15 10:35           ` Lorenzo Stoakes
2025-09-15 10:39             ` David Hildenbrand
2025-09-15 10:40             ` Lorenzo Stoakes
2025-09-15 10:44               ` David Hildenbrand
2025-09-15 10:48                 ` Lorenzo Stoakes
2025-09-15 10:52                   ` David Hildenbrand
2025-09-15 10:59                     ` Lorenzo Stoakes
2025-09-15 11:10                       ` David Hildenbrand
2025-09-15 11:13                         ` Lorenzo Stoakes
2025-09-15 11:16                           ` David Hildenbrand
2025-09-15 12:16                         ` Usama Arif
2025-09-15 10:43           ` Lorenzo Stoakes
2025-09-15 10:52             ` David Hildenbrand
2025-09-15 11:02               ` Lorenzo Stoakes
2025-09-15 11:14                 ` David Hildenbrand
2025-09-15 11:23                   ` Lorenzo Stoakes
2025-09-15 11:29                     ` David Hildenbrand
2025-09-15 11:35                       ` Lorenzo Stoakes
2025-09-15 11:45                         ` David Hildenbrand
2025-09-15 12:01                           ` Kiryl Shutsemau
2025-09-15 12:09                             ` Lorenzo Stoakes
2025-09-15 11:41           ` Nico Pache
2025-09-15 12:59             ` David Hildenbrand
2025-09-12 13:47   ` David Hildenbrand
2025-09-12 14:28     ` David Hildenbrand
2025-09-12 14:35       ` Kiryl Shutsemau
2025-09-12 14:56         ` David Hildenbrand
2025-09-12 15:41           ` Kiryl Shutsemau
2025-09-12 15:45             ` David Hildenbrand
2025-09-12 15:51               ` Lorenzo Stoakes
2025-09-12 17:53                 ` David Hildenbrand
2025-09-12 18:21                   ` Lorenzo Stoakes
2025-09-13  0:28                     ` Nico Pache
2025-09-15 10:44                       ` Lorenzo Stoakes
2025-09-15 10:25                     ` David Hildenbrand
2025-09-15 10:32                       ` Lorenzo Stoakes
2025-09-15 10:37                         ` David Hildenbrand
2025-09-15 10:46                           ` Lorenzo Stoakes
2025-09-13  0:18                   ` Nico Pache
2025-09-12 23:35     ` Nico Pache
