linux-mm.kvack.org archive mirror
* [RFC v2 0/9] khugepaged: mTHP support
@ 2025-02-11  0:30 Nico Pache
  2025-02-11  0:30 ` [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
                   ` (12 more replies)
  0 siblings, 13 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

The following series provides khugepaged and madvise_collapse with the
capability to collapse regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then, during the PMD scan, we keep track of chunks of pages
(defined by MIN_MTHP_ORDER) that are utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction
on max_ptes_none is removed during the scan, to make sure we account for
the whole PMD range. max_ptes_none is then scaled by the attempted collapse
order to determine how full a THP must be to be eligible. If an mTHP
collapse is attempted but the range contains swapped-out or shared pages,
we don't perform the collapse.

With the default max_ptes_none=511, the code should keep most of its
original behavior. To exercise mTHP collapse you need to set
max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will experience
collapse "creep", constantly promoting mTHPs to the next available size:
once more than half of an order's PTEs may be non-present, a freshly
collapsed mTHP is by itself enough to make the next larger order eligible.
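
For reference, here is a minimal userspace sketch of the scaling arithmetic
(the shift mirrors the scaled max_ptes_none computation in the patches; the
4K page / 512-PTE PMD geometry and max_ptes_none=255 are example values, not
part of the series):

/* Standalone sketch: how max_ptes_none scales with the collapse order.
 * Assumes 4K base pages, i.e. HPAGE_PMD_ORDER = 9 and HPAGE_PMD_NR = 512.
 * Build with: gcc -o scale scale.c
 */
#include <stdio.h>

#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

int main(void)
{
	unsigned int max_ptes_none = 255;	/* example sysfs value */
	int order;

	for (order = 3; order <= HPAGE_PMD_ORDER; order++) {
		/* scale the PMD-wide limit down to the attempted order */
		unsigned int scaled_none = max_ptes_none >> (HPAGE_PMD_ORDER - order);
		unsigned int nr_ptes = 1 << order;

		printf("order %d: %u PTEs, up to %u may be none -> at least %u present\n",
		       order, nr_ptes, scaled_none, nr_ptes - scaled_none);
	}
	return 0;
}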

Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
Patch 2:     Refactor/rename hpage_collapse
Patch 3-5:   Generalize khugepaged functions for arbitrary orders
Patch 6-9:   The mTHP patches

---------
 Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits while
   monitoring a number of stats and tracepoints. The code is available
   here[1] (run in legacy mode for these changes and set mTHP sizes to
   inherit). The summary from my testing was that there was no significant
   regression noticed through this test. In some cases my changes had better
   collapse latencies and were able to scan more pages in the same amount of
   time/work, but for the most part the results were consistent.
- redis testing. I tested these changes along with my defer changes
  (see the follow-up post for more details).
- some basic testing on 64k page size.
- lots of general use. These changes have been running in my VM for some time.

Changes since V1 [2]:
- Minor bug fixes discovered during review and testing
- removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k page size.
- Updated trace events to include collapsing order info.
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
- No longer require a chunk to be fully utilized before setting the bit. Use
   the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps prevent
   some of the "creep" that was discovered in v1.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/

Nico Pache (9):
  introduce khugepaged_collapse_single_pmd to unify khugepaged and
    madvise_collapse
  khugepaged: rename hpage_collapse_* to khugepaged_*
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize alloc_charge_folio for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: skip collapsing mTHP to smaller orders

 include/linux/khugepaged.h         |   4 +
 include/trace/events/huge_memory.h |  34 ++-
 mm/khugepaged.c                    | 422 +++++++++++++++++++----------
 3 files changed, 306 insertions(+), 154 deletions(-)

-- 
2.48.1




* [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-17 17:11   ` Usama Arif
  2025-02-18 16:26   ` Ryan Roberts
  2025-02-11  0:30 ` [RFC v2 2/9] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.

Create khugepaged_collapse_single_pmd to increase code
reuse and create an entry point for future khugepaged changes.

Refactor madvise_collapse and khugepaged_scan_mm_slot to use
the new khugepaged_collapse_single_pmd function.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
 1 file changed, 50 insertions(+), 46 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5f0be134141e..46faee67378b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2365,6 +2365,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
+				   struct vm_area_struct *vma, bool *mmap_locked,
+				   struct collapse_control *cc)
+{
+	int result = SCAN_FAIL;
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+
+	if (!*mmap_locked) {
+		mmap_read_lock(mm);
+		*mmap_locked = true;
+	}
+
+	if (thp_vma_allowable_order(vma, vma->vm_flags,
+					tva_flags, PMD_ORDER)) {
+		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
+			struct file *file = get_file(vma->vm_file);
+			pgoff_t pgoff = linear_page_index(vma, addr);
+
+			mmap_read_unlock(mm);
+			*mmap_locked = false;
+			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
+							  cc);
+			fput(file);
+			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+				mmap_read_lock(mm);
+				if (hpage_collapse_test_exit_or_disable(mm))
+					goto end;
+				result = collapse_pte_mapped_thp(mm, addr,
+								 !cc->is_khugepaged);
+				mmap_read_unlock(mm);
+			}
+		} else {
+			result = hpage_collapse_scan_pmd(mm, vma, addr,
+							 mmap_locked, cc);
+		}
+		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
+			++khugepaged_pages_collapsed;
+	}
+end:
+	return result;
+}
+
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2439,33 +2485,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
 
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
+			*result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
+						mm, vma, &mmap_locked, cc);
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2785,36 +2807,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
 		nodes_clear(cc->alloc_nmask);
-		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
+		result = khugepaged_collapse_single_pmd(addr, mm, vma, &mmap_locked, cc);
+
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
 		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			BUG_ON(*prev);
-			mmap_read_lock(mm);
-			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
-		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
-- 
2.48.1




* [RFC v2 2/9] khugepaged: rename hpage_collapse_* to khugepaged_*
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
  2025-02-11  0:30 ` [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-18 16:29   ` Ryan Roberts
  2025-02-11  0:30 ` [RFC v2 3/9] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

Functions in khugepaged.c use a mix of hpage_collapse and khugepaged
as the function prefix.

Rename all of them to khugepaged to keep things consistent and slightly
shorten the function names.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 52 ++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46faee67378b..4c88d17250f4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
 	kmem_cache_destroy(mm_slot_cache);
 }
 
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
 {
-	return hpage_collapse_test_exit(mm) ||
+	return khugepaged_test_exit(mm) ||
 	       test_bit(MMF_DISABLE_THP, &mm->flags);
 }
 
@@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 	int wakeup;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
 		return;
 
@@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * khugepaged_test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we return all
 		 * pagetables will be destroyed) until khugepaged has finished
 		 * working on the pagetables under the mmap_lock.
@@ -606,7 +606,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
-		/* See hpage_collapse_scan_pmd(). */
+		/* See khugepaged_scan_pmd(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
 			if (cc->is_khugepaged &&
@@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 };
 
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -992,7 +992,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1078,7 +1078,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
-	int node = hpage_collapse_find_target_node(cc);
+	int node = khugepaged_find_target_node(cc);
 	struct folio *folio;
 
 	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1264,7 +1264,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+static int khugepaged_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
 				   struct collapse_control *cc)
@@ -1380,7 +1380,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1449,7 +1449,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (khugepaged_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
@@ -1744,7 +1744,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 
-		if (hpage_collapse_test_exit(mm))
+		if (khugepaged_test_exit(mm))
 			continue;
 		/*
 		 * When a vma is registered with uffd-wp, we cannot recycle
@@ -2266,7 +2266,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2311,7 +2311,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2357,7 +2357,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 #else
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2389,19 +2389,19 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
 
 			mmap_read_unlock(mm);
 			*mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
+			result = khugepaged_scan_file(mm, addr, file, pgoff,
 							  cc);
 			fput(file);
 			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
 				mmap_read_lock(mm);
-				if (hpage_collapse_test_exit_or_disable(mm))
+				if (khugepaged_test_exit_or_disable(mm))
 					goto end;
 				result = collapse_pte_mapped_thp(mm, addr,
 								 !cc->is_khugepaged);
 				mmap_read_unlock(mm);
 			}
 		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
+			result = khugepaged_scan_pmd(mm, vma, addr,
 							 mmap_locked, cc);
 		}
 		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
@@ -2449,7 +2449,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2457,7 +2457,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+		if (unlikely(khugepaged_test_exit_or_disable(mm))) {
 			progress++;
 			break;
 		}
@@ -2479,7 +2479,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(khugepaged_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2515,7 +2515,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (khugepaged_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.48.1




* [RFC v2 3/9] khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
  2025-02-11  0:30 ` [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
  2025-02-11  0:30 ` [RFC v2 2/9] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-11  0:30 ` [RFC v2 4/9] khugepaged: generalize alloc_charge_folio " Nico Pache
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

For khugepaged to support different mTHP orders, we must generalize this
function for arbitrary orders.

No functional change in this patch.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4c88d17250f4..c834ea842847 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   bool expect_anon,
 				   struct vm_area_struct **vmap,
-				   struct collapse_control *cc)
+				   struct collapse_control *cc, int order)
 {
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
@@ -932,9 +932,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
+	if (!thp_vma_suitable_order(vma, address, order))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2796,7 +2796,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
+							 cc, HPAGE_PMD_ORDER);
 			if (result  != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
-- 
2.48.1




* [RFC v2 4/9] khugepaged: generalize alloc_charge_folio for mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (2 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 3/9] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-19 15:29   ` Ryan Roberts
  2025-02-11  0:30 ` [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* " Nico Pache
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

alloc_charge_folio allocates the new folio for the khugepaged collapse.
Generalize the order of the folio allocations to support future mTHP
collapsing.

No functional changes in this patch.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c834ea842847..0cfcdc11cabd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,14 +1074,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, int order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = khugepaged_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -1125,7 +1125,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1851,7 +1851,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.48.1




* [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (3 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 4/9] khugepaged: generalize alloc_charge_folio " Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-19 15:39   ` Ryan Roberts
  2025-02-11  0:30 ` [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

Generalize the order of the __collapse_huge_page_* functions
to support future mTHP collapse.

mTHP collapse can suffer from inconsistent behavior and memory waste
"creep". Disable swapin and shared support for mTHP collapse.

No functional changes in this patch.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 48 ++++++++++++++++++++++++++++--------------------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0cfcdc11cabd..3776055bd477 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
 					struct collapse_control *cc,
-					struct list_head *compound_pagelist)
+					struct list_head *compound_pagelist,
+					u8 order)
 {
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	bool writable = false;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1 << order);
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none(pteval) || (pte_present(pteval) &&
@@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= scaled_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See khugepaged_scan_pmd(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared)) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -711,14 +713,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist,
+						u8 order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, address += PAGE_SIZE) {
+	for (_pte = pte; _pte < pte + (1 << order);
+		_pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
@@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 					     pmd_t *pmd,
 					     pmd_t orig_pmd,
 					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+					     struct list_head *compound_pagelist,
+					     u8 order)
 {
 	spinlock_t *pmd_ptl;
 
@@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
 }
 
 /*
@@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, u8 order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
@@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 compound_pagelist, order);
 
 	return result;
 }
@@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
 				       struct vm_area_struct *vma,
 				       unsigned long haddr, pmd_t *pmd,
-				       int referenced)
+				       int referenced, u8 order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long address, end = haddr + (PAGE_SIZE << order);
 	int result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
 
+		if (order != HPAGE_PMD_ORDER) {
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1114,7 +1123,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
-
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/*
@@ -1149,7 +1157,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+				referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1196,7 +1204,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      &compound_pagelist);
+					&compound_pagelist, HPAGE_PMD_ORDER);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1226,7 +1234,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
-- 
2.48.1




* [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (4 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* " Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-17  7:27   ` Dev Jain
                     ` (2 more replies)
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
                   ` (6 subsequent siblings)
  12 siblings, 3 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

khugepaged scans PMD ranges for potential collapse to a hugepage. To add
mTHP support we use this scan to instead record which chunks of the PMD
range are utilized.

Create a bitmap that represents a PMD in chunks of order MIN_MTHP_ORDER.
By default we set this to order 3. The reasoning is that for a 4K page size
with 512 PTEs per PMD this results in a 64-bit bitmap, which has some
optimizations. For other arches, like ARM64 with 64K pages, we can set a
larger order if needed.
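
As a quick sanity check of the sizing (a minimal sketch mirroring the
MIN_MTHP_ORDER/MTHP_BITMAP_SIZE macros added below; the 64K-page numbers
assume a PMD order of 13, as on an arm64 64K configuration):

/* Bitmap sizing sketch: one bit per order-MIN_MTHP_ORDER chunk of a PMD. */
#include <stdio.h>

#define MIN_MTHP_ORDER 3

static void show(const char *name, int hpage_pmd_order)
{
	int bits = 1 << (hpage_pmd_order - MIN_MTHP_ORDER);

	printf("%s: %d PTEs per PMD -> %d-bit bitmap\n",
	       name, 1 << hpage_pmd_order, bits);
}

int main(void)
{
	show("4K pages  (PMD order 9)", 9);	/* 512 PTEs  -> 64 bits   */
	show("64K pages (PMD order 13)", 13);	/* 8192 PTEs -> 1024 bits */
	return 0;
}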

khugepaged_scan_bitmap uses an explicit stack of scan_bit_state entries to
recursively scan the bitmap of utilized chunks. From that we determine
which mTHP size fits best; in the following patch, we set this bitmap while
scanning the PMD.

max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.

If an order is set to "always", always collapse to that order in a greedy
manner.
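
To make the threshold concrete, here is a small sketch of the "almost full"
check performed at each step of the recursion (same arithmetic as the hunk
below; max_ptes_none=255 is only an example value):

/* Sketch of the per-region threshold used by khugepaged_scan_bitmap().
 * Each bit covers an order-MIN_MTHP_ORDER chunk; a region of
 * 1 << state_order bits is collapsed to order state_order + MIN_MTHP_ORDER
 * when more than threshold_bits of its chunk bits are set.
 */
#include <stdio.h>

#define MIN_MTHP_ORDER	3
#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)

int main(void)
{
	unsigned int max_ptes_none = 255;	/* example sysfs setting */
	int state_order;

	for (state_order = HPAGE_PMD_ORDER - MIN_MTHP_ORDER; state_order >= 0; state_order--) {
		int num_chunks = 1 << state_order;
		int threshold_bits = (HPAGE_PMD_NR - max_ptes_none - 1)
					>> (HPAGE_PMD_ORDER - state_order);

		printf("collapse order %d: %d chunk bits, need more than %d set\n",
		       state_order + MIN_MTHP_ORDER, num_chunks, threshold_bits);
	}
	return 0;
}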

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/khugepaged.h |  4 ++
 mm/khugepaged.c            | 89 +++++++++++++++++++++++++++++++++++---
 2 files changed, 86 insertions(+), 7 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 1f46046080f5..1fe0c4fc9d37 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,6 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
+#define MIN_MTHP_ORDER	3
+#define MIN_MTHP_NR	(1<<MIN_MTHP_ORDER)
+#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) - MIN_MTHP_ORDER))
+#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
 
 extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3776055bd477..c8048d9ec7fb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+struct scan_bit_state {
+	u8 order;
+	u16 offset;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -102,6 +107,15 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
+	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
+	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
+};
+
+struct collapse_control khugepaged_collapse_control = {
+	.is_khugepaged = true,
 };
 
 /**
@@ -851,10 +865,6 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-struct collapse_control khugepaged_collapse_control = {
-	.is_khugepaged = true,
-};
-
 static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
@@ -1112,7 +1122,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, bool *mmap_locked,
+				  u8 order, u16 offset)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1130,8 +1141,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * The allocation can take potentially a long time if it involves
 	 * sync compaction, and we do not need to hold the mmap_lock during
 	 * that. We will recheck the vma after taking it again in write mode.
+	 * If collapsing mTHPs we may have already released the read_lock.
 	 */
-	mmap_read_unlock(mm);
+	if (*mmap_locked) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+	}
 
 	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
@@ -1266,12 +1281,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
+	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
 	return result;
 }
 
+// Recursive function to consume the bitmap
+static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
+			int referenced, int unmapped, struct collapse_control *cc,
+			bool *mmap_locked, unsigned long enabled_orders)
+{
+	u8 order, next_order;
+	u16 offset, mid_offset;
+	int num_chunks;
+	int bits_set, threshold_bits;
+	int top = -1;
+	int collapsed = 0;
+	int ret;
+	struct scan_bit_state state;
+
+	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
+
+	while (top >= 0) {
+		state = cc->mthp_bitmap_stack[top--];
+		order = state.order + MIN_MTHP_ORDER;
+		offset = state.offset;
+		num_chunks = 1 << (state.order);
+		// Skip mTHP orders that are not enabled
+		if (!test_bit(order, &enabled_orders))
+			goto next;
+
+		// copy the relavant section to a new bitmap
+		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
+				  MTHP_BITMAP_SIZE);
+
+		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
+		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
+				>> (HPAGE_PMD_ORDER - state.order);
+
+		//Check if the region is "almost full" based on the threshold
+		if (bits_set > threshold_bits
+			|| test_bit(order, &huge_anon_orders_always)) {
+			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
+					mmap_locked, order, offset * MIN_MTHP_NR);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += (1 << order);
+				continue;
+			}
+		}
+
+next:
+		if (state.order > 0) {
+			next_order = state.order - 1;
+			mid_offset = offset + (num_chunks / 2);
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, mid_offset };
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, offset };
+			}
+	}
+	return collapsed;
+}
+
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
@@ -1440,7 +1514,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc);
+					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
 		/* collapse_huge_page will return with the mmap_lock released */
 		*mmap_locked = false;
 	}
@@ -2856,6 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	mmdrop(mm);
 	kfree(cc);
 
+
 	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
 			: madvise_collapse_errno(last_fail);
 }
-- 
2.48.1




* [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (5 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-12 17:04   ` Usama Arif
                     ` (4 more replies)
  2025-02-11  0:30 ` [RFC v2 8/9] khugepaged: improve tracepoints for mTHP orders Nico Pache
                   ` (5 subsequent siblings)
  12 siblings, 5 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning a PMD range for potential collapse candidates, keep track
of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
utilized region of order MIN_MTHP_ORDER PTEs. We remove the restriction
of max_ptes_none during the scan phase so we don't bail out early and miss
potential mTHP candidates.

After the scan is complete we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.

If an mTHP collapse is attempted but the range contains swapped-out or
shared pages, we don't perform the collapse.
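
Informally, the scan-side bookkeeping added below boils down to the
following per-chunk rule (a simplified userspace model, not the kernel
code; the present-PTE pattern and max_ptes_none value are made up for
illustration):

/* Simplified model of the per-chunk tracking in khugepaged_scan_pmd():
 * count none/zero PTEs per MIN_MTHP_NR-page chunk and mark the chunk's
 * bit when the chunk stays under the scaled max_ptes_none limit.
 */
#include <stdbool.h>
#include <stdio.h>

#define MIN_MTHP_ORDER	3
#define MIN_MTHP_NR	(1 << MIN_MTHP_ORDER)
#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)

int main(void)
{
	bool pte_present[HPAGE_PMD_NR] = { false };	/* stand-in for the real PTEs */
	unsigned long long bitmap = 0;			/* 64 chunks with 4K pages */
	unsigned int max_ptes_none = 255;		/* example sysfs value */
	int scaled_none = max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
	int chunk_none_count = 0;
	int i;

	/* pretend the first 24 pages of the PMD range are populated */
	for (i = 0; i < 24; i++)
		pte_present[i] = true;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		if (i % MIN_MTHP_NR == 0)
			chunk_none_count = 0;
		if (!pte_present[i])
			++chunk_none_count;
		/* end of a chunk: mark it as utilized if it is "full enough" */
		if ((i + 1) % MIN_MTHP_NR == 0 && chunk_none_count < scaled_none)
			bitmap |= 1ULL << (i / MIN_MTHP_NR);
	}

	printf("utilized-chunk bitmap: 0x%llx\n", bitmap);	/* prints 0x7 */
	return 0;
}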

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 83 insertions(+), 39 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c8048d9ec7fb..cd310989725b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	unsigned long _address = address + offset * PAGE_SIZE;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/*
@@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		*mmap_locked = false;
 	}
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-				referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+				referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+				_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address to
@@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-					&compound_pagelist, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+					&compound_pagelist, order);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   &compound_pagelist, HPAGE_PMD_ORDER);
+					   vma, _address, pte_ptl,
+					   &compound_pagelist, order);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
@@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
-	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
-	spin_unlock(pmd_ptl);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
+		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
+		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, address, pmd, _pmd);
+		update_mmu_cache_pmd(vma, address, pmd);
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	} else { //mTHP
+		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		folio_ref_add(folio, (1 << order) - 1);
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		spin_lock(pte_ptl);
+		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+		spin_unlock(pte_ptl);
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
@@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long _address;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 	bool writable = false;
-
+	int chunk_none_count = 0;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, _address += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		if (i % MIN_MTHP_NR == 0)
+			chunk_none_count = 0;
+
+		_pte = pte + i;
+		_address = address + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
@@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			}
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++chunk_none_count;
 			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (userfaultfd_armed(vma)) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
+			continue;
 		}
 		if (pte_uffd_wp(pteval)) {
 			/*
@@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
 								     address)))
 			referenced++;
+
+		/*
+		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
+		 * pages keep track of it in the bitmap for mTHP collapsing.
+		 */
+		if (chunk_none_count < scaled_none &&
+			(i + 1) % MIN_MTHP_NR == 0)
+			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
 	}
+
 	if (!writable) {
 		result = SCAN_PAGE_RO;
 	} else if (cc->is_khugepaged &&
@@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			tva_flags, THP_ORDERS_ALL_ANON);
+		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
+			       mmap_locked, enabled_orders);
+		if (result > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
@@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
 			fput(file);
 			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
 				mmap_read_lock(mm);
+				*mmap_locked = true;
 				if (khugepaged_test_exit_or_disable(mm))
 					goto end;
 				result = collapse_pte_mapped_thp(mm, addr,
 								 !cc->is_khugepaged);
 				mmap_read_unlock(mm);
+				*mmap_locked = false;
 			}
 		} else {
 			result = khugepaged_scan_pmd(mm, vma, addr,
-- 
2.48.1




* [RFC v2 8/9] khugepaged: improve tracepoints for mTHP orders
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (6 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-11  0:30 ` [RFC v2 9/9] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

Add the order to the tracepoints to give better insight into which order
khugepaged is operating on.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    | 10 +++++----
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 9d5c00b0285c..ea2fe20a39f5 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -92,34 +92,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s order=%d",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct page *page, int none_or_zero,
-		 int referenced, bool  writable, int status),
+		 int referenced, bool  writable, int status, int order),
 
-	TP_ARGS(page, none_or_zero, referenced, writable, status),
+	TP_ARGS(page, none_or_zero, referenced, writable, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
@@ -127,6 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__field(int, referenced)
 		__field(bool, writable)
 		__field(int, status)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
@@ -135,27 +139,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->referenced = referenced;
 		__entry->writable = writable;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s order=%d",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
 		__entry->writable,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+			int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
@@ -163,13 +171,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%d",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cd310989725b..e2ba18e57064 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -713,13 +713,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
-						    referenced, writable, result);
+						    referenced, writable, result,
+						    order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
-					    referenced, writable, result);
+					    referenced, writable, result, order);
 	return result;
 }
 
@@ -1088,7 +1089,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+						order);
 	return result;
 }
 
@@ -1305,7 +1307,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC v2 9/9] khugepaged: skip collapsing mTHP to smaller orders
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (7 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 8/9] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-02-11  0:30 ` Nico Pache
  2025-02-19 16:57   ` Ryan Roberts
  2025-02-11 12:49 ` [RFC v2 0/9] khugepaged: mTHP support Dev Jain
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-11  0:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm, rostedt, mathieu.desnoyers, tiwai

khugepaged may try to collapse an mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
whether it's OK to collapse to a smaller mTHP size (as in the case of a
partially mapped folio).

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e2ba18e57064..fc30698b8e6e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -622,6 +622,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
+
 		/* See khugepaged_scan_pmd(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (8 preceding siblings ...)
  2025-02-11  0:30 ` [RFC v2 9/9] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-02-11 12:49 ` Dev Jain
  2025-02-12 16:49   ` Nico Pache
  2025-02-17  6:39 ` Dev Jain
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-11 12:49 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/25 6:00 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> constantly promote mTHPs to the next available size.
> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>   Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>     monitoring a number of stats and tracepoints. The code is available
>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>     The summary from my testings was that there was no significant regression
>     noticed through this test. In some cases my changes had better collapse
>     latencies, and was able to scan more pages in the same amount of time/work,
>     but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>    (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>     the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>     some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
>      madvise_collapse
>    khugepaged: rename hpage_collapse_* to khugepaged_*
>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>    khugepaged: generalize alloc_charge_folio for mTHP support
>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>    khugepaged: add mTHP support
>    khugepaged: improve tracepoints for mTHP orders
>    khugepaged: skip collapsing mTHP to smaller orders
> 
>   include/linux/khugepaged.h         |   4 +
>   include/trace/events/huge_memory.h |  34 ++-
>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>   3 files changed, 306 insertions(+), 154 deletions(-)
> 

Does this patchset suffer from the problem described here:
https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-11 12:49 ` [RFC v2 0/9] khugepaged: mTHP support Dev Jain
@ 2025-02-12 16:49   ` Nico Pache
  2025-02-13  8:26     ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-12 16:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/02/25 6:00 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If a mTHP collapse
> > is attempted, but contains swapped out, or shared pages, we dont perform the
> > collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> > constantly promote mTHPs to the next available size.
> >
> > Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2:     Refactor/rename hpage_collapse
> > Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> > Patch 6-9:   The mTHP patches
> >
> > ---------
> >   Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >     monitoring a number of stats and tracepoints. The code is available
> >     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >     The summary from my testings was that there was no significant regression
> >     noticed through this test. In some cases my changes had better collapse
> >     latencies, and was able to scan more pages in the same amount of time/work,
> >     but for the most part the results were consistant.
> > - redis testing. I tested these changes along with my defer changes
> >    (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >     the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >     some of the "creep" that was discovered in v1.
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >    introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >      madvise_collapse
> >    khugepaged: rename hpage_collapse_* to khugepaged_*
> >    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >    khugepaged: generalize alloc_charge_folio for mTHP support
> >    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >    khugepaged: add mTHP support
> >    khugepaged: improve tracepoints for mTHP orders
> >    khugepaged: skip collapsing mTHP to smaller orders
> >
> >   include/linux/khugepaged.h         |   4 +
> >   include/trace/events/huge_memory.h |  34 ++-
> >   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >   3 files changed, 306 insertions(+), 154 deletions(-)
> >
>
> Does this patchset suffer from the problem described here:
> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
Hi Dev,

Sorry I meant to get back to you about that.

I understand your concern, but like I've mentioned before, the scan
with the read lock was done so we don't have to do the more expensive
locking and can still gain insight into the state. You are right
that this info could become stale if the state changes dramatically,
but the collapse_isolate function will verify it and not collapse.
From my testing I found this to rarely happen.

Also, khugepaged, my changes, and your changes are all victims of
this. Once we drop the read lock (to either allocate the folio, or
right before acquiring the write_lock), the state can change. In your
case, yes, you are gathering more up-to-date information, but is it
really that important/worth it to retake locks and rescan for each
instance if we are about to reverify with the write lock taken?

So in my eyes, this is not a "problem"
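
For illustration, a minimal user-space sketch of the pattern being
discussed (scan under the shared lock, drop it for the expensive work,
then verify again once the exclusive lock is held); pthread rwlocks
stand in for mmap_lock, and every name here is hypothetical:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy stand-ins: the rwlock plays the role of mmap_lock and the
 * integer plays the role of the PTE state that the scan looks at.
 */
static pthread_rwlock_t toy_mmap_lock = PTHREAD_RWLOCK_INITIALIZER;
static int toy_pte_state = 42;

/* Cheap scan under the shared lock; the snapshot can go stale as
 * soon as the lock is dropped. */
static int scan_range(void)
{
	int snapshot;

	pthread_rwlock_rdlock(&toy_mmap_lock);
	snapshot = toy_pte_state;
	pthread_rwlock_unlock(&toy_mmap_lock);
	return snapshot;
}

/* The expensive work (allocation, swap-in, ...) happens unlocked;
 * the exclusive lock is then taken and the state is verified before
 * anything irreversible is done. */
static bool collapse_range(int snapshot)
{
	bool ok;

	pthread_rwlock_wrlock(&toy_mmap_lock);
	ok = (toy_pte_state == snapshot);
	if (ok) {
		/* ... the actual collapse would go here ... */
	}
	pthread_rwlock_unlock(&toy_mmap_lock);
	return ok;
}

int main(void)
{
	int snapshot = scan_range();

	/* lock dropped: another thread could change toy_pte_state here */
	if (collapse_range(snapshot))
		printf("collapsed\n");
	else
		printf("state changed, bailed out\n");
	return 0;
}

The race window is the comment in main(); the disagreement is only
about whether an extra check inside that window pays off, not about
the final verification itself.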

Cheers,
-- Nico


>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
@ 2025-02-12 17:04   ` Usama Arif
  2025-02-12 18:16     ` Nico Pache
  2025-02-17 20:55   ` Usama Arif
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Usama Arif @ 2025-02-12 17:04 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai, Johannes Weiner



On 11/02/2025 00:30, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential collapse candidates, keep track
> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> of max_ptes_none during the scan phase so we dont bailout early and miss
> potential mTHP candidates.
> 
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
> 
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 83 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c8048d9ec7fb..cd310989725b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;
>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	/*
> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		*mmap_locked = false;
>  	}
>  
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
>  
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>  	/*
>  	 * Prevent all access to pagetables with the exception of
>  	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	vma_start_write(vma);
>  	anon_vma_lock_write(vma->anon_vma);
>  
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>  	/*
>  	 * This removes any huge TLB entry from the CPU so we won't allow
>  	 * huge and small TLB entries for the same virtual address to
> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>  
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	anon_vma_unlock_write(vma->anon_vma);
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { //mTHP
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);
> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		spin_lock(pte_ptl);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +		spin_unlock(pte_ptl);
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		deferred_split_folio(folio, false);


Hi Nico,

This patch will have the same issue as the one I pointed out in
https://lore.kernel.org/all/82b9efd1-f2a6-4452-b2ea-6c163e17cdf7@gmail.com/ ?



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-12 17:04   ` Usama Arif
@ 2025-02-12 18:16     ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-12 18:16 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai, Johannes Weiner

On Wed, Feb 12, 2025 at 10:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning a PMD range for potential collapse candidates, keep track
> > of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> > utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> > of max_ptes_none during the scan phase so we dont bailout early and miss
> > potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 83 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c8048d9ec7fb..cd310989725b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /*
> > @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
> > @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> > @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       pte_unmap(pte);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> > @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { //mTHP
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             spin_lock(pte_ptl);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +             spin_unlock(pte_ptl);
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             deferred_split_folio(folio, false);
>
>
> Hi Nico,
>
> This patch will have the same issue as the one I pointed out in
> https://lore.kernel.org/all/82b9efd1-f2a6-4452-b2ea-6c163e17cdf7@gmail.com/ ?

Hi Usama,

Yes, thanks for pointing that out! I'll make sure to remove the
deferred_split_folio call.

>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-12 16:49   ` Nico Pache
@ 2025-02-13  8:26     ` Dev Jain
  2025-02-13 11:21       ` Dev Jain
  2025-02-13 19:39       ` Nico Pache
  0 siblings, 2 replies; 55+ messages in thread
From: Dev Jain @ 2025-02-13  8:26 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 12/02/25 10:19 pm, Nico Pache wrote:
> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 11/02/25 6:00 am, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>> collapse.
>>>
>>> With the default max_ptes_none=511, the code should keep its most of its
>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>> constantly promote mTHPs to the next available size.
>>>
>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 2:     Refactor/rename hpage_collapse
>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>> Patch 6-9:   The mTHP patches
>>>
>>> ---------
>>>    Testing
>>> ---------
>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>> - selftests mm
>>> - I created a test script that I used to push khugepaged to its limits while
>>>      monitoring a number of stats and tracepoints. The code is available
>>>      here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>      The summary from my testings was that there was no significant regression
>>>      noticed through this test. In some cases my changes had better collapse
>>>      latencies, and was able to scan more pages in the same amount of time/work,
>>>      but for the most part the results were consistant.
>>> - redis testing. I tested these changes along with my defer changes
>>>     (see followup post for more details).
>>> - some basic testing on 64k page size.
>>> - lots of general use. These changes have been running in my VM for some time.
>>>
>>> Changes since V1 [2]:
>>> - Minor bug fixes discovered during review and testing
>>> - removed dynamic allocations for bitmaps, and made them stack based
>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>> - Updated trace events to include collapsing order info.
>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>      the same max_ptes_none scaling principle to achieve this.
>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>      some of the "creep" that was discovered in v1.
>>>
>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>
>>> Nico Pache (9):
>>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>       madvise_collapse
>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>     khugepaged: add mTHP support
>>>     khugepaged: improve tracepoints for mTHP orders
>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>    include/linux/khugepaged.h         |   4 +
>>>    include/trace/events/huge_memory.h |  34 ++-
>>>    mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>    3 files changed, 306 insertions(+), 154 deletions(-)
>>>
>>
>> Does this patchset suffer from the problem described here:
>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> Hi Dev,
> 
> Sorry I meant to get back to you about that.
> 
> I understand your concern, but like I've mentioned before, the scan
> with the read lock was done so we dont have to do the more expensive
> locking, and could still gain insight into the state. You are right
> that this info could become stale if the state changes dramatically,
> but the collapse_isolate function will verify it and not collapse.

If the state changes dramatically, the _isolate function will verify it 
and fall back. And this fallback happens after following this costly 
path: retrieve a large folio from the buddy allocator -> swap in pages 
from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush 
on all CPUs -> fall back in _isolate().
If you do fail in _isolate(), doesn't it make sense to get the updated 
state for the next fallback order immediately, because we have prior 
information that we failed because of PTE state? What your algorithm 
will do is *still* follow the costly path described above, and again 
fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like 
mine would.

The verification of the PTE state by the _isolate() function is the "no 
turning back" point of the algorithm. The verification by 
hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth 
it, before we do costly operations" point of the algorithm.
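
As a toy illustration of that cost argument (a user-space sketch, not
the kernel code; the condition and all names are made up): without the
scan-time check, a doomed order is only rejected after the expensive
setup has already run.

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the two check points: "order <= 4" stands in for
 * whatever PTE-state condition would make the collapse fail.
 */
static bool pte_state_ok(int order)
{
	return order <= 4;
}

static bool scan_time_check(int order)		/* cheap, done while scanning */
{
	return pte_state_ok(order);
}

static bool expensive_setup(int order)		/* alloc, swap-in, write lock... */
{
	printf("costly path taken for order %d\n", order);
	return true;
}

static bool isolate_time_check(int order)	/* the "no turning back" check */
{
	return pte_state_ok(order);
}

static bool try_collapse(int order, bool with_precheck)
{
	if (with_precheck && !scan_time_check(order))
		return false;			/* fail early and cheaply */
	if (!expensive_setup(order))
		return false;
	return isolate_time_check(order);	/* fail late and expensively */
}

int main(void)
{
	printf("with precheck:    %s\n",
	       try_collapse(9, true) ? "collapsed" : "rejected");
	printf("without precheck: %s\n",
	       try_collapse(9, false) ? "collapsed" : "rejected");
	return 0;
}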

>  From my testing I found this to rarely happen.

Unfortunately, I am not very familiar with performance testing/load 
testing; I am fairly new to kernel programming, so I am getting there. 
But it really depends on the type of test you are running, what actually 
runs on memory-intensive systems, etc. In fact, on loaded systems I 
would expect the PTE state to change dramatically. But still, no opinion 
here.

> 
> Also, khugepaged, my changes, and your changes are all a victim of
> this. Once we drop the read lock (to either allocate the folio, or
> right before acquiring the write_lock), the state can change. In your
> case, yes, you are gathering more up to date information, but is it
> really that important/worth it to retake locks and rescan for each
> instance if we are about to reverify with the write lock taken?

You said "reverify": You are removing the verification, so this step 
won't be reverification, it will be verification. We do not want to 
verify *after* we have already done 95% of latency-heavy stuff, only to 
know that we are going to fail.

Algorithms in the kernel, in general, are of the following form: 1) 
Verify if a condition is true, resulting in taking a control path -> 2) 
do a lot of stuff -> "no turning back" step, wherein before committing 
(by taking locks, say), reverify if this is the control path we should 
be in. You are eliminating step 1).

Therefore, I will have to say that I disagree with your approach.

On top of this, in the subjective analysis in [1], point number 7 (along 
with point number 1) remains. And point number 4 remains.

[1] 
https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/

> 
> So in my eyes, this is not a "problem"

Looks like the kernel scheduled us for a high-priority debate; I hope 
there's no deadlock :)

> 
> Cheers,
> -- Nico
> 
> 
>>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-13  8:26     ` Dev Jain
@ 2025-02-13 11:21       ` Dev Jain
  2025-02-13 19:39       ` Nico Pache
  1 sibling, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-02-13 11:21 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 13/02/25 1:56 pm, Dev Jain wrote:
> 
> 
> On 12/02/25 10:19 pm, Nico Pache wrote:
>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>> The following series provides khugepaged and madvise collapse with the
>>>> capability to collapse regions to mTHPs.
>>>>
>>>> To achieve this we generalize the khugepaged functions to no longer 
>>>> depend
>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of 
>>>> pages
>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>> using a bitmap. After the PMD scan is done, we do binary recursion 
>>>> on the
>>>> bitmap to find the optimal mTHP sizes for the PMD range. The 
>>>> restriction
>>>> on max_ptes_none is removed during the scan, to make sure we account 
>>>> for
>>>> the whole PMD range. max_ptes_none will be scaled by the attempted 
>>>> collapse
>>>> order to determine how full a THP must be to be eligible. If a mTHP 
>>>> collapse
>>>> is attempted, but contains swapped out, or shared pages, we dont 
>>>> perform the
>>>> collapse.
>>>>
>>>> With the default max_ptes_none=511, the code should keep its most of 
>>>> its
>>>> original behavior. To exercise mTHP collapse we need to set 
>>>> max_ptes_none<=255.
>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse 
>>>> "creep" and
>>>> constantly promote mTHPs to the next available size.
>>>>
>>>> Patch 1:     Some refactoring to combine madvise_collapse and 
>>>> khugepaged
>>>> Patch 2:     Refactor/rename hpage_collapse
>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>> Patch 6-9:   The mTHP patches
>>>>
>>>> ---------
>>>>    Testing
>>>> ---------
>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>> - selftests mm
>>>> - I created a test script that I used to push khugepaged to its 
>>>> limits while
>>>>      monitoring a number of stats and tracepoints. The code is 
>>>> available
>>>>      here[1] (Run in legacy mode for these changes and set mthp 
>>>> sizes to inherit)
>>>>      The summary from my testings was that there was no significant 
>>>> regression
>>>>      noticed through this test. In some cases my changes had better 
>>>> collapse
>>>>      latencies, and was able to scan more pages in the same amount 
>>>> of time/work,
>>>>      but for the most part the results were consistant.
>>>> - redis testing. I tested these changes along with my defer changes
>>>>     (see followup post for more details).
>>>> - some basic testing on 64k page size.
>>>> - lots of general use. These changes have been running in my VM for 
>>>> some time.
>>>>
>>>> Changes since V1 [2]:
>>>> - Minor bug fixes discovered during review and testing
>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>> - Updated trace events to include collapsing order info.
>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>> - No longer require a chunk to be fully utilized before setting the 
>>>> bit. Use
>>>>      the same max_ptes_none scaling principle to achieve this.
>>>> - Skip mTHP collapse that requires swapin or shared handling. This 
>>>> helps prevent
>>>>      some of the "creep" that was discovered in v1.
>>>>
>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- 
>>>> npache@redhat.com/
>>>>
>>>> Nico Pache (9):
>>>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>       madvise_collapse
>>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>     khugepaged: add mTHP support
>>>>     khugepaged: improve tracepoints for mTHP orders
>>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>>
>>>>    include/linux/khugepaged.h         |   4 +
>>>>    include/trace/events/huge_memory.h |  34 ++-
>>>>    mm/khugepaged.c                    | 422 ++++++++++++++++++ 
>>>> +----------
>>>>    3 files changed, 306 insertions(+), 154 deletions(-)
>>>>
>>>
>>> Does this patchset suffer from the problem described here:
>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- 
>>> c2d48d4963b6@arm.com/
>> Hi Dev,
>>
>> Sorry I meant to get back to you about that.
>>
>> I understand your concern, but like I've mentioned before, the scan
>> with the read lock was done so we dont have to do the more expensive
>> locking, and could still gain insight into the state. You are right
>> that this info could become stale if the state changes dramatically,
>> but the collapse_isolate function will verify it and not collapse.
> 
> If the state changes dramatically, the _isolate function will verify it, 
> and fallback. And this fallback happens after following this costly 
> path: retrieve a large folio from the buddy allocator -> swapin pages 
> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush 
> on all CPUs -> fallback in _isolate().
> If you do fail in _isolate(), doesn't it make sense to get the updated 
> state for the next fallback order immediately, because we have prior 
> information that we failed because of PTE state? What your algorithm 
> will do is *still* follow the costly path described above, and again 
> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like 
> mine would.
> 
> The verification of the PTE state by the _isolate() function is the "no 
> turning back" point of the algorithm. The verification by 
> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth 
> it, before we do costly operations" point of the algorithm.
> 
>>  From my testing I found this to rarely happen.
> 
> Unfortunately, I am not very familiar with performance testing/load 
> testing, I am fairly new to kernel programming, so I am getting there. 
> But it really depends on the type of test you are running, what actually 
> runs on memory-intensive systems, etc etc. In fact, on loaded systems I 
> would expect the PTE state to dramatically change. But still, no opinion 
> here.
> 
>>
>> Also, khugepaged, my changes, and your changes are all a victim of
>> this. Once we drop the read lock (to either allocate the folio, or
>> right before acquiring the write_lock), the state can change. In your
>> case, yes, you are gathering more up to date information, but is it
>> really that important/worth it to retake locks and rescan for each
>> instance if we are about to reverify with the write lock taken?
> 
> You said "reverify": You are removing the verification, so this step 
> won't be reverification, it will be verification. We do not want to 
> verify *after* we have already done 95% of latency-heavy stuff, only to 
> know that we are going to fail.
> 
> Algorithms in the kernel, in general, are of the following form: 1) 
> Verify if a condition is true, resulting in taking a control path -> 2) 
> do a lot of stuff -> "no turning back" step, wherein before committing 
> (by taking locks, say), reverify if this is the control path we should 
> be in. You are eliminating step 1).
> 
> Therefore, I will have to say that I disagree with your approach.
> 
> On top of this, in the subjective analysis in [1], point number 7 (along 
> with point number 1) remains. And, point number 4 remains.
> 
> [1] https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b- 
> aba4b1a441b4@arm.com/
> 
>>
>> So in my eyes, this is not a "problem"
> 
> Looks like the kernel scheduled us for a high-priority debate, I hope 
> there's no deadlock :)
> 
>>
>> Cheers,
>> -- Nico
>>
>>

In any case, as Andrew notes, the first step is to justify that this 
functionality is beneficial to get in. The second step would be to 
compare the implementations. It would be great if someone from the 
community could jump in to test these things out : ) but in the meantime
I will begin working on the first step.

>>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-13  8:26     ` Dev Jain
  2025-02-13 11:21       ` Dev Jain
@ 2025-02-13 19:39       ` Nico Pache
  2025-02-14  2:01         ` Dev Jain
  1 sibling, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-13 19:39 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 12/02/25 10:19 pm, Nico Pache wrote:
> > On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 11/02/25 6:00 am, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>> order to determine how full a THP must be to be eligible. If a mTHP collapse
> >>> is attempted, but contains swapped out, or shared pages, we dont perform the
> >>> collapse.
> >>>
> >>> With the default max_ptes_none=511, the code should keep its most of its
> >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>> constantly promote mTHPs to the next available size.
> >>>
> >>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 2:     Refactor/rename hpage_collapse
> >>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>> Patch 6-9:   The mTHP patches
> >>>
> >>> ---------
> >>>    Testing
> >>> ---------
> >>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>> - selftests mm
> >>> - I created a test script that I used to push khugepaged to its limits while
> >>>      monitoring a number of stats and tracepoints. The code is available
> >>>      here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>      The summary from my testings was that there was no significant regression
> >>>      noticed through this test. In some cases my changes had better collapse
> >>>      latencies, and was able to scan more pages in the same amount of time/work,
> >>>      but for the most part the results were consistant.
> >>> - redis testing. I tested these changes along with my defer changes
> >>>     (see followup post for more details).
> >>> - some basic testing on 64k page size.
> >>> - lots of general use. These changes have been running in my VM for some time.
> >>>
> >>> Changes since V1 [2]:
> >>> - Minor bug fixes discovered during review and testing
> >>> - removed dynamic allocations for bitmaps, and made them stack based
> >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>> - Updated trace events to include collapsing order info.
> >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>      the same max_ptes_none scaling principle to achieve this.
> >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>      some of the "creep" that was discovered in v1.
> >>>
> >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>
> >>> Nico Pache (9):
> >>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>       madvise_collapse
> >>>     khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>     khugepaged: generalize alloc_charge_folio for mTHP support
> >>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>     khugepaged: add mTHP support
> >>>     khugepaged: improve tracepoints for mTHP orders
> >>>     khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>    include/linux/khugepaged.h         |   4 +
> >>>    include/trace/events/huge_memory.h |  34 ++-
> >>>    mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>    3 files changed, 306 insertions(+), 154 deletions(-)
> >>>
> >>
> >> Does this patchset suffer from the problem described here:
> >> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> > Hi Dev,
> >
> > Sorry I meant to get back to you about that.
> >
> > I understand your concern, but like I've mentioned before, the scan
> > with the read lock was done so we dont have to do the more expensive
> > locking, and could still gain insight into the state. You are right
> > that this info could become stale if the state changes dramatically,
> > but the collapse_isolate function will verify it and not collapse.
>
> If the state changes dramatically, the _isolate function will verify it,
> and fallback. And this fallback happens after following this costly
> path: retrieve a large folio from the buddy allocator -> swapin pages
> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
> on all CPUs -> fallback in _isolate().
> If you do fail in _isolate(), doesn't it make sense to get the updated
> state for the next fallback order immediately, because we have prior
> information that we failed because of PTE state? What your algorithm
> will do is *still* follow the costly path described above, and again
> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
> mine would.

You do raise a valid point here; I can optimize my solution by
detecting certain collapse failure types and jumping to the next scan.
I'll add that to my solution, thanks!
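
Roughly, I am picturing something like this hypothetical helper (a
sketch only, not code from this series; the name and the exact grouping
of result codes are placeholders for whatever ends up making sense):

	/*
	 * Hypothetical: after a failed collapse attempt, decide whether
	 * smaller orders in the same PMD range are still worth trying,
	 * or whether we should just move on to the next scan.
	 */
	static bool collapse_allows_smaller_orders(int result)
	{
		switch (result) {
		/* mm/VMA/PMD went away or changed under us: give up on this range */
		case SCAN_ANY_PROCESS:
		case SCAN_VMA_NULL:
		case SCAN_VMA_CHECK:
		case SCAN_ADDRESS_RANGE:
		case SCAN_PMD_NULL:
		case SCAN_PMD_NONE:
		case SCAN_PMD_MAPPED:
			return false;
		/* PTE-level failures: a smaller order may still succeed */
		default:
			return true;
		}
	}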

As for the disagreement around the bitmap, we'll leave that up to the
community to decide since we have differing opinions/solutions.

>
> The verification of the PTE state by the _isolate() function is the "no
> turning back" point of the algorithm. The verification by
> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
> it, before we do costly operations" point of the algorithm.
>
> >  From my testing I found this to rarely happen.
>
> Unfortunately, I am not very familiar with performance testing/load
> testing, I am fairly new to kernel programming, so I am getting there.
> But it really depends on the type of test you are running, what actually
> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
> would expect the PTE state to dramatically change. But still, no opinion
> here.

Yeah, there are probably some cases where it happens more often,
most likely with short-lived allocations, but khugepaged doesn't
run that frequently, so those won't be that big of an issue.

Our performance team is currently testing my implementation so I
should have more real workload test results soon. The redis testing
had some gains and didn't show any signs of obvious regressions.

As for the testing, check out
https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
which does the tracing for my testing script. It can help you get
started. There are 3 different traces being applied there: the
bpftrace for collapse latencies; the perf record for the flamegraph
(not actually that useful, but it may help visualize any weird/long
paths that you may not have noticed); and the trace-cmd run, which
records the tracepoints of the scan and collapse functions and then
processes the data using the awk script. The output is the scan rate,
the pages collapsed, and their result status (grouped by order).

You can also look into https://github.com/gormanm/mmtests for
testing/comparing kernels. I was running the
config-memdb-redis-benchmark-medium workload.

>
> >
> > Also, khugepaged, my changes, and your changes are all a victim of
> > this. Once we drop the read lock (to either allocate the folio, or
> > right before acquiring the write_lock), the state can change. In your
> > case, yes, you are gathering more up to date information, but is it
> > really that important/worth it to retake locks and rescan for each
> > instance if we are about to reverify with the write lock taken?
>
> You said "reverify": You are removing the verification, so this step
> won't be reverification, it will be verification. We do not want to
> verify *after* we have already done 95% of latency-heavy stuff, only to
> know that we are going to fail.
>
> Algorithms in the kernel, in general, are of the following form: 1)
> Verify if a condition is true, resulting in taking a control path -> 2)
> do a lot of stuff -> "no turning back" step, wherein before committing
> (by taking locks, say), reverify if this is the control path we should
> be in. You are eliminating step 1).
>
> Therefore, I will have to say that I disagree with your approach.
>
> On top of this, in the subjective analysis in [1], point number 7 (along
> with point number 1) remains. And, point number 4 remains.

For 1), your worst case of 1024 is not the worst case. There are 8
possible orders in your implementation; if all are enabled, that is
4096 iterations in the worst case.
This becomes WAY worse on a 64k page size: ~45,000 iterations vs 4096 in my case.
>
> [1]
> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>
> >
> > So in my eyes, this is not a "problem"
>
> Looks like the kernel scheduled us for a high-priority debate, I hope
> there's no deadlock :)
>
> >
> > Cheers,
> > -- Nico
> >
> >
> >>
> >
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-13 19:39       ` Nico Pache
@ 2025-02-14  2:01         ` Dev Jain
  2025-02-15  0:52           ` Nico Pache
  0 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-14  2:01 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 14/02/25 1:09 am, Nico Pache wrote:
> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 12/02/25 10:19 pm, Nico Pache wrote:
>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>>> The following series provides khugepaged and madvise collapse with the
>>>>> capability to collapse regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>>>> collapse.
>>>>>
>>>>> With the default max_ptes_none=511, the code should keep its most of its
>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>> constantly promote mTHPs to the next available size.
>>>>>
>>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>> Patch 6-9:   The mTHP patches
>>>>>
>>>>> ---------
>>>>>     Testing
>>>>> ---------
>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>> - selftests mm
>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>       monitoring a number of stats and tracepoints. The code is available
>>>>>       here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>>>       The summary from my testings was that there was no significant regression
>>>>>       noticed through this test. In some cases my changes had better collapse
>>>>>       latencies, and was able to scan more pages in the same amount of time/work,
>>>>>       but for the most part the results were consistant.
>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>      (see followup post for more details).
>>>>> - some basic testing on 64k page size.
>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>
>>>>> Changes since V1 [2]:
>>>>> - Minor bug fixes discovered during review and testing
>>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>> - Updated trace events to include collapsing order info.
>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>>>       the same max_ptes_none scaling principle to achieve this.
>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>>>       some of the "creep" that was discovered in v1.
>>>>>
>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>
>>>>> Nico Pache (9):
>>>>>      introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>        madvise_collapse
>>>>>      khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>      khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>      khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>      khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>      khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>      khugepaged: add mTHP support
>>>>>      khugepaged: improve tracepoints for mTHP orders
>>>>>      khugepaged: skip collapsing mTHP to smaller orders
>>>>>
>>>>>     include/linux/khugepaged.h         |   4 +
>>>>>     include/trace/events/huge_memory.h |  34 ++-
>>>>>     mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>     3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>
>>>>
>>>> Does this patchset suffer from the problem described here:
>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
>>> Hi Dev,
>>>
>>> Sorry I meant to get back to you about that.
>>>
>>> I understand your concern, but like I've mentioned before, the scan
>>> with the read lock was done so we dont have to do the more expensive
>>> locking, and could still gain insight into the state. You are right
>>> that this info could become stale if the state changes dramatically,
>>> but the collapse_isolate function will verify it and not collapse.
>>
>> If the state changes dramatically, the _isolate function will verify it,
>> and fallback. And this fallback happens after following this costly
>> path: retrieve a large folio from the buddy allocator -> swapin pages
>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
>> on all CPUs -> fallback in _isolate().
>> If you do fail in _isolate(), doesn't it make sense to get the updated
>> state for the next fallback order immediately, because we have prior
>> information that we failed because of PTE state? What your algorithm
>> will do is *still* follow the costly path described above, and again
>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
>> mine would.
> 
> You do raise a valid point here, I can optimize my solution by
> detecting certain collapse failure types and jump to the next scan.
> I'll add that to my solution, thanks!
> 
> As for the disagreement around the bitmap, we'll leave that up to the
> community to decide since we have differing opinions/solutions.
> 
>>
>> The verification of the PTE state by the _isolate() function is the "no
>> turning back" point of the algorithm. The verification by
>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
>> it, before we do costly operations" point of the algorithm.
>>
>>>   From my testing I found this to rarely happen.
>>
>> Unfortunately, I am not very familiar with performance testing/load
>> testing, I am fairly new to kernel programming, so I am getting there.
>> But it really depends on the type of test you are running, what actually
>> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
>> would expect the PTE state to dramatically change. But still, no opinion
>> here.
> 
> Yeah there are probably some cases where it happens more often.
> Probably in cases of short lived allocations, but khugepaged doesn't
> run that frequently so those won't be that big of an issue.
> 
> Our performance team is currently testing my implementation so I
> should have more real workload test results soon. The redis testing
> had some gains and didn't show any signs of obvious regressions.
> 
> As for the testing, check out
> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
> this does the tracing for my testing script. It can help you get
> started. There are 3 different traces being applied there: the
> bpftrace for collapse latencies, the perf record for the flamegraph
> (not actually that useful, but may be useful to visualize any
> weird/long paths that you may not have noticed), and the trace-cmd
> which records the tracepoint of the scan and the collapse functions
> then processes the data using the awk script-- the output being the
> scan rate, the pages collapsed, and their result status (grouped by
> order).
> 
> You can also look into https://github.com/gormanm/mmtests for
> testing/comparing kernels. I was running the
> config-memdb-redis-benchmark-medium workload.

Thanks. I'll take a look.

> 
>>
>>>
>>> Also, khugepaged, my changes, and your changes are all a victim of
>>> this. Once we drop the read lock (to either allocate the folio, or
>>> right before acquiring the write_lock), the state can change. In your
>>> case, yes, you are gathering more up to date information, but is it
>>> really that important/worth it to retake locks and rescan for each
>>> instance if we are about to reverify with the write lock taken?
>>
>> You said "reverify": You are removing the verification, so this step
>> won't be reverification, it will be verification. We do not want to
>> verify *after* we have already done 95% of latency-heavy stuff, only to
>> know that we are going to fail.
>>
>> Algorithms in the kernel, in general, are of the following form: 1)
>> Verify if a condition is true, resulting in taking a control path -> 2)
>> do a lot of stuff -> "no turning back" step, wherein before committing
>> (by taking locks, say), reverify if this is the control path we should
>> be in. You are eliminating step 1).
>>
>> Therefore, I will have to say that I disagree with your approach.
>>
>> On top of this, in the subjective analysis in [1], point number 7 (along
>> with point number 1) remains. And, point number 4 remains.
> 
> for 1) your worst case of 1024 is not the worst case. There are 8
> possible orders in your implementation, if all are enabled, that is
> 4096 iterations in the worst case.

Yes, that is exactly what I wrote in 1). I am still not convinced that
the overhead you produce plus 512 iterations is going to beat 4096
iterations. Anyway, that is hand-waving and we should test this.

> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case.

Sorry, I am missing something here; how does the number of iterations 
change with page size? Am I not scanning the PTE table, which is 
invariant to the page size?

>>
>> [1]
>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>>>
>>> So in my eyes, this is not a "problem"
>>
>> Looks like the kernel scheduled us for a high-priority debate, I hope
>> there's no deadlock :)
>>
>>>
>>> Cheers,
>>> -- Nico
>>>
>>>
>>>>
>>>
>>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-14  2:01         ` Dev Jain
@ 2025-02-15  0:52           ` Nico Pache
  2025-02-15  6:38             ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-15  0:52 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 14/02/25 1:09 am, Nico Pache wrote:
> > On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 12/02/25 10:19 pm, Nico Pache wrote:
> >>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 11/02/25 6:00 am, Nico Pache wrote:
> >>>>> The following series provides khugepaged and madvise collapse with the
> >>>>> capability to collapse regions to mTHPs.
> >>>>>
> >>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>>>> on max_ptes_none is removed during the scan, to make sure we account for
> >>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
> >>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
> >>>>> collapse.
> >>>>>
> >>>>> With the default max_ptes_none=511, the code should keep its most of its
> >>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>>>> constantly promote mTHPs to the next available size.
> >>>>>
> >>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>>>> Patch 2:     Refactor/rename hpage_collapse
> >>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>>>> Patch 6-9:   The mTHP patches
> >>>>>
> >>>>> ---------
> >>>>>     Testing
> >>>>> ---------
> >>>>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>>>> - selftests mm
> >>>>> - I created a test script that I used to push khugepaged to its limits while
> >>>>>       monitoring a number of stats and tracepoints. The code is available
> >>>>>       here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>>>       The summary from my testings was that there was no significant regression
> >>>>>       noticed through this test. In some cases my changes had better collapse
> >>>>>       latencies, and was able to scan more pages in the same amount of time/work,
> >>>>>       but for the most part the results were consistant.
> >>>>> - redis testing. I tested these changes along with my defer changes
> >>>>>      (see followup post for more details).
> >>>>> - some basic testing on 64k page size.
> >>>>> - lots of general use. These changes have been running in my VM for some time.
> >>>>>
> >>>>> Changes since V1 [2]:
> >>>>> - Minor bug fixes discovered during review and testing
> >>>>> - removed dynamic allocations for bitmaps, and made them stack based
> >>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>>>> - Updated trace events to include collapsing order info.
> >>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>>>       the same max_ptes_none scaling principle to achieve this.
> >>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>>>       some of the "creep" that was discovered in v1.
> >>>>>
> >>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>>>
> >>>>> Nico Pache (9):
> >>>>>      introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>>>        madvise_collapse
> >>>>>      khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>      khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>>>      khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>      khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>>>      khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>      khugepaged: add mTHP support
> >>>>>      khugepaged: improve tracepoints for mTHP orders
> >>>>>      khugepaged: skip collapsing mTHP to smaller orders
> >>>>>
> >>>>>     include/linux/khugepaged.h         |   4 +
> >>>>>     include/trace/events/huge_memory.h |  34 ++-
> >>>>>     mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>>>     3 files changed, 306 insertions(+), 154 deletions(-)
> >>>>>
> >>>>
> >>>> Does this patchset suffer from the problem described here:
> >>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> >>> Hi Dev,
> >>>
> >>> Sorry I meant to get back to you about that.
> >>>
> >>> I understand your concern, but like I've mentioned before, the scan
> >>> with the read lock was done so we dont have to do the more expensive
> >>> locking, and could still gain insight into the state. You are right
> >>> that this info could become stale if the state changes dramatically,
> >>> but the collapse_isolate function will verify it and not collapse.
> >>
> >> If the state changes dramatically, the _isolate function will verify it,
> >> and fallback. And this fallback happens after following this costly
> >> path: retrieve a large folio from the buddy allocator -> swapin pages
> >> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
> >> on all CPUs -> fallback in _isolate().
> >> If you do fail in _isolate(), doesn't it make sense to get the updated
> >> state for the next fallback order immediately, because we have prior
> >> information that we failed because of PTE state? What your algorithm
> >> will do is *still* follow the costly path described above, and again
> >> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
> >> mine would.
> >
> > You do raise a valid point here, I can optimize my solution by
> > detecting certain collapse failure types and jump to the next scan.
> > I'll add that to my solution, thanks!
> >
> > As for the disagreement around the bitmap, we'll leave that up to the
> > community to decide since we have differing opinions/solutions.
> >
> >>
> >> The verification of the PTE state by the _isolate() function is the "no
> >> turning back" point of the algorithm. The verification by
> >> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
> >> it, before we do costly operations" point of the algorithm.
> >>
> >>>   From my testing I found this to rarely happen.
> >>
> >> Unfortunately, I am not very familiar with performance testing/load
> >> testing, I am fairly new to kernel programming, so I am getting there.
> >> But it really depends on the type of test you are running, what actually
> >> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
> >> would expect the PTE state to dramatically change. But still, no opinion
> >> here.
> >
> > Yeah there are probably some cases where it happens more often.
> > Probably in cases of short lived allocations, but khugepaged doesn't
> > run that frequently so those won't be that big of an issue.
> >
> > Our performance team is currently testing my implementation so I
> > should have more real workload test results soon. The redis testing
> > had some gains and didn't show any signs of obvious regressions.
> >
> > As for the testing, check out
> > https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
> > this does the tracing for my testing script. It can help you get
> > started. There are 3 different traces being applied there: the
> > bpftrace for collapse latencies, the perf record for the flamegraph
> > (not actually that useful, but may be useful to visualize any
> > weird/long paths that you may not have noticed), and the trace-cmd
> > which records the tracepoint of the scan and the collapse functions
> > then processes the data using the awk script-- the output being the
> > scan rate, the pages collapsed, and their result status (grouped by
> > order).
> >
> > You can also look into https://github.com/gormanm/mmtests for
> > testing/comparing kernels. I was running the
> > config-memdb-redis-benchmark-medium workload.
>
> Thanks. I'll take a look.
>
> >
> >>
> >>>
> >>> Also, khugepaged, my changes, and your changes are all a victim of
> >>> this. Once we drop the read lock (to either allocate the folio, or
> >>> right before acquiring the write_lock), the state can change. In your
> >>> case, yes, you are gathering more up to date information, but is it
> >>> really that important/worth it to retake locks and rescan for each
> >>> instance if we are about to reverify with the write lock taken?
> >>
> >> You said "reverify": You are removing the verification, so this step
> >> won't be reverification, it will be verification. We do not want to
> >> verify *after* we have already done 95% of latency-heavy stuff, only to
> >> know that we are going to fail.
> >>
> >> Algorithms in the kernel, in general, are of the following form: 1)
> >> Verify if a condition is true, resulting in taking a control path -> 2)
> >> do a lot of stuff -> "no turning back" step, wherein before committing
> >> (by taking locks, say), reverify if this is the control path we should
> >> be in. You are eliminating step 1).
> >>
> >> Therefore, I will have to say that I disagree with your approach.
> >>
> >> On top of this, in the subjective analysis in [1], point number 7 (along
> >> with point number 1) remains. And, point number 4 remains.
> >
> > for 1) your worst case of 1024 is not the worst case. There are 8
> > possible orders in your implementation, if all are enabled, that is
> > 4096 iterations in the worst case.
>
> Yes, that is exactly what I wrote in 1). I am still not convinced that
> the overhead you produce + 512 iterations is going to beat 4096
> iterations. Anyways, that is hand-waving and we should test this.
>
> > This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case.
>
> Sorry, I am missing something here; how does the number of iterations
> change with page size? Am I not scanning the PTE table, which is
> invariant to the page size?

I got the calculation wrong the first time and it's actually worse.
Let's hope I got this right this time.
On an ARM64 64k kernel:
PMD size = 512M
PTE = 64k
PTEs per PMD = 8192
log2(8192) = 13; 13 - 2 = 11 (m)THP sizes including PMD (the
first and second order are skipped)

Assuming I understand your algorithm correctly, in the worst case you
are scanning the whole PMD for each order.

So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.

Please let me know if I'm missing something here.
>
> >>
> >> [1]
> >> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
> >>
> >>>
> >>> So in my eyes, this is not a "problem"
> >>
> >> Looks like the kernel scheduled us for a high-priority debate, I hope
> >> there's no deadlock :)
> >>
> >>>
> >>> Cheers,
> >>> -- Nico
> >>>
> >>>
> >>>>
> >>>
> >>
> >
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-15  0:52           ` Nico Pache
@ 2025-02-15  6:38             ` Dev Jain
  2025-02-17  8:05               ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-15  6:38 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 15/02/25 6:22 am, Nico Pache wrote:
> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 14/02/25 1:09 am, Nico Pache wrote:
>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>>>>> The following series provides khugepaged and madvise collapse with the
>>>>>>> capability to collapse regions to mTHPs.
>>>>>>>
>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>>>>>> collapse.
>>>>>>>
>>>>>>> With the default max_ptes_none=511, the code should keep its most of its
>>>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>>>> constantly promote mTHPs to the next available size.
>>>>>>>
>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>>>> Patch 6-9:   The mTHP patches
>>>>>>>
>>>>>>> ---------
>>>>>>>      Testing
>>>>>>> ---------
>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>>>> - selftests mm
>>>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>>>        monitoring a number of stats and tracepoints. The code is available
>>>>>>>        here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>>>>>        The summary from my testings was that there was no significant regression
>>>>>>>        noticed through this test. In some cases my changes had better collapse
>>>>>>>        latencies, and was able to scan more pages in the same amount of time/work,
>>>>>>>        but for the most part the results were consistant.
>>>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>>>       (see followup post for more details).
>>>>>>> - some basic testing on 64k page size.
>>>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>>>
>>>>>>> Changes since V1 [2]:
>>>>>>> - Minor bug fixes discovered during review and testing
>>>>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>>>> - Updated trace events to include collapsing order info.
>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>>>>>        the same max_ptes_none scaling principle to achieve this.
>>>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>>>>>        some of the "creep" that was discovered in v1.
>>>>>>>
>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>>>
>>>>>>> Nico Pache (9):
>>>>>>>       introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>>>         madvise_collapse
>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>>>       khugepaged: add mTHP support
>>>>>>>       khugepaged: improve tracepoints for mTHP orders
>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
>>>>>>>
>>>>>>>      include/linux/khugepaged.h         |   4 +
>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
>>>>>>>      mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>>>
>>>>>>
>>>>>> Does this patchset suffer from the problem described here:
>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
>>>>> Hi Dev,
>>>>>
>>>>> Sorry I meant to get back to you about that.
>>>>>
>>>>> I understand your concern, but like I've mentioned before, the scan
>>>>> with the read lock was done so we dont have to do the more expensive
>>>>> locking, and could still gain insight into the state. You are right
>>>>> that this info could become stale if the state changes dramatically,
>>>>> but the collapse_isolate function will verify it and not collapse.
>>>>
>>>> If the state changes dramatically, the _isolate function will verify it,
>>>> and fallback. And this fallback happens after following this costly
>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
>>>> on all CPUs -> fallback in _isolate().
>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
>>>> state for the next fallback order immediately, because we have prior
>>>> information that we failed because of PTE state? What your algorithm
>>>> will do is *still* follow the costly path described above, and again
>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
>>>> mine would.
>>>
>>> You do raise a valid point here, I can optimize my solution by
>>> detecting certain collapse failure types and jump to the next scan.
>>> I'll add that to my solution, thanks!
>>>
>>> As for the disagreement around the bitmap, we'll leave that up to the
>>> community to decide since we have differing opinions/solutions.
>>>
>>>>
>>>> The verification of the PTE state by the _isolate() function is the "no
>>>> turning back" point of the algorithm. The verification by
>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
>>>> it, before we do costly operations" point of the algorithm.
>>>>
>>>>>    From my testing I found this to rarely happen.
>>>>
>>>> Unfortunately, I am not very familiar with performance testing/load
>>>> testing, I am fairly new to kernel programming, so I am getting there.
>>>> But it really depends on the type of test you are running, what actually
>>>> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
>>>> would expect the PTE state to dramatically change. But still, no opinion
>>>> here.
>>>
>>> Yeah there are probably some cases where it happens more often.
>>> Probably in cases of short lived allocations, but khugepaged doesn't
>>> run that frequently so those won't be that big of an issue.
>>>
>>> Our performance team is currently testing my implementation so I
>>> should have more real workload test results soon. The redis testing
>>> had some gains and didn't show any signs of obvious regressions.
>>>
>>> As for the testing, check out
>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
>>> this does the tracing for my testing script. It can help you get
>>> started. There are 3 different traces being applied there: the
>>> bpftrace for collapse latencies, the perf record for the flamegraph
>>> (not actually that useful, but may be useful to visualize any
>>> weird/long paths that you may not have noticed), and the trace-cmd
>>> which records the tracepoint of the scan and the collapse functions
>>> then processes the data using the awk script-- the output being the
>>> scan rate, the pages collapsed, and their result status (grouped by
>>> order).
>>>
>>> You can also look into https://github.com/gormanm/mmtests for
>>> testing/comparing kernels. I was running the
>>> config-memdb-redis-benchmark-medium workload.
>>
>> Thanks. I'll take a look.
>>
>>>
>>>>
>>>>>
>>>>> Also, khugepaged, my changes, and your changes are all a victim of
>>>>> this. Once we drop the read lock (to either allocate the folio, or
>>>>> right before acquiring the write_lock), the state can change. In your
>>>>> case, yes, you are gathering more up to date information, but is it
>>>>> really that important/worth it to retake locks and rescan for each
>>>>> instance if we are about to reverify with the write lock taken?
>>>>
>>>> You said "reverify": You are removing the verification, so this step
>>>> won't be reverification, it will be verification. We do not want to
>>>> verify *after* we have already done 95% of latency-heavy stuff, only to
>>>> know that we are going to fail.
>>>>
>>>> Algorithms in the kernel, in general, are of the following form: 1)
>>>> Verify if a condition is true, resulting in taking a control path -> 2)
>>>> do a lot of stuff -> "no turning back" step, wherein before committing
>>>> (by taking locks, say), reverify if this is the control path we should
>>>> be in. You are eliminating step 1).
>>>>
>>>> Therefore, I will have to say that I disagree with your approach.
>>>>
>>>> On top of this, in the subjective analysis in [1], point number 7 (along
>>>> with point number 1) remains. And, point number 4 remains.
>>>
>>> for 1) your worst case of 1024 is not the worst case. There are 8
>>> possible orders in your implementation, if all are enabled, that is
>>> 4096 iterations in the worst case.
>>
>> Yes, that is exactly what I wrote in 1). I am still not convinced that
>> the overhead you produce + 512 iterations is going to beat 4096
>> iterations. Anyways, that is hand-waving and we should test this.
>>
>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case.
>>
>> Sorry, I am missing something here; how does the number of iterations
>> change with page size? Am I not scanning the PTE table, which is
>> invariant to the page size?
> 
> I got the calculation wrong the first time and it's actually worst.
> Lets hope I got this right this time
> on ARM64 64k kernel:
> PMD size = 512M
> PTE= 64k
> PTEs per PMD = 8192

*facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR
wouldn't depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte)
= PAGE_SIZE / 8, so it does depend. You are correct: the number of PTEs
per PMD is 1 << 13.

> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the
> first and second order are skipped)
> 
> Assuming I understand your algorithm correctly, in the worst case you
> are scanning the whole PMD for each order.
> 
> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.

Yup. Now it seems that the bitmap overhead may just be worth it: for the
worst case the bitmap will give us an 11x saving, and for the average
case it will give us 2x, but still, 8192 is a large number. I'll think
of ways to test this out.

Btw, I was made aware that an LWN article just got posted on our work!
https://lwn.net/Articles/1009039/

> 
> Please let me know if I'm missing something here.
>>
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>>
>>>>>
>>>>> So in my eyes, this is not a "problem"
>>>>
>>>> Looks like the kernel scheduled us for a high-priority debate, I hope
>>>> there's no deadlock :)
>>>>
>>>>>
>>>>> Cheers,
>>>>> -- Nico
>>>>>
>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (9 preceding siblings ...)
  2025-02-11 12:49 ` [RFC v2 0/9] khugepaged: mTHP support Dev Jain
@ 2025-02-17  6:39 ` Dev Jain
  2025-02-17 19:15   ` Nico Pache
  2025-02-18 16:07 ` Ryan Roberts
  2025-02-19 17:00 ` Ryan Roberts
  12 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-17  6:39 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/25 6:00 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> constantly promote mTHPs to the next available size.

How does creep stop when max_ptes_none <= 255?

> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>   Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>     monitoring a number of stats and tracepoints. The code is available
>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>     The summary from my testings was that there was no significant regression
>     noticed through this test. In some cases my changes had better collapse
>     latencies, and was able to scan more pages in the same amount of time/work,
>     but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>    (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>     the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>     some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
>      madvise_collapse
>    khugepaged: rename hpage_collapse_* to khugepaged_*
>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>    khugepaged: generalize alloc_charge_folio for mTHP support
>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>    khugepaged: add mTHP support
>    khugepaged: improve tracepoints for mTHP orders
>    khugepaged: skip collapsing mTHP to smaller orders
> 
>   include/linux/khugepaged.h         |   4 +
>   include/trace/events/huge_memory.h |  34 ++-
>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>   3 files changed, 306 insertions(+), 154 deletions(-)
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-02-11  0:30 ` [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-02-17  7:27   ` Dev Jain
  2025-02-17 19:12   ` Usama Arif
  2025-02-19 16:28   ` Ryan Roberts
  2 siblings, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-02-17  7:27 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/25 6:00 am, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
> 
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> by default we will set this to order 3. The reasoning is that for 4K 512
> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> If a order is set to "always" lets always collapse to that order in a
> greedy manner.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   include/linux/khugepaged.h |  4 ++
>   mm/khugepaged.c            | 89 +++++++++++++++++++++++++++++++++++---
>   2 files changed, 86 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..1fe0c4fc9d37 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   #ifndef _LINUX_KHUGEPAGED_H
>   #define _LINUX_KHUGEPAGED_H
> +#define MIN_MTHP_ORDER	3
> +#define MIN_MTHP_NR	(1<<MIN_MTHP_ORDER)
> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) - MIN_MTHP_ORDER))
> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>   
>   extern unsigned int khugepaged_max_ptes_none __read_mostly;
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3776055bd477..c8048d9ec7fb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>   
>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>   
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>   struct collapse_control {
>   	bool is_khugepaged;
>   
> @@ -102,6 +107,15 @@ struct collapse_control {
>   
>   	/* nodemask for allocation fallback */
>   	nodemask_t alloc_nmask;
> +
> +	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>   };
>   
>   /**
> @@ -851,10 +865,6 @@ static void khugepaged_alloc_sleep(void)
>   	remove_wait_queue(&khugepaged_wait, &wait);
>   }
>   
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>   {
>   	int i;
> @@ -1112,7 +1122,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>   
>   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  u8 order, u16 offset)
>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> @@ -1130,8 +1141,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * The allocation can take potentially a long time if it involves
>   	 * sync compaction, and we do not need to hold the mmap_lock during
>   	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>   	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>   
>   	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)
> @@ -1266,12 +1281,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   out_up_write:
>   	mmap_write_unlock(mm);
>   out_nolock:
> +	*mmap_locked = false;
>   	if (folio)
>   		folio_put(folio);
>   	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>   	return result;
>   }
>   
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled
> +		if (!test_bit(order, &enabled_orders))
> +			goto next;
> +
> +		// copy the relavant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> +				>> (HPAGE_PMD_ORDER - state.order);
> +
> +		//Check if the region is "almost full" based on the threshold
> +		if (bits_set > threshold_bits
> +			|| test_bit(order, &huge_anon_orders_always)) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order, offset * MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += (1 << order);
> +				continue;
> +			}

If collapse_huge_page() fails due to hugepage_vma_revalidate() or 
find_pmd_or_thp_or_none(), you should exit.
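
Something along these lines, as a rough and untested sketch against the
hunk above (the helper is hypothetical and the exact set of "fatal"
result codes is up for debate):

			if (ret == SCAN_SUCCEED) {
				collapsed += (1 << order);
				continue;
			}
			/*
			 * If collapse_huge_page() failed because the mm/VMA/PMD
			 * is gone or no longer suitable (the results produced by
			 * hugepage_vma_revalidate() and find_pmd_or_thp_or_none(),
			 * e.g. SCAN_ANY_PROCESS, SCAN_VMA_CHECK, SCAN_PMD_MAPPED),
			 * smaller orders in this range cannot succeed either, so
			 * return instead of falling through to 'next:'.
			 */
			if (collapse_result_is_fatal(ret))	/* hypothetical helper */
				return collapsed;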


> +		}
> +
> +next:
> +		if (state.order > 0) {
> +			next_order = state.order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}
> +
>   static int khugepaged_scan_pmd(struct mm_struct *mm,
>   				   struct vm_area_struct *vma,
>   				   unsigned long address, bool *mmap_locked,
> @@ -1440,7 +1514,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
>   		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>   		/* collapse_huge_page will return with the mmap_lock released */
>   		*mmap_locked = false;
>   	}
> @@ -2856,6 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>   	mmdrop(mm);
>   	kfree(cc);
>   
> +
>   	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>   			: madvise_collapse_errno(last_fail);
>   }



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-15  6:38             ` Dev Jain
@ 2025-02-17  8:05               ` Dev Jain
  2025-02-17 19:19                 ` Nico Pache
  0 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-17  8:05 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 15/02/25 12:08 pm, Dev Jain wrote:
> 
> 
> On 15/02/25 6:22 am, Nico Pache wrote:
>> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> On 14/02/25 1:09 am, Nico Pache wrote:
>>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
>>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>>>>>> The following series provides khugepaged and madvise collapse 
>>>>>>>> with the
>>>>>>>> capability to collapse regions to mTHPs.
>>>>>>>>
>>>>>>>> To achieve this we generalize the khugepaged functions to no 
>>>>>>>> longer depend
>>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks 
>>>>>>>> of pages
>>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>>>>> using a bitmap. After the PMD scan is done, we do binary 
>>>>>>>> recursion on the
>>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The 
>>>>>>>> restriction
>>>>>>>> on max_ptes_none is removed during the scan, to make sure we 
>>>>>>>> account for
>>>>>>>> the whole PMD range. max_ptes_none will be scaled by the 
>>>>>>>> attempted collapse
>>>>>>>> order to determine how full a THP must be to be eligible. If a 
>>>>>>>> mTHP collapse
>>>>>>>> is attempted, but contains swapped out, or shared pages, we dont 
>>>>>>>> perform the
>>>>>>>> collapse.
>>>>>>>>
>>>>>>>> With the default max_ptes_none=511, the code should keep its 
>>>>>>>> most of its
>>>>>>>> original behavior. To exercise mTHP collapse we need to set 
>>>>>>>> max_ptes_none<=255.
>>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse 
>>>>>>>> "creep" and
>>>>>>>> constantly promote mTHPs to the next available size.
>>>>>>>>
>>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and 
>>>>>>>> khugepaged
>>>>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>>>>> Patch 6-9:   The mTHP patches
>>>>>>>>
>>>>>>>> ---------
>>>>>>>>      Testing
>>>>>>>> ---------
>>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>>>>> - selftests mm
>>>>>>>> - I created a test script that I used to push khugepaged to its 
>>>>>>>> limits while
>>>>>>>>        monitoring a number of stats and tracepoints. The code is 
>>>>>>>> available
>>>>>>>>        here[1] (Run in legacy mode for these changes and set 
>>>>>>>> mthp sizes to inherit)
>>>>>>>>        The summary from my testings was that there was no 
>>>>>>>> significant regression
>>>>>>>>        noticed through this test. In some cases my changes had 
>>>>>>>> better collapse
>>>>>>>>        latencies, and was able to scan more pages in the same 
>>>>>>>> amount of time/work,
>>>>>>>>        but for the most part the results were consistant.
>>>>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>>>>       (see followup post for more details).
>>>>>>>> - some basic testing on 64k page size.
>>>>>>>> - lots of general use. These changes have been running in my VM 
>>>>>>>> for some time.
>>>>>>>>
>>>>>>>> Changes since V1 [2]:
>>>>>>>> - Minor bug fixes discovered during review and testing
>>>>>>>> - removed dynamic allocations for bitmaps, and made them stack 
>>>>>>>> based
>>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>>>>> - Updated trace events to include collapsing order info.
>>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 
>>>>>>>> scale.
>>>>>>>> - No longer require a chunk to be fully utilized before setting 
>>>>>>>> the bit. Use
>>>>>>>>        the same max_ptes_none scaling principle to achieve this.
>>>>>>>> - Skip mTHP collapse that requires swapin or shared handling. 
>>>>>>>> This helps prevent
>>>>>>>>        some of the "creep" that was discovered in v1.
>>>>>>>>
>>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- 
>>>>>>>> npache@redhat.com/
>>>>>>>>
>>>>>>>> Nico Pache (9):
>>>>>>>>       introduce khugepaged_collapse_single_pmd to unify 
>>>>>>>> khugepaged and
>>>>>>>>         madvise_collapse
>>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP 
>>>>>>>> support
>>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP 
>>>>>>>> support
>>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>>>>       khugepaged: add mTHP support
>>>>>>>>       khugepaged: improve tracepoints for mTHP orders
>>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
>>>>>>>>
>>>>>>>>      include/linux/khugepaged.h         |   4 +
>>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
>>>>>>>>      mm/khugepaged.c                    | 422 ++++++++++++++++++ 
>>>>>>>> +----------
>>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>> Does this patchset suffer from the problem described here:
>>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- 
>>>>>>> c2d48d4963b6@arm.com/
>>>>>> Hi Dev,
>>>>>>
>>>>>> Sorry I meant to get back to you about that.
>>>>>>
>>>>>> I understand your concern, but like I've mentioned before, the scan
>>>>>> with the read lock was done so we dont have to do the more expensive
>>>>>> locking, and could still gain insight into the state. You are right
>>>>>> that this info could become stale if the state changes dramatically,
>>>>>> but the collapse_isolate function will verify it and not collapse.
>>>>>
>>>>> If the state changes dramatically, the _isolate function will 
>>>>> verify it,
>>>>> and fallback. And this fallback happens after following this costly
>>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
>>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB 
>>>>> flush
>>>>> on all CPUs -> fallback in _isolate().
>>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
>>>>> state for the next fallback order immediately, because we have prior
>>>>> information that we failed because of PTE state? What your algorithm
>>>>> will do is *still* follow the costly path described above, and again
>>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() 
>>>>> like
>>>>> mine would.
>>>>
>>>> You do raise a valid point here, I can optimize my solution by
>>>> detecting certain collapse failure types and jump to the next scan.
>>>> I'll add that to my solution, thanks!
>>>>
>>>> As for the disagreement around the bitmap, we'll leave that up to the
>>>> community to decide since we have differing opinions/solutions.
>>>>
>>>>>
>>>>> The verification of the PTE state by the _isolate() function is the 
>>>>> "no
>>>>> turning back" point of the algorithm. The verification by
>>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even 
>>>>> worth
>>>>> it, before we do costly operations" point of the algorithm.
>>>>>
>>>>>>    From my testing I found this to rarely happen.
>>>>>
>>>>> Unfortunately, I am not very familiar with performance testing/load
>>>>> testing, I am fairly new to kernel programming, so I am getting there.
>>>>> But it really depends on the type of test you are running, what 
>>>>> actually
>>>>> runs on memory-intensive systems, etc etc. In fact, on loaded 
>>>>> systems I
>>>>> would expect the PTE state to dramatically change. But still, no 
>>>>> opinion
>>>>> here.
>>>>
>>>> Yeah there are probably some cases where it happens more often.
>>>> Probably in cases of short lived allocations, but khugepaged doesn't
>>>> run that frequently so those won't be that big of an issue.
>>>>
>>>> Our performance team is currently testing my implementation so I
>>>> should have more real workload test results soon. The redis testing
>>>> had some gains and didn't show any signs of obvious regressions.
>>>>
>>>> As for the testing, check out
>>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record- 
>>>> khuge-performance.sh?ref_type=heads
>>>> this does the tracing for my testing script. It can help you get
>>>> started. There are 3 different traces being applied there: the
>>>> bpftrace for collapse latencies, the perf record for the flamegraph
>>>> (not actually that useful, but may be useful to visualize any
>>>> weird/long paths that you may not have noticed), and the trace-cmd
>>>> which records the tracepoint of the scan and the collapse functions
>>>> then processes the data using the awk script-- the output being the
>>>> scan rate, the pages collapsed, and their result status (grouped by
>>>> order).
>>>>
>>>> You can also look into https://github.com/gormanm/mmtests for
>>>> testing/comparing kernels. I was running the
>>>> config-memdb-redis-benchmark-medium workload.
>>>
>>> Thanks. I'll take a look.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Also, khugepaged, my changes, and your changes are all a victim of
>>>>>> this. Once we drop the read lock (to either allocate the folio, or
>>>>>> right before acquiring the write_lock), the state can change. In your
>>>>>> case, yes, you are gathering more up to date information, but is it
>>>>>> really that important/worth it to retake locks and rescan for each
>>>>>> instance if we are about to reverify with the write lock taken?
>>>>>
>>>>> You said "reverify": You are removing the verification, so this step
>>>>> won't be reverification, it will be verification. We do not want to
>>>>> verify *after* we have already done 95% of latency-heavy stuff, 
>>>>> only to
>>>>> know that we are going to fail.
>>>>>
>>>>> Algorithms in the kernel, in general, are of the following form: 1)
>>>>> Verify if a condition is true, resulting in taking a control path - 
>>>>> > 2)
>>>>> do a lot of stuff -> "no turning back" step, wherein before committing
>>>>> (by taking locks, say), reverify if this is the control path we should
>>>>> be in. You are eliminating step 1).
>>>>>
>>>>> Therefore, I will have to say that I disagree with your approach.
>>>>>
>>>>> On top of this, in the subjective analysis in [1], point number 7 
>>>>> (along
>>>>> with point number 1) remains. And, point number 4 remains.
>>>>
>>>> for 1) your worst case of 1024 is not the worst case. There are 8
>>>> possible orders in your implementation, if all are enabled, that is
>>>> 4096 iterations in the worst case.
>>>
>>> Yes, that is exactly what I wrote in 1). I am still not convinced that
>>> the overhead you produce + 512 iterations is going to beat 4096
>>> iterations. Anyways, that is hand-waving and we should test this.
>>>
>>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 
>>>> in my case.
>>>
>>> Sorry, I am missing something here; how does the number of iterations
>>> change with page size? Am I not scanning the PTE table, which is
>>> invariant to the page size?
>>
>> I got the calculation wrong the first time and it's actually worst.
>> Lets hope I got this right this time
>> on ARM64 64k kernel:
>> PMD size = 512M
>> PTE= 64k
>> PTEs per PMD = 8192
> 
> *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't 
> depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) = 
> PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1 
> << 13.
> 
>> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the
>> first and second order are skipped)
>>
>> Assuming I understand your algorithm correctly, in the worst case you
>> are scanning the whole PMD for each order.
>>
>> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.
> 
> Yup. Now it seems that the bitmap overhead may just be worth it; for the 
> worst case the bitmap will give us an 11x saving...for the average case, 
> it will give us 2x, but still, 8192 is a large number. I'll think of 

To clarify: the saving is w.r.t. the initial scan. That is, if the time
taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and
y is the bitmap overhead, then the time taken by DJ is 2x +
collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in
_isolate(). Anyhow, we differ in opinion as to where the max_ptes_*
check should be placed; I recalled the following:

https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/
https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/

One thing you can do to relieve one of my criticisms (not completely) is 
apply the following patch (this can be done in both methods):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b589f889bb5a..dc5cb602eaad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
  		}

  		vmf.orig_pte = ptep_get_lockless(pte);
-		if (!is_swap_pte(vmf.orig_pte))
+		if (!is_swap_pte(vmf.orig_pte)) {
  			continue;
+		} else {
+			if (order != HPAGE_PMD_ORDER) {
+				result = SCAN_EXCEED_SWAP_PTE;
+				goto out;
+			}
+		}

  		vmf.pte = pte;
  		vmf.ptl = ptl;
-- 

But this really is the same thing being done in the links above :)


> ways to test this out.
> 
> Btw, I was made aware that an LWN article just got posted on our work!
> https://lwn.net/Articles/1009039/
> 
>>
>> Please let me know if I'm missing something here.
>>>
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b- 
>>>>> aba4b1a441b4@arm.com/
>>>>>
>>>>>>
>>>>>> So in my eyes, this is not a "problem"
>>>>>
>>>>> Looks like the kernel scheduled us for a high-priority debate, I hope
>>>>> there's no deadlock :)
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> -- Nico
>>>>>>
>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 
> 



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-02-11  0:30 ` [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-02-17 17:11   ` Usama Arif
  2025-02-17 19:56     ` Nico Pache
  2025-02-18 16:26   ` Ryan Roberts
  1 sibling, 1 reply; 55+ messages in thread
From: Usama Arif @ 2025-02-17 17:11 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/2025 00:30, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
> 
> Create khugepaged_collapse_single_pmd to increase code
> reuse and create an entry point for future khugepaged changes.
> 
> Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> the new khugepaged_collapse_single_pmd function.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
>  1 file changed, 50 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5f0be134141e..46faee67378b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2365,6 +2365,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>  
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
> +				   struct vm_area_struct *vma, bool *mmap_locked,
> +				   struct collapse_control *cc)
> +{
> +	int result = SCAN_FAIL;
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> +
> +	if (!*mmap_locked) {
> +		mmap_read_lock(mm);
> +		*mmap_locked = true;
> +	}
> +
> +	if (thp_vma_allowable_order(vma, vma->vm_flags,
> +					tva_flags, PMD_ORDER)) {
> +		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> +			struct file *file = get_file(vma->vm_file);
> +			pgoff_t pgoff = linear_page_index(vma, addr);
> +
> +			mmap_read_unlock(mm);
> +			*mmap_locked = false;
> +			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> +							  cc);
> +			fput(file);
> +			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +				mmap_read_lock(mm);
> +				if (hpage_collapse_test_exit_or_disable(mm))
> +					goto end;
> +				result = collapse_pte_mapped_thp(mm, addr,
> +								 !cc->is_khugepaged);
> +				mmap_read_unlock(mm);
> +			}
> +		} else {
> +			result = hpage_collapse_scan_pmd(mm, vma, addr,
> +							 mmap_locked, cc);
> +		}
> +		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> +			++khugepaged_pages_collapsed;
> +	}
> +end:
> +	return result;
> +}
> +
>  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  					    struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
> @@ -2439,33 +2485,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
>  				  hend);
> -			if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> -				struct file *file = get_file(vma->vm_file);
> -				pgoff_t pgoff = linear_page_index(vma,
> -						khugepaged_scan.address);
>  
> -				mmap_read_unlock(mm);
> -				mmap_locked = false;
> -				*result = hpage_collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> -				fput(file);
> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> -					mmap_read_lock(mm);
> -					if (hpage_collapse_test_exit_or_disable(mm))
> -						goto breakouterloop;
> -					*result = collapse_pte_mapped_thp(mm,
> -						khugepaged_scan.address, false);
> -					if (*result == SCAN_PMD_MAPPED)
> -						*result = SCAN_SUCCEED;
> -					mmap_read_unlock(mm);
> -				}
> -			} else {
> -				*result = hpage_collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> -			}
> -
> -			if (*result == SCAN_SUCCEED)
> -				++khugepaged_pages_collapsed;
> +			*result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
> +						mm, vma, &mmap_locked, cc);
>  
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2785,36 +2807,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		mmap_assert_locked(mm);
>  		memset(cc->node_load, 0, sizeof(cc->node_load));
>  		nodes_clear(cc->alloc_nmask);
> -		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> -			struct file *file = get_file(vma->vm_file);
> -			pgoff_t pgoff = linear_page_index(vma, addr);
>  
> -			mmap_read_unlock(mm);
> -			mmap_locked = false;
> -			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> -							  cc);
> -			fput(file);
> -		} else {
> -			result = hpage_collapse_scan_pmd(mm, vma, addr,
> -							 &mmap_locked, cc);
> -		}
> +		result = khugepaged_collapse_single_pmd(addr, mm, vma, &mmap_locked, cc);
> +

You will be incrementing khugepaged_pages_collapsed in madvise_collapse by calling
khugepaged_collapse_single_pmd, which is not correct.

>  		if (!mmap_locked)
>  			*prev = NULL;  /* Tell caller we dropped mmap_lock */
>  
> -handle_result:
>  		switch (result) {
>  		case SCAN_SUCCEED:
>  		case SCAN_PMD_MAPPED:
>  			++thps;
>  			break;
>  		case SCAN_PTE_MAPPED_HUGEPAGE:
> -			BUG_ON(mmap_locked);
> -			BUG_ON(*prev);
> -			mmap_read_lock(mm);
> -			result = collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_read_unlock(mm);
> -			goto handle_result;
> -		/* Whitelisted set of results where continuing OK */
>  		case SCAN_PMD_NULL:
>  		case SCAN_PTE_NON_PRESENT:
>  		case SCAN_PTE_UFFD_WP:



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-02-11  0:30 ` [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
  2025-02-17  7:27   ` Dev Jain
@ 2025-02-17 19:12   ` Usama Arif
  2025-02-19 16:28   ` Ryan Roberts
  2 siblings, 0 replies; 55+ messages in thread
From: Usama Arif @ 2025-02-17 19:12 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/2025 00:30, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
> 
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.

nit:

s/MTHP_MIN_ORDER/MIN_MTHP_ORDER/


> by default we will set this to order 3. The reasoning is that for 4K 512
> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> If a order is set to "always" lets always collapse to that order in a
> greedy manner.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/khugepaged.h |  4 ++
>  mm/khugepaged.c            | 89 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 86 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..1fe0c4fc9d37 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
> +#define MIN_MTHP_ORDER	3
> +#define MIN_MTHP_NR	(1<<MIN_MTHP_ORDER)
> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) - MIN_MTHP_ORDER))
> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>  
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3776055bd477..c8048d9ec7fb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>  
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> @@ -102,6 +107,15 @@ struct collapse_control {
>  
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>  };
>  
>  /**
> @@ -851,10 +865,6 @@ static void khugepaged_alloc_sleep(void)
>  	remove_wait_queue(&khugepaged_wait, &wait);
>  }
>  
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
> @@ -1112,7 +1122,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  
>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  u8 order, u16 offset)
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> @@ -1130,8 +1141,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>  	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>  
>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
> @@ -1266,12 +1281,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  out_up_write:
>  	mmap_write_unlock(mm);
>  out_nolock:
> +	*mmap_locked = false;
>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>  	return result;
>  }
>  
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{

Introducing a function without using it will probably make the kernel test bot
and the compiler complain at this commit; you might want to merge this with the
next commit where you actually use it.

> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled
> +		if (!test_bit(order, &enabled_orders))
> +			goto next;
> +
> +		// copy the relavant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> +				>> (HPAGE_PMD_ORDER - state.order);
> +
> +		//Check if the region is "almost full" based on the threshold
> +		if (bits_set > threshold_bits
> +			|| test_bit(order, &huge_anon_orders_always)) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order, offset * MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += (1 << order);
> +				continue;
> +			}
> +		}
> +
> +next:
> +		if (state.order > 0) {
> +			next_order = state.order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}
> +
>  static int khugepaged_scan_pmd(struct mm_struct *mm,
>  				   struct vm_area_struct *vma,
>  				   unsigned long address, bool *mmap_locked,
> @@ -1440,7 +1514,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*mmap_locked = false;
>  	}
> @@ -2856,6 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	mmdrop(mm);
>  	kfree(cc);
>  
> +
>  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>  			: madvise_collapse_errno(last_fail);
>  }



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-17  6:39 ` Dev Jain
@ 2025-02-17 19:15   ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-17 19:15 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Sun, Feb 16, 2025 at 11:39 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/02/25 6:00 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If a mTHP collapse
> > is attempted, but contains swapped out, or shared pages, we dont perform the
> > collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> > constantly promote mTHPs to the next available size.
>
> How does creep stop when max_ptes_none <= 255?

Think of a 512kB mTHP region that is half utilized (but not
collapsed), and max_ptes_none = 255.

threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
                               >> (HPAGE_PMD_ORDER - state.order);

threshold_bits = (512 - 1 - 255) >> (9 - 4) = 256 >> 5 = 8

So more than 8 bits must be set, where the bitmap length for this
section is at most 16 bits. This means more than half of the mTHP has
to be utilized before collapse, and the collapse itself introduces
strictly less than half of an empty mTHP, so the next larger order does
not become eligible from the collapse alone. If a collapse could
introduce half (or more) of an empty mTHP, the region would be eligible
again next round for another promotion; with max_ptes_none = 256,
threshold_bits = 7 and collapsing does create enough non-none pages to
collapse again.
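
If it helps, here is a quick userspace sketch of that scaling (just the
threshold arithmetic, not kernel code; it assumes 4K pages, so
HPAGE_PMD_ORDER = 9 and HPAGE_PMD_NR = 512, and MIN_MTHP_ORDER = 3 as
in patch 6):

#include <stdio.h>

#define HPAGE_PMD_ORDER 9              /* 4K pages: 512 PTEs per PMD */
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)
#define MIN_MTHP_ORDER  3

int main(void)
{
	int max_ptes_none[] = { 255, 256 };

	for (int i = 0; i < 2; i++) {
		printf("max_ptes_none = %d\n", max_ptes_none[i]);
		for (int order = MIN_MTHP_ORDER; order <= HPAGE_PMD_ORDER; order++) {
			int state_order = order - MIN_MTHP_ORDER;
			/* bits in the bitmap section covering one region of this order */
			int section_bits = 1 << state_order;
			/* same scaling as in khugepaged_scan_bitmap() */
			int threshold = (HPAGE_PMD_NR - max_ptes_none[i] - 1)
					>> (HPAGE_PMD_ORDER - state_order);
			printf("  order %2d: need > %2d of %2d bits set\n",
			       order, threshold, section_bits);
		}
	}
	return 0;
}

With 255 each order needs strictly more than half of its section set;
dropping to 256 lets the larger orders get by with exactly half, which
is where the creep starts.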

>
> >
> > Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2:     Refactor/rename hpage_collapse
> > Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> > Patch 6-9:   The mTHP patches
> >
> > ---------
> >   Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >     monitoring a number of stats and tracepoints. The code is available
> >     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >     The summary from my testings was that there was no significant regression
> >     noticed through this test. In some cases my changes had better collapse
> >     latencies, and was able to scan more pages in the same amount of time/work,
> >     but for the most part the results were consistant.
> > - redis testing. I tested these changes along with my defer changes
> >    (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >     the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >     some of the "creep" that was discovered in v1.
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >    introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >      madvise_collapse
> >    khugepaged: rename hpage_collapse_* to khugepaged_*
> >    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >    khugepaged: generalize alloc_charge_folio for mTHP support
> >    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >    khugepaged: add mTHP support
> >    khugepaged: improve tracepoints for mTHP orders
> >    khugepaged: skip collapsing mTHP to smaller orders
> >
> >   include/linux/khugepaged.h         |   4 +
> >   include/trace/events/huge_memory.h |  34 ++-
> >   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >   3 files changed, 306 insertions(+), 154 deletions(-)
> >
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-17  8:05               ` Dev Jain
@ 2025-02-17 19:19                 ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-17 19:19 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Mon, Feb 17, 2025 at 1:06 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 15/02/25 12:08 pm, Dev Jain wrote:
> >
> >
> > On 15/02/25 6:22 am, Nico Pache wrote:
> >> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
> >>>
> >>>
> >>>
> >>> On 14/02/25 1:09 am, Nico Pache wrote:
> >>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
> >>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
> >>>>>>>> The following series provides khugepaged and madvise collapse
> >>>>>>>> with the
> >>>>>>>> capability to collapse regions to mTHPs.
> >>>>>>>>
> >>>>>>>> To achieve this we generalize the khugepaged functions to no
> >>>>>>>> longer depend
> >>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks
> >>>>>>>> of pages
> >>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>>>>>>> using a bitmap. After the PMD scan is done, we do binary
> >>>>>>>> recursion on the
> >>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The
> >>>>>>>> restriction
> >>>>>>>> on max_ptes_none is removed during the scan, to make sure we
> >>>>>>>> account for
> >>>>>>>> the whole PMD range. max_ptes_none will be scaled by the
> >>>>>>>> attempted collapse
> >>>>>>>> order to determine how full a THP must be to be eligible. If a
> >>>>>>>> mTHP collapse
> >>>>>>>> is attempted, but contains swapped out, or shared pages, we dont
> >>>>>>>> perform the
> >>>>>>>> collapse.
> >>>>>>>>
> >>>>>>>> With the default max_ptes_none=511, the code should keep its
> >>>>>>>> most of its
> >>>>>>>> original behavior. To exercise mTHP collapse we need to set
> >>>>>>>> max_ptes_none<=255.
> >>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse
> >>>>>>>> "creep" and
> >>>>>>>> constantly promote mTHPs to the next available size.
> >>>>>>>>
> >>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and
> >>>>>>>> khugepaged
> >>>>>>>> Patch 2:     Refactor/rename hpage_collapse
> >>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>>>>>>> Patch 6-9:   The mTHP patches
> >>>>>>>>
> >>>>>>>> ---------
> >>>>>>>>      Testing
> >>>>>>>> ---------
> >>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>>>>>>> - selftests mm
> >>>>>>>> - I created a test script that I used to push khugepaged to its
> >>>>>>>> limits while
> >>>>>>>>        monitoring a number of stats and tracepoints. The code is
> >>>>>>>> available
> >>>>>>>>        here[1] (Run in legacy mode for these changes and set
> >>>>>>>> mthp sizes to inherit)
> >>>>>>>>        The summary from my testings was that there was no
> >>>>>>>> significant regression
> >>>>>>>>        noticed through this test. In some cases my changes had
> >>>>>>>> better collapse
> >>>>>>>>        latencies, and was able to scan more pages in the same
> >>>>>>>> amount of time/work,
> >>>>>>>>        but for the most part the results were consistant.
> >>>>>>>> - redis testing. I tested these changes along with my defer changes
> >>>>>>>>       (see followup post for more details).
> >>>>>>>> - some basic testing on 64k page size.
> >>>>>>>> - lots of general use. These changes have been running in my VM
> >>>>>>>> for some time.
> >>>>>>>>
> >>>>>>>> Changes since V1 [2]:
> >>>>>>>> - Minor bug fixes discovered during review and testing
> >>>>>>>> - removed dynamic allocations for bitmaps, and made them stack
> >>>>>>>> based
> >>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>>>>>>> - Updated trace events to include collapsing order info.
> >>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100
> >>>>>>>> scale.
> >>>>>>>> - No longer require a chunk to be fully utilized before setting
> >>>>>>>> the bit. Use
> >>>>>>>>        the same max_ptes_none scaling principle to achieve this.
> >>>>>>>> - Skip mTHP collapse that requires swapin or shared handling.
> >>>>>>>> This helps prevent
> >>>>>>>>        some of the "creep" that was discovered in v1.
> >>>>>>>>
> >>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-
> >>>>>>>> npache@redhat.com/
> >>>>>>>>
> >>>>>>>> Nico Pache (9):
> >>>>>>>>       introduce khugepaged_collapse_single_pmd to unify
> >>>>>>>> khugepaged and
> >>>>>>>>         madvise_collapse
> >>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP
> >>>>>>>> support
> >>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP
> >>>>>>>> support
> >>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>>>>       khugepaged: add mTHP support
> >>>>>>>>       khugepaged: improve tracepoints for mTHP orders
> >>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
> >>>>>>>>
> >>>>>>>>      include/linux/khugepaged.h         |   4 +
> >>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
> >>>>>>>>      mm/khugepaged.c                    | 422 ++++++++++++++++++
> >>>>>>>> +----------
> >>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
> >>>>>>>>
> >>>>>>>
> >>>>>>> Does this patchset suffer from the problem described here:
> >>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-
> >>>>>>> c2d48d4963b6@arm.com/
> >>>>>> Hi Dev,
> >>>>>>
> >>>>>> Sorry I meant to get back to you about that.
> >>>>>>
> >>>>>> I understand your concern, but like I've mentioned before, the scan
> >>>>>> with the read lock was done so we dont have to do the more expensive
> >>>>>> locking, and could still gain insight into the state. You are right
> >>>>>> that this info could become stale if the state changes dramatically,
> >>>>>> but the collapse_isolate function will verify it and not collapse.
> >>>>>
> >>>>> If the state changes dramatically, the _isolate function will
> >>>>> verify it,
> >>>>> and fallback. And this fallback happens after following this costly
> >>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
> >>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB
> >>>>> flush
> >>>>> on all CPUs -> fallback in _isolate().
> >>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
> >>>>> state for the next fallback order immediately, because we have prior
> >>>>> information that we failed because of PTE state? What your algorithm
> >>>>> will do is *still* follow the costly path described above, and again
> >>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd()
> >>>>> like
> >>>>> mine would.
> >>>>
> >>>> You do raise a valid point here, I can optimize my solution by
> >>>> detecting certain collapse failure types and jump to the next scan.
> >>>> I'll add that to my solution, thanks!
> >>>>
> >>>> As for the disagreement around the bitmap, we'll leave that up to the
> >>>> community to decide since we have differing opinions/solutions.
> >>>>
> >>>>>
> >>>>> The verification of the PTE state by the _isolate() function is the
> >>>>> "no
> >>>>> turning back" point of the algorithm. The verification by
> >>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even
> >>>>> worth
> >>>>> it, before we do costly operations" point of the algorithm.
> >>>>>
> >>>>>>    From my testing I found this to rarely happen.
> >>>>>
> >>>>> Unfortunately, I am not very familiar with performance testing/load
> >>>>> testing, I am fairly new to kernel programming, so I am getting there.
> >>>>> But it really depends on the type of test you are running, what
> >>>>> actually
> >>>>> runs on memory-intensive systems, etc etc. In fact, on loaded
> >>>>> systems I
> >>>>> would expect the PTE state to dramatically change. But still, no
> >>>>> opinion
> >>>>> here.
> >>>>
> >>>> Yeah there are probably some cases where it happens more often.
> >>>> Probably in cases of short lived allocations, but khugepaged doesn't
> >>>> run that frequently so those won't be that big of an issue.
> >>>>
> >>>> Our performance team is currently testing my implementation so I
> >>>> should have more real workload test results soon. The redis testing
> >>>> had some gains and didn't show any signs of obvious regressions.
> >>>>
> >>>> As for the testing, check out
> >>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-
> >>>> khuge-performance.sh?ref_type=heads
> >>>> this does the tracing for my testing script. It can help you get
> >>>> started. There are 3 different traces being applied there: the
> >>>> bpftrace for collapse latencies, the perf record for the flamegraph
> >>>> (not actually that useful, but may be useful to visualize any
> >>>> weird/long paths that you may not have noticed), and the trace-cmd
> >>>> which records the tracepoint of the scan and the collapse functions
> >>>> then processes the data using the awk script-- the output being the
> >>>> scan rate, the pages collapsed, and their result status (grouped by
> >>>> order).
> >>>>
> >>>> You can also look into https://github.com/gormanm/mmtests for
> >>>> testing/comparing kernels. I was running the
> >>>> config-memdb-redis-benchmark-medium workload.
> >>>
> >>> Thanks. I'll take a look.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Also, khugepaged, my changes, and your changes are all a victim of
> >>>>>> this. Once we drop the read lock (to either allocate the folio, or
> >>>>>> right before acquiring the write_lock), the state can change. In your
> >>>>>> case, yes, you are gathering more up to date information, but is it
> >>>>>> really that important/worth it to retake locks and rescan for each
> >>>>>> instance if we are about to reverify with the write lock taken?
> >>>>>
> >>>>> You said "reverify": You are removing the verification, so this step
> >>>>> won't be reverification, it will be verification. We do not want to
> >>>>> verify *after* we have already done 95% of latency-heavy stuff,
> >>>>> only to
> >>>>> know that we are going to fail.
> >>>>>
> >>>>> Algorithms in the kernel, in general, are of the following form: 1)
> >>>>> Verify if a condition is true, resulting in taking a control path -
> >>>>> > 2)
> >>>>> do a lot of stuff -> "no turning back" step, wherein before committing
> >>>>> (by taking locks, say), reverify if this is the control path we should
> >>>>> be in. You are eliminating step 1).
> >>>>>
> >>>>> Therefore, I will have to say that I disagree with your approach.
> >>>>>
> >>>>> On top of this, in the subjective analysis in [1], point number 7
> >>>>> (along
> >>>>> with point number 1) remains. And, point number 4 remains.
> >>>>
> >>>> for 1) your worst case of 1024 is not the worst case. There are 8
> >>>> possible orders in your implementation, if all are enabled, that is
> >>>> 4096 iterations in the worst case.
> >>>
> >>> Yes, that is exactly what I wrote in 1). I am still not convinced that
> >>> the overhead you produce + 512 iterations is going to beat 4096
> >>> iterations. Anyways, that is hand-waving and we should test this.
> >>>
> >>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096
> >>>> in my case.
> >>>
> >>> Sorry, I am missing something here; how does the number of iterations
> >>> change with page size? Am I not scanning the PTE table, which is
> >>> invariant to the page size?
> >>
> >> I got the calculation wrong the first time and it's actually worst.
> >> Lets hope I got this right this time
> >> on ARM64 64k kernel:
> >> PMD size = 512M
> >> PTE= 64k
> >> PTEs per PMD = 8192
> >
> > *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't
> > depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) =
> > PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1
> > << 13.
> >
> >> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the
> >> first and second order are skipped)
> >>
> >> Assuming I understand your algorithm correctly, in the worst case you
> >> are scanning the whole PMD for each order.
> >>
> >> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.
> >
> > Yup. Now it seems that the bitmap overhead may just be worth it; for the
> > worst case the bitmap will give us an 11x saving...for the average case,
> > it will give us 2x, but still, 8192 is a large number. I'll think of
>
> Clearing on this: the saving is w.r.t the initial scan. That is, if time
> taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and
> y is the bitmap overhead, then time taken by DJ is 2x +
> collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in
> _isolate(). Anyhow, we differ in opinion as to where the max_ptes_*
> check should be placed; I recalled the following:
>
> https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/
> https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/
>
> One thing you can do to relieve one of my criticisms (not completely) is
> apply the following patch (this can be done in both methods):
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b589f889bb5a..dc5cb602eaad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct
> mm_struct *mm,
>                 }
>
>                 vmf.orig_pte = ptep_get_lockless(pte);
> -               if (!is_swap_pte(vmf.orig_pte))
> +               if (!is_swap_pte(vmf.orig_pte)) {
>                         continue;
> +               } else {
> +                       if (order != HPAGE_PMD_ORDER) {
> +                               result = SCAN_EXCEED_SWAP_PTE;
> +                               goto out;
> +                       }
> +               }
>
>                 vmf.pte = pte;
>                 vmf.ptl = ptl;
> --
>
> But this really is the same thing being done in the links above :)

Yes, my code already does this:
@@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
                if (!is_swap_pte(vmf.orig_pte))
                        continue;

+               if (order != HPAGE_PMD_ORDER) {
+                       result = SCAN_EXCEED_SWAP_PTE;
+                       goto out;
+               }
+
No need for the if/else nesting.
>
>
> > ways to test this out.
> >
> > Btw, I was made aware that an LWN article just got posted on our work!
> > https://lwn.net/Articles/1009039/
> >
> >>
> >> Please let me know if I'm missing something here.
> >>>
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-
> >>>>> aba4b1a441b4@arm.com/
> >>>>>
> >>>>>>
> >>>>>> So in my eyes, this is not a "problem"
> >>>>>
> >>>>> Looks like the kernel scheduled us for a high-priority debate, I hope
> >>>>> there's no deadlock :)
> >>>>>
> >>>>>>
> >>>>>> Cheers,
> >>>>>> -- Nico
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-02-17 17:11   ` Usama Arif
@ 2025-02-17 19:56     ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-17 19:56 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Mon, Feb 17, 2025 at 10:11 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create khugepaged_collapse_single_pmd to increase code
> > reuse and create an entry point for future khugepaged changes.
> >
> > Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> > the new khugepaged_collapse_single_pmd function.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
> >  1 file changed, 50 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5f0be134141e..46faee67378b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2365,6 +2365,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >  }
> >  #endif
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
> > +                                struct vm_area_struct *vma, bool *mmap_locked,
> > +                                struct collapse_control *cc)
> > +{
> > +     int result = SCAN_FAIL;
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> > +
> > +     if (!*mmap_locked) {
> > +             mmap_read_lock(mm);
> > +             *mmap_locked = true;
> > +     }
> > +
> > +     if (thp_vma_allowable_order(vma, vma->vm_flags,
> > +                                     tva_flags, PMD_ORDER)) {
> > +             if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> > +                     struct file *file = get_file(vma->vm_file);
> > +                     pgoff_t pgoff = linear_page_index(vma, addr);
> > +
> > +                     mmap_read_unlock(mm);
> > +                     *mmap_locked = false;
> > +                     result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > +                                                       cc);
> > +                     fput(file);
> > +                     if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > +                             mmap_read_lock(mm);
> > +                             if (hpage_collapse_test_exit_or_disable(mm))
> > +                                     goto end;
> > +                             result = collapse_pte_mapped_thp(mm, addr,
> > +                                                              !cc->is_khugepaged);
> > +                             mmap_read_unlock(mm);
> > +                     }
> > +             } else {
> > +                     result = hpage_collapse_scan_pmd(mm, vma, addr,
> > +                                                      mmap_locked, cc);
> > +             }
> > +             if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> > +                     ++khugepaged_pages_collapsed;
> > +     }
> > +end:
> > +     return result;
> > +}
> > +
> >  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                           struct collapse_control *cc)
> >       __releases(&khugepaged_mm_lock)
> > @@ -2439,33 +2485,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                       VM_BUG_ON(khugepaged_scan.address < hstart ||
> >                                 khugepaged_scan.address + HPAGE_PMD_SIZE >
> >                                 hend);
> > -                     if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> > -                             struct file *file = get_file(vma->vm_file);
> > -                             pgoff_t pgoff = linear_page_index(vma,
> > -                                             khugepaged_scan.address);
> >
> > -                             mmap_read_unlock(mm);
> > -                             mmap_locked = false;
> > -                             *result = hpage_collapse_scan_file(mm,
> > -                                     khugepaged_scan.address, file, pgoff, cc);
> > -                             fput(file);
> > -                             if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > -                                     mmap_read_lock(mm);
> > -                                     if (hpage_collapse_test_exit_or_disable(mm))
> > -                                             goto breakouterloop;
> > -                                     *result = collapse_pte_mapped_thp(mm,
> > -                                             khugepaged_scan.address, false);
> > -                                     if (*result == SCAN_PMD_MAPPED)
> > -                                             *result = SCAN_SUCCEED;
> > -                                     mmap_read_unlock(mm);
> > -                             }
> > -                     } else {
> > -                             *result = hpage_collapse_scan_pmd(mm, vma,
> > -                                     khugepaged_scan.address, &mmap_locked, cc);
> > -                     }
> > -
> > -                     if (*result == SCAN_SUCCEED)
> > -                             ++khugepaged_pages_collapsed;
> > +                     *result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
> > +                                             mm, vma, &mmap_locked, cc);
> >
> >                       /* move to next address */
> >                       khugepaged_scan.address += HPAGE_PMD_SIZE;
> > @@ -2785,36 +2807,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >               mmap_assert_locked(mm);
> >               memset(cc->node_load, 0, sizeof(cc->node_load));
> >               nodes_clear(cc->alloc_nmask);
> > -             if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> > -                     struct file *file = get_file(vma->vm_file);
> > -                     pgoff_t pgoff = linear_page_index(vma, addr);
> >
> > -                     mmap_read_unlock(mm);
> > -                     mmap_locked = false;
> > -                     result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > -                                                       cc);
> > -                     fput(file);
> > -             } else {
> > -                     result = hpage_collapse_scan_pmd(mm, vma, addr,
> > -                                                      &mmap_locked, cc);
> > -             }
> > +             result = khugepaged_collapse_single_pmd(addr, mm, vma, &mmap_locked, cc);
> > +
>
> you will be incrementing khugepaged_pages_collapsed at madvise_collapse by calling
> khugepaged_collapse_single_pmd which is not correct.

Ah, good catch. I'll add a conditional around the incrementation. Thanks!
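
Something like this (untested sketch, the exact check may still change)
in khugepaged_collapse_single_pmd():

		if (cc->is_khugepaged &&
		    (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED))
			++khugepaged_pages_collapsed;

so the counter only moves for the khugepaged path and madvise_collapse
leaves it alone.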

>
> >               if (!mmap_locked)
> >                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> >
> > -handle_result:
> >               switch (result) {
> >               case SCAN_SUCCEED:
> >               case SCAN_PMD_MAPPED:
> >                       ++thps;
> >                       break;
> >               case SCAN_PTE_MAPPED_HUGEPAGE:
> > -                     BUG_ON(mmap_locked);
> > -                     BUG_ON(*prev);
> > -                     mmap_read_lock(mm);
> > -                     result = collapse_pte_mapped_thp(mm, addr, true);
> > -                     mmap_read_unlock(mm);
> > -                     goto handle_result;
> > -             /* Whitelisted set of results where continuing OK */
> >               case SCAN_PMD_NULL:
> >               case SCAN_PTE_NON_PRESENT:
> >               case SCAN_PTE_UFFD_WP:
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
  2025-02-12 17:04   ` Usama Arif
@ 2025-02-17 20:55   ` Usama Arif
  2025-02-17 21:22     ` Nico Pache
  2025-02-18  4:22   ` Dev Jain
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Usama Arif @ 2025-02-17 20:55 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/2025 00:30, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential collapse candidates, keep track
> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> of max_ptes_none during the scan phase so we dont bailout early and miss
> potential mTHP candidates.
> 
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
> 
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 83 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c8048d9ec7fb..cd310989725b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;
>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	/*
> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		*mmap_locked = false;
>  	}
>  
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
>  
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>  	/*
>  	 * Prevent all access to pagetables with the exception of
>  	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	vma_start_write(vma);
>  	anon_vma_lock_write(vma->anon_vma);
>  
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>  	/*
>  	 * This removes any huge TLB entry from the CPU so we won't allow
>  	 * huge and small TLB entries for the same virtual address to
> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>  
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	anon_vma_unlock_write(vma->anon_vma);
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { //mTHP
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);
> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		spin_lock(pte_ptl);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +		spin_unlock(pte_ptl);
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	}
>  
>  	folio = NULL;
>  
> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> +	int i;
>  	int result = SCAN_FAIL, referenced = 0;
>  	int none_or_zero = 0, shared = 0;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long _address;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>  	bool writable = false;
> -
> +	int chunk_none_count = 0;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
>  
> +	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, _address += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		if (i % MIN_MTHP_NR == 0)
> +			chunk_none_count = 0;
> +
> +		_pte = pte + i;
> +		_address = address + i * PAGE_SIZE;
>  		pte_t pteval = ptep_get(_pte);
>  		if (is_swap_pte(pteval)) {
>  			++unmapped;
> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			++chunk_none_count;
>  			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (userfaultfd_armed(vma)) {

It's likely you might introduce a regression in khugepaged CPU usage here.

For Intel x86 machines that don't have TLB coalescing (AMD) or contpte (ARM),
there is a reduced benefit to mTHPs, and I feel like a lot of people will never
change from the current kernel default, i.e. 2M THPs with fallback to 4K.

If you are only collapsing to 2M hugepages, bailing out early once
none_or_zero exceeds khugepaged_max_ptes_none is a good optimization, and
getting rid of it could cause a regression. It might look a bit out of place,
but is it possible to keep this max_ptes_none restriction during the
scanning phase when only PMD-mappable THPs are allowed?



>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out_unmap;
>  			}
> +			continue;
>  		}
>  		if (pte_uffd_wp(pteval)) {
>  			/*
> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>  								     address)))
>  			referenced++;
> +
> +		/*
> +		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> +		 * pages keep track of it in the bitmap for mTHP collapsing.
> +		 */
> +		if (chunk_none_count < scaled_none &&
> +			(i + 1) % MIN_MTHP_NR == 0)
> +			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>  	}
> +
>  	if (!writable) {
>  		result = SCAN_PAGE_RO;
>  	} else if (cc->is_khugepaged &&
> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			tva_flags, THP_ORDERS_ALL_ANON);
> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> +			       mmap_locked, enabled_orders);
> +		if (result > 0)
> +			result = SCAN_SUCCEED;
> +		else
> +			result = SCAN_FAIL;
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>  			fput(file);
>  			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>  				mmap_read_lock(mm);
> +				*mmap_locked = true;
>  				if (khugepaged_test_exit_or_disable(mm))
>  					goto end;
>  				result = collapse_pte_mapped_thp(mm, addr,
>  								 !cc->is_khugepaged);
>  				mmap_read_unlock(mm);
> +				*mmap_locked = false;
>  			}
>  		} else {
>  			result = khugepaged_scan_pmd(mm, vma, addr,



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-17 20:55   ` Usama Arif
@ 2025-02-17 21:22     ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-17 21:22 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Mon, Feb 17, 2025 at 1:55 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning a PMD range for potential collapse candidates, keep track
> > of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> > utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> > of max_ptes_none during the scan phase so we dont bailout early and miss
> > potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 83 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c8048d9ec7fb..cd310989725b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /*
> > @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
> > @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> > @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       pte_unmap(pte);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> > @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { //mTHP
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             spin_lock(pte_ptl);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +             spin_unlock(pte_ptl);
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     }
> >
> >       folio = NULL;
> >
> > @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  {
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> > +     int i;
> >       int result = SCAN_FAIL, referenced = 0;
> >       int none_or_zero = 0, shared = 0;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long _address;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >       bool writable = false;
> > -
> > +     int chunk_none_count = 0;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, _address += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             if (i % MIN_MTHP_NR == 0)
> > +                     chunk_none_count = 0;
> > +
> > +             _pte = pte + i;
> > +             _address = address + i * PAGE_SIZE;
> >               pte_t pteval = ptep_get(_pte);
> >               if (is_swap_pte(pteval)) {
> >                       ++unmapped;
> > @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                     ++chunk_none_count;
> >                       ++none_or_zero;
> > -                     if (!userfaultfd_armed(vma) &&
> > -                         (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > -                             continue;
> > -                     } else {
> > +                     if (userfaultfd_armed(vma)) {
>
> Its likely you might introduce a regression in khugepaged cpu usage over here.
>
> For intel x86 machines that dont have TLB coalescing (AMD) or contpte (ARM),
> there is a reduced benefit of mTHPs, I feel like a lot of people will never
> change from the current kernel default, i.e. 2M THPs and fallback to 4K.
>
> If you are only parsing 2M hugepages, early bailout when
> none_or_zero <= khugepaged_max_ptes_none is a good optimization, and getting
> rid of that will cause a regression? It might look a bit out of place,
> but is it possible to keep this restriction of max_ptes_none during the
> scanning phase if only PMD mappable THPs are allowed?

That seems like a reasonable request; I'll try to implement it. Thanks!
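
Roughly what I have in mind, as a sketch only: it assumes enabled_orders gets
computed before the scan loop rather than after it, and pmd_only is a
hypothetical local.

	/* Sketch: if only the PMD order is enabled for this VMA, keep the
	 * old early bailout so khugepaged does not keep scanning a PMD
	 * range that already has too many empty PTEs. */
	bool pmd_only = (enabled_orders == BIT(HPAGE_PMD_ORDER));

	if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
		++chunk_none_count;
		++none_or_zero;
		if (userfaultfd_armed(vma) ||
		    (pmd_only && cc->is_khugepaged &&
		     none_or_zero > khugepaged_max_ptes_none)) {
			result = SCAN_EXCEED_NONE_PTE;
			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
			goto out_unmap;
		}
		continue;
	}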

>
>
>
> >                               result = SCAN_EXCEED_NONE_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                               goto out_unmap;
> >                       }
> > +                     continue;
> >               }
> >               if (pte_uffd_wp(pteval)) {
> >                       /*
> > @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                    folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
> >                                                                    address)))
> >                       referenced++;
> > +
> > +             /*
> > +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> > +              * pages keep track of it in the bitmap for mTHP collapsing.
> > +              */
> > +             if (chunk_none_count < scaled_none &&
> > +                     (i + 1) % MIN_MTHP_NR == 0)
> > +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
> >       }
> > +
> >       if (!writable) {
> >               result = SCAN_PAGE_RO;
> >       } else if (cc->is_khugepaged &&
> > @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > -             result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +                     tva_flags, THP_ORDERS_ALL_ANON);
> > +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > +                            mmap_locked, enabled_orders);
> > +             if (result > 0)
> > +                     result = SCAN_SUCCEED;
> > +             else
> > +                     result = SCAN_FAIL;
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
> >                       fput(file);
> >                       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >                               mmap_read_lock(mm);
> > +                             *mmap_locked = true;
> >                               if (khugepaged_test_exit_or_disable(mm))
> >                                       goto end;
> >                               result = collapse_pte_mapped_thp(mm, addr,
> >                                                                !cc->is_khugepaged);
> >                               mmap_read_unlock(mm);
> > +                             *mmap_locked = false;
> >                       }
> >               } else {
> >                       result = khugepaged_scan_pmd(mm, vma, addr,
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
  2025-02-12 17:04   ` Usama Arif
  2025-02-17 20:55   ` Usama Arif
@ 2025-02-18  4:22   ` Dev Jain
  2025-03-03 19:18     ` Nico Pache
  2025-02-19 16:52   ` Ryan Roberts
  2025-03-07  6:38   ` Dev Jain
  4 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-18  4:22 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/25 6:00 am, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential collapse candidates, keep track
> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> of max_ptes_none during the scan phase so we dont bailout early and miss
> potential mTHP candidates.
> 
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
> 
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>   1 file changed, 83 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c8048d9ec7fb..cd310989725b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;
>   	pgtable_t pgtable;
>   	struct folio *folio;
>   	spinlock_t *pmd_ptl, *pte_ptl;
>   	int result = SCAN_FAIL;
>   	struct vm_area_struct *vma;
>   	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
>   	/*
> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   		*mmap_locked = false;
>   	}
>   
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>   	if (result != SCAN_SUCCEED)
>   		goto out_nolock;
>   
>   	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>   	if (result != SCAN_SUCCEED) {
>   		mmap_read_unlock(mm);
>   		goto out_nolock;
> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   		 * released when it fails. So we jump out_nolock directly in
>   		 * that case.  Continuing to collapse causes inconsistency.
>   		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>   		if (result != SCAN_SUCCEED)
>   			goto out_nolock;
>   	}
>   
>   	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>   	/*
>   	 * Prevent all access to pagetables with the exception of
>   	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * mmap_lock.
>   	 */
>   	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>   	if (result != SCAN_SUCCEED)
>   		goto out_up_write;
>   	/* check if the pmd is still valid */
> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	vma_start_write(vma);
>   	anon_vma_lock_write(vma->anon_vma);
>   
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));

As I have mentioned before, since you are isolating the PTE table in 
both cases, you need to do the mmu_notifier_* stuff for HPAGE_PMD_SIZE 
in any case. Check out patch 8 from my patchset.
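
Concretely, just as a sketch (this simply restates the pre-patch PMD-only
range, using the names from the quoted code), the notifier range would stay
at the full PMD regardless of the collapse order:

	/* The PTE table for the whole PMD is detached and cleared below, so
	 * secondary MMUs must be invalidated for the full PMD range even
	 * when only a smaller mTHP is being collapsed. */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
				address + HPAGE_PMD_SIZE);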

>   	mmu_notifier_invalidate_range_start(&range);
>   
>   	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>   	/*
>   	 * This removes any huge TLB entry from the CPU so we won't allow
>   	 * huge and small TLB entries for the same virtual address to
> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	mmu_notifier_invalidate_range_end(&range);
>   	tlb_remove_table_sync_one();
>   
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>   	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>   		spin_unlock(pte_ptl);
>   	} else {
>   		result = SCAN_PMD_NULL;
> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	anon_vma_unlock_write(vma->anon_vma);
>   
>   	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>   	pte_unmap(pte);
>   	if (unlikely(result != SCAN_SUCCEED))
>   		goto out_up_write;
> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * write.
>   	 */
>   	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);

My personal opinion is that this if-else nesting looks really weird. 
This is why I separated out mTHP collapse into a separate function.
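Something along these lines, purely as a sketch with a hypothetical helper
name; the body just mirrors the quoted else-branch:

static void map_collapsed_mthp(struct mm_struct *mm, struct vm_area_struct *vma,
			       struct folio *folio, pmd_t *pmd, pmd_t _pmd,
			       pte_t *pte, spinlock_t *pmd_ptl,
			       spinlock_t *pte_ptl, unsigned long _address,
			       int order)
{
	pte_t mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);

	mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
	spin_lock(pmd_ptl);
	folio_ref_add(folio, (1 << order) - 1);
	folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	spin_lock(pte_ptl);
	set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
	update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
	spin_unlock(pte_ptl);
	smp_wmb(); /* make pte visible before pmd */
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	deferred_split_folio(folio, false);
	spin_unlock(pmd_ptl);
}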


> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { //mTHP
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);
> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		spin_lock(pte_ptl);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +		spin_unlock(pte_ptl);
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	}
>   
>   	folio = NULL;
>   
> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   {
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
> +	int i;
>   	int result = SCAN_FAIL, referenced = 0;
>   	int none_or_zero = 0, shared = 0;
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	unsigned long _address;
> +	unsigned long enabled_orders;
>   	spinlock_t *ptl;
>   	int node = NUMA_NO_NODE, unmapped = 0;
>   	bool writable = false;
> -
> +	int chunk_none_count = 0;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
>   	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>   	if (result != SCAN_SUCCEED)
>   		goto out;
>   
> +	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>   	nodes_clear(cc->alloc_nmask);
>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, _address += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		if (i % MIN_MTHP_NR == 0)
> +			chunk_none_count = 0;
> +
> +		_pte = pte + i;
> +		_address = address + i * PAGE_SIZE;
>   		pte_t pteval = ptep_get(_pte);
>   		if (is_swap_pte(pteval)) {
>   			++unmapped;
> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   			}
>   		}
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			++chunk_none_count;
>   			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (userfaultfd_armed(vma)) {
>   				result = SCAN_EXCEED_NONE_PTE;
>   				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>   				goto out_unmap;
>   			}
> +			continue;
>   		}
>   		if (pte_uffd_wp(pteval)) {
>   			/*
> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>   								     address)))
>   			referenced++;
> +
> +		/*
> +		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> +		 * pages keep track of it in the bitmap for mTHP collapsing.
> +		 */
> +		if (chunk_none_count < scaled_none &&
> +			(i + 1) % MIN_MTHP_NR == 0)
> +			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>   	}
> +
>   	if (!writable) {
>   		result = SCAN_PAGE_RO;
>   	} else if (cc->is_khugepaged &&
> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   out_unmap:
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			tva_flags, THP_ORDERS_ALL_ANON);
> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> +			       mmap_locked, enabled_orders);
> +		if (result > 0)
> +			result = SCAN_SUCCEED;
> +		else
> +			result = SCAN_FAIL;
>   	}
>   out:
>   	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>   			fput(file);
>   			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>   				mmap_read_lock(mm);
> +				*mmap_locked = true;
>   				if (khugepaged_test_exit_or_disable(mm))
>   					goto end;
>   				result = collapse_pte_mapped_thp(mm, addr,
>   								 !cc->is_khugepaged);
>   				mmap_read_unlock(mm);
> +				*mmap_locked = false;
>   			}
>   		} else {
>   			result = khugepaged_scan_pmd(mm, vma, addr,



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (10 preceding siblings ...)
  2025-02-17  6:39 ` Dev Jain
@ 2025-02-18 16:07 ` Ryan Roberts
  2025-02-18 22:30   ` Nico Pache
  2025-02-19 17:00 ` Ryan Roberts
  12 siblings, 1 reply; 55+ messages in thread
From: Ryan Roberts @ 2025-02-18 16:07 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its 
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and 

nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
This is making my head hurt, but I *think* I agree with you that if
max_ptes_none is less than half of the number of ptes in a pmd, then creep
doesn't happen.

To make sure I've understood:

 - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
 - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
 - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
 - ...

So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
have 4 present PTEs in a 32K area, which is insufficient to collapse to 32K.

Sounds good to me!
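
Spelling out the arithmetic I'm assuming (4K base pages, HPAGE_PMD_ORDER=9,
max_ptes_none=255, and "order" meaning the attempted collapse order), based on
the scaling used in the quoted scan code:

	/* scaled limit = max_ptes_none >> (HPAGE_PMD_ORDER - order)
	 * order 2 (16K): 255 >> 7 = 1  -> need >= 3 of 4 PTEs present
	 * order 3 (32K): 255 >> 6 = 3  -> need >= 5 of 8 PTEs present
	 * order 4 (64K): 255 >> 5 = 7  -> need >= 9 of 16 PTEs present
	 * i.e. always more than half, so a freshly collapsed region on its
	 * own never has enough present PTEs to qualify for the next size up.
	 */
	int scaled_max_ptes_none = khugepaged_max_ptes_none >>
				   (HPAGE_PMD_ORDER - order);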

> constantly promote mTHPs to the next available size.
> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>  Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>    monitoring a number of stats and tracepoints. The code is available 
>    here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>    The summary from my testings was that there was no significant regression
>    noticed through this test. In some cases my changes had better collapse
>    latencies, and was able to scan more pages in the same amount of time/work,
>    but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>   (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>    the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>    some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
>     madvise_collapse
>   khugepaged: rename hpage_collapse_* to khugepaged_*
>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   khugepaged: generalize alloc_charge_folio for mTHP support
>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>   khugepaged: add mTHP support
>   khugepaged: improve tracepoints for mTHP orders
>   khugepaged: skip collapsing mTHP to smaller orders
> 
>  include/linux/khugepaged.h         |   4 +
>  include/trace/events/huge_memory.h |  34 ++-
>  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>  3 files changed, 306 insertions(+), 154 deletions(-)
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-02-11  0:30 ` [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
  2025-02-17 17:11   ` Usama Arif
@ 2025-02-18 16:26   ` Ryan Roberts
  2025-02-18 22:24     ` Nico Pache
  1 sibling, 1 reply; 55+ messages in thread
From: Ryan Roberts @ 2025-02-18 16:26 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
> 
> Create khugepaged_collapse_single_pmd to increase code
> reuse and create an entry point for future khugepaged changes.
> 
> Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> the new khugepaged_collapse_single_pmd function.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
>  1 file changed, 50 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5f0be134141e..46faee67378b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2365,6 +2365,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>  
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
> +				   struct vm_area_struct *vma, bool *mmap_locked,
> +				   struct collapse_control *cc)

nit: given that the vma links to the mm, is it really necessary to pass both? Why
not just pass vma?

> +{
> +	int result = SCAN_FAIL;
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> +
> +	if (!*mmap_locked) {
> +		mmap_read_lock(mm);
> +		*mmap_locked = true;
> +	}

AFAICT, the read lock is always held when khugepaged_collapse_single_pmd() is
called. Perhaps VM_WARN_ON(!*mmap_locked) would be more appropriate?

> +
> +	if (thp_vma_allowable_order(vma, vma->vm_flags,
> +					tva_flags, PMD_ORDER)) {
> +		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {

I guess it was like this before, but what's the relevance of CONFIG_SHMEM?
Surely this should work for any file if CONFIG_READ_ONLY_THP_FOR_FS is enabled?

> +			struct file *file = get_file(vma->vm_file);
> +			pgoff_t pgoff = linear_page_index(vma, addr);
> +
> +			mmap_read_unlock(mm);
> +			*mmap_locked = false;
> +			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> +							  cc);
> +			fput(file);
> +			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +				mmap_read_lock(mm);
> +				if (hpage_collapse_test_exit_or_disable(mm))
> +					goto end;
> +				result = collapse_pte_mapped_thp(mm, addr,
> +								 !cc->is_khugepaged);
> +				mmap_read_unlock(mm);
> +			}
> +		} else {
> +			result = hpage_collapse_scan_pmd(mm, vma, addr,
> +							 mmap_locked, cc);
> +		}
> +		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> +			++khugepaged_pages_collapsed;

Looks like this counter was previously only incremented for the scan path, not
for the madvise_collapse path. Not sure if that's a problem?

Thanks,
Ryan

> +	}
> +end:
> +	return result;
> +}
> +
>  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  					    struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
> @@ -2439,33 +2485,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
>  				  khugepaged_scan.address + HPAGE_PMD_SIZE >
>  				  hend);
> -			if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> -				struct file *file = get_file(vma->vm_file);
> -				pgoff_t pgoff = linear_page_index(vma,
> -						khugepaged_scan.address);
>  
> -				mmap_read_unlock(mm);
> -				mmap_locked = false;
> -				*result = hpage_collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> -				fput(file);
> -				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> -					mmap_read_lock(mm);
> -					if (hpage_collapse_test_exit_or_disable(mm))
> -						goto breakouterloop;
> -					*result = collapse_pte_mapped_thp(mm,
> -						khugepaged_scan.address, false);
> -					if (*result == SCAN_PMD_MAPPED)
> -						*result = SCAN_SUCCEED;
> -					mmap_read_unlock(mm);
> -				}
> -			} else {
> -				*result = hpage_collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> -			}
> -
> -			if (*result == SCAN_SUCCEED)
> -				++khugepaged_pages_collapsed;
> +			*result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
> +						mm, vma, &mmap_locked, cc);
>  
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2785,36 +2807,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		mmap_assert_locked(mm);
>  		memset(cc->node_load, 0, sizeof(cc->node_load));
>  		nodes_clear(cc->alloc_nmask);
> -		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> -			struct file *file = get_file(vma->vm_file);
> -			pgoff_t pgoff = linear_page_index(vma, addr);
>  
> -			mmap_read_unlock(mm);
> -			mmap_locked = false;
> -			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> -							  cc);
> -			fput(file);
> -		} else {
> -			result = hpage_collapse_scan_pmd(mm, vma, addr,
> -							 &mmap_locked, cc);
> -		}
> +		result = khugepaged_collapse_single_pmd(addr, mm, vma, &mmap_locked, cc);
> +
>  		if (!mmap_locked)
>  			*prev = NULL;  /* Tell caller we dropped mmap_lock */
>  
> -handle_result:
>  		switch (result) {
>  		case SCAN_SUCCEED:
>  		case SCAN_PMD_MAPPED:
>  			++thps;
>  			break;
>  		case SCAN_PTE_MAPPED_HUGEPAGE:
> -			BUG_ON(mmap_locked);
> -			BUG_ON(*prev);
> -			mmap_read_lock(mm);
> -			result = collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_read_unlock(mm);
> -			goto handle_result;
> -		/* Whitelisted set of results where continuing OK */
>  		case SCAN_PMD_NULL:
>  		case SCAN_PTE_NON_PRESENT:
>  		case SCAN_PTE_UFFD_WP:



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 2/9] khugepaged: rename hpage_collapse_* to khugepaged_*
  2025-02-11  0:30 ` [RFC v2 2/9] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
@ 2025-02-18 16:29   ` Ryan Roberts
  0 siblings, 0 replies; 55+ messages in thread
From: Ryan Roberts @ 2025-02-18 16:29 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> functions in khugepaged.c use a mix of hpage_collapse and khugepaged
> as the function prefix.
> 
> rename all of them to khugepaged to keep things consistent and slightly
> shorten the function names.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/khugepaged.c | 52 ++++++++++++++++++++++++-------------------------
>  1 file changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 46faee67378b..4c88d17250f4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
>  	kmem_cache_destroy(mm_slot_cache);
>  }
>  
> -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> +static inline int khugepaged_test_exit(struct mm_struct *mm)
>  {
>  	return atomic_read(&mm->mm_users) == 0;
>  }
>  
> -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> +static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
>  {
> -	return hpage_collapse_test_exit(mm) ||
> +	return khugepaged_test_exit(mm) ||
>  	       test_bit(MMF_DISABLE_THP, &mm->flags);
>  }
>  
> @@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
>  	int wakeup;
>  
>  	/* __khugepaged_exit() must not run from under us */
> -	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> +	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
>  	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
>  		return;
>  
> @@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  	} else if (mm_slot) {
>  		/*
>  		 * This is required to serialize against
> -		 * hpage_collapse_test_exit() (which is guaranteed to run
> +		 * khugepaged_test_exit() (which is guaranteed to run
>  		 * under mmap sem read mode). Stop here (after we return all
>  		 * pagetables will be destroyed) until khugepaged has finished
>  		 * working on the pagetables under the mmap_lock.
> @@ -606,7 +606,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		folio = page_folio(page);
>  		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>  
> -		/* See hpage_collapse_scan_pmd(). */
> +		/* See khugepaged_scan_pmd(). */
>  		if (folio_likely_mapped_shared(folio)) {
>  			++shared;
>  			if (cc->is_khugepaged &&
> @@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
>  	.is_khugepaged = true,
>  };
>  
> -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> +static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
>  
> @@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>  }
>  
>  #ifdef CONFIG_NUMA
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
>  {
>  	int nid, target_node = 0, max_value = 0;
>  
> @@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
>  	return target_node;
>  }
>  #else
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
>  {
>  	return 0;
>  }
> @@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>  	struct vm_area_struct *vma;
>  	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>  
> -	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> +	if (unlikely(khugepaged_test_exit_or_disable(mm)))
>  		return SCAN_ANY_PROCESS;
>  
>  	*vmap = vma = find_vma(mm, address);
> @@ -992,7 +992,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>  
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
> - * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
>   *
>   * Called and returns without pte mapped or spinlocks held.
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> @@ -1078,7 +1078,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  {
>  	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>  		     GFP_TRANSHUGE);
> -	int node = hpage_collapse_find_target_node(cc);
> +	int node = khugepaged_find_target_node(cc);
>  	struct folio *folio;
>  
>  	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> @@ -1264,7 +1264,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	return result;
>  }
>  
> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> +static int khugepaged_scan_pmd(struct mm_struct *mm,
>  				   struct vm_area_struct *vma,
>  				   unsigned long address, bool *mmap_locked,
>  				   struct collapse_control *cc)
> @@ -1380,7 +1380,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		 * hit record.
>  		 */
>  		node = folio_nid(folio);
> -		if (hpage_collapse_scan_abort(node, cc)) {
> +		if (khugepaged_scan_abort(node, cc)) {
>  			result = SCAN_SCAN_ABORT;
>  			goto out_unmap;
>  		}
> @@ -1449,7 +1449,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>  
>  	lockdep_assert_held(&khugepaged_mm_lock);
>  
> -	if (hpage_collapse_test_exit(mm)) {
> +	if (khugepaged_test_exit(mm)) {
>  		/* free mm_slot */
>  		hash_del(&slot->hash);
>  		list_del(&slot->mm_node);
> @@ -1744,7 +1744,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
>  			continue;
>  
> -		if (hpage_collapse_test_exit(mm))
> +		if (khugepaged_test_exit(mm))
>  			continue;
>  		/*
>  		 * When a vma is registered with uffd-wp, we cannot recycle
> @@ -2266,7 +2266,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>  
> -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> +static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
>  				    struct file *file, pgoff_t start,
>  				    struct collapse_control *cc)
>  {
> @@ -2311,7 +2311,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  
>  		node = folio_nid(folio);
> -		if (hpage_collapse_scan_abort(node, cc)) {
> +		if (khugepaged_scan_abort(node, cc)) {
>  			result = SCAN_SCAN_ABORT;
>  			break;
>  		}
> @@ -2357,7 +2357,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>  #else
> -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> +static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
>  				    struct file *file, pgoff_t start,
>  				    struct collapse_control *cc)
>  {
> @@ -2389,19 +2389,19 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>  
>  			mmap_read_unlock(mm);
>  			*mmap_locked = false;
> -			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> +			result = khugepaged_scan_file(mm, addr, file, pgoff,
>  							  cc);
>  			fput(file);
>  			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>  				mmap_read_lock(mm);
> -				if (hpage_collapse_test_exit_or_disable(mm))
> +				if (khugepaged_test_exit_or_disable(mm))
>  					goto end;
>  				result = collapse_pte_mapped_thp(mm, addr,
>  								 !cc->is_khugepaged);
>  				mmap_read_unlock(mm);
>  			}
>  		} else {
> -			result = hpage_collapse_scan_pmd(mm, vma, addr,
> +			result = khugepaged_scan_pmd(mm, vma, addr,
>  							 mmap_locked, cc);
>  		}
>  		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> @@ -2449,7 +2449,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  		goto breakouterloop_mmap_lock;
>  
>  	progress++;
> -	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> +	if (unlikely(khugepaged_test_exit_or_disable(mm)))
>  		goto breakouterloop;
>  
>  	vma_iter_init(&vmi, mm, khugepaged_scan.address);
> @@ -2457,7 +2457,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  		unsigned long hstart, hend;
>  
>  		cond_resched();
> -		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> +		if (unlikely(khugepaged_test_exit_or_disable(mm))) {
>  			progress++;
>  			break;
>  		}
> @@ -2479,7 +2479,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  			bool mmap_locked = true;
>  
>  			cond_resched();
> -			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> +			if (unlikely(khugepaged_test_exit_or_disable(mm)))
>  				goto breakouterloop;
>  
>  			VM_BUG_ON(khugepaged_scan.address < hstart ||
> @@ -2515,7 +2515,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  	 * Release the current mm_slot if this mm is about to die, or
>  	 * if we scanned all vmas of this mm.
>  	 */
> -	if (hpage_collapse_test_exit(mm) || !vma) {
> +	if (khugepaged_test_exit(mm) || !vma) {
>  		/*
>  		 * Make sure that if mm_users is reaching zero while
>  		 * khugepaged runs here, khugepaged_exit will find




* Re: [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-02-18 16:26   ` Ryan Roberts
@ 2025-02-18 22:24     ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-18 22:24 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, dev.jain, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai

On Tue, Feb 18, 2025 at 9:26 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create khugepaged_collapse_single_pmd to increase code
> > reuse and create an entry point for future khugepaged changes.
> >
> > Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> > the new khugepaged_collapse_single_pmd function.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
> >  1 file changed, 50 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5f0be134141e..46faee67378b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2365,6 +2365,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >  }
> >  #endif
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
> > +                                struct vm_area_struct *vma, bool *mmap_locked,
> > +                                struct collapse_control *cc)
>
> nit: given the vma links to the mm is it really neccessary to pass both? Why not
> just pass vma?
Ah good point!
>
> > +{
> > +     int result = SCAN_FAIL;
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> > +
> > +     if (!*mmap_locked) {
> > +             mmap_read_lock(mm);
> > +             *mmap_locked = true;
> > +     }
>
> AFAICT, the read lock is always held when khugepaged_collapse_single_pmd() is
> called. Perhaps VM_WARN_ON(!*mmap_locked) would be more appropriate?

Hmm, I'm actually not sure why that's there. It's probably left over
from a previous change I was trying a while back.

>
> > +
> > +     if (thp_vma_allowable_order(vma, vma->vm_flags,
> > +                                     tva_flags, PMD_ORDER)) {
> > +             if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
>
> I guess it was like this before, but what's the relevance of CONFIG_SHMEM?
> Surely this should work for any file if CONFIG_READ_ONLY_THP_FOR_FS is enabled?

Yeah I think David brought up a similar point during one of our
meetings. I'm just combining the collapse users, so I'll leave it for
now.

>
> > +                     struct file *file = get_file(vma->vm_file);
> > +                     pgoff_t pgoff = linear_page_index(vma, addr);
> > +
> > +                     mmap_read_unlock(mm);
> > +                     *mmap_locked = false;
> > +                     result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > +                                                       cc);
> > +                     fput(file);
> > +                     if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > +                             mmap_read_lock(mm);
> > +                             if (hpage_collapse_test_exit_or_disable(mm))
> > +                                     goto end;
> > +                             result = collapse_pte_mapped_thp(mm, addr,
> > +                                                              !cc->is_khugepaged);
> > +                             mmap_read_unlock(mm);
> > +                     }
> > +             } else {
> > +                     result = hpage_collapse_scan_pmd(mm, vma, addr,
> > +                                                      mmap_locked, cc);
> > +             }
> > +             if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> > +                     ++khugepaged_pages_collapsed;
>
> Looks like this counter was previously only incremented for the scan path, not
> for the madvise_collapse path. Not sure if that's a problem?

Yep! Usama noted that too, already fixed in the next version :)

Thanks!
-- Nico

>
> Thanks,
> Ryan
>
> > +     }
> > +end:
> > +     return result;
> > +}
> > +
> >  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                           struct collapse_control *cc)
> >       __releases(&khugepaged_mm_lock)
> > @@ -2439,33 +2485,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                       VM_BUG_ON(khugepaged_scan.address < hstart ||
> >                                 khugepaged_scan.address + HPAGE_PMD_SIZE >
> >                                 hend);
> > -                     if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> > -                             struct file *file = get_file(vma->vm_file);
> > -                             pgoff_t pgoff = linear_page_index(vma,
> > -                                             khugepaged_scan.address);
> >
> > -                             mmap_read_unlock(mm);
> > -                             mmap_locked = false;
> > -                             *result = hpage_collapse_scan_file(mm,
> > -                                     khugepaged_scan.address, file, pgoff, cc);
> > -                             fput(file);
> > -                             if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > -                                     mmap_read_lock(mm);
> > -                                     if (hpage_collapse_test_exit_or_disable(mm))
> > -                                             goto breakouterloop;
> > -                                     *result = collapse_pte_mapped_thp(mm,
> > -                                             khugepaged_scan.address, false);
> > -                                     if (*result == SCAN_PMD_MAPPED)
> > -                                             *result = SCAN_SUCCEED;
> > -                                     mmap_read_unlock(mm);
> > -                             }
> > -                     } else {
> > -                             *result = hpage_collapse_scan_pmd(mm, vma,
> > -                                     khugepaged_scan.address, &mmap_locked, cc);
> > -                     }
> > -
> > -                     if (*result == SCAN_SUCCEED)
> > -                             ++khugepaged_pages_collapsed;
> > +                     *result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
> > +                                             mm, vma, &mmap_locked, cc);
> >
> >                       /* move to next address */
> >                       khugepaged_scan.address += HPAGE_PMD_SIZE;
> > @@ -2785,36 +2807,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >               mmap_assert_locked(mm);
> >               memset(cc->node_load, 0, sizeof(cc->node_load));
> >               nodes_clear(cc->alloc_nmask);
> > -             if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
> > -                     struct file *file = get_file(vma->vm_file);
> > -                     pgoff_t pgoff = linear_page_index(vma, addr);
> >
> > -                     mmap_read_unlock(mm);
> > -                     mmap_locked = false;
> > -                     result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > -                                                       cc);
> > -                     fput(file);
> > -             } else {
> > -                     result = hpage_collapse_scan_pmd(mm, vma, addr,
> > -                                                      &mmap_locked, cc);
> > -             }
> > +             result = khugepaged_collapse_single_pmd(addr, mm, vma, &mmap_locked, cc);
> > +
> >               if (!mmap_locked)
> >                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> >
> > -handle_result:
> >               switch (result) {
> >               case SCAN_SUCCEED:
> >               case SCAN_PMD_MAPPED:
> >                       ++thps;
> >                       break;
> >               case SCAN_PTE_MAPPED_HUGEPAGE:
> > -                     BUG_ON(mmap_locked);
> > -                     BUG_ON(*prev);
> > -                     mmap_read_lock(mm);
> > -                     result = collapse_pte_mapped_thp(mm, addr, true);
> > -                     mmap_read_unlock(mm);
> > -                     goto handle_result;
> > -             /* Whitelisted set of results where continuing OK */
> >               case SCAN_PMD_NULL:
> >               case SCAN_PTE_NON_PRESENT:
> >               case SCAN_PTE_UFFD_WP:
>




* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-18 16:07 ` Ryan Roberts
@ 2025-02-18 22:30   ` Nico Pache
  2025-02-19  9:01     ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-18 22:30 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, dev.jain, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai

On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If a mTHP collapse
> > is attempted, but contains swapped out, or shared pages, we dont perform the
> > collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>
> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
> This is making my head hurt, but I *think* I agree with you that if
> max_ptes_none is less than half of the number of ptes in a pmd, then creep
> doesn't happen.
Haha, yeah, the compressed bitmap does not make the math super easy to
follow, but I'm glad we arrived at the same conclusion :)
>
> To make sure I've understood;
>
>  - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>  - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>  - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>  - ...
>
> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>
> Sounds good to me!
Great! Another easy way to think about it: with max_ptes_none =
HPAGE_PMD_NR/2, a collapse will double the size, and we only need half of
the new region populated for it to collapse again. Each size is 2x the
last, so once we hit one collapse, the region becomes eligible again the
next round.
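
To put numbers on it (same scaling as your list above, assuming 4K pages so
HPAGE_PMD_ORDER = 9 and HPAGE_PMD_NR = 512):

    max_ptes_none = 255:
      order 2 (16K,  4 PTEs): 255 >> 7 = 1 none allowed -> need >= 3 present
      order 3 (32K,  8 PTEs): 255 >> 6 = 3 none allowed -> need >= 5 present
      order 4 (64K, 16 PTEs): 255 >> 5 = 7 none allowed -> need >= 9 present

    max_ptes_none = 256 (= HPAGE_PMD_NR/2):
      order 3 (32K,  8 PTEs): 256 >> 6 = 4 none allowed -> the 4 present PTEs
      left by a fresh 16K collapse are already enough, and the same holds at
      every order above, which is exactly the "creep".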
>
> > constantly promote mTHPs to the next available size.
> >
> > Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2:     Refactor/rename hpage_collapse
> > Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> > Patch 6-9:   The mTHP patches
> >
> > ---------
> >  Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >    monitoring a number of stats and tracepoints. The code is available
> >    here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >    The summary from my testings was that there was no significant regression
> >    noticed through this test. In some cases my changes had better collapse
> >    latencies, and was able to scan more pages in the same amount of time/work,
> >    but for the most part the results were consistant.
> > - redis testing. I tested these changes along with my defer changes
> >   (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >    the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >    some of the "creep" that was discovered in v1.
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >   introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >     madvise_collapse
> >   khugepaged: rename hpage_collapse_* to khugepaged_*
> >   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >   khugepaged: generalize alloc_charge_folio for mTHP support
> >   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >   khugepaged: add mTHP support
> >   khugepaged: improve tracepoints for mTHP orders
> >   khugepaged: skip collapsing mTHP to smaller orders
> >
> >  include/linux/khugepaged.h         |   4 +
> >  include/trace/events/huge_memory.h |  34 ++-
> >  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >  3 files changed, 306 insertions(+), 154 deletions(-)
> >
>




* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-18 22:30   ` Nico Pache
@ 2025-02-19  9:01     ` Dev Jain
  2025-02-20 19:12       ` Nico Pache
  0 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-02-19  9:01 UTC (permalink / raw)
  To: Nico Pache, Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai



On 19/02/25 4:00 am, Nico Pache wrote:
> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 11/02/2025 00:30, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>> collapse.
>>>
>>> With the default max_ptes_none=511, the code should keep its most of its
>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>
>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
>> This is making my head hurt, but I *think* I agree with you that if
>> max_ptes_none is less than half of the number of ptes in a pmd, then creep
>> doesn't happen.
> Haha yea the compressed bitmap does not make the math super easy to
> follow, but i'm glad we arrived at the same conclusion :)
>>
>> To make sure I've understood;
>>
>>   - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>>   - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>>   - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>>   - ...
>>
>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
>> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>>
>> Sounds good to me!
> Great! Another easy way to think about it is, with max_ptes_none =
> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
> for it to collapse again. Each size is 2x the last, so if we hit one
> collapse, it will be eligible again next round.

Please someone correct me if I am wrong.

Consider this: you are collapsing a 256K folio => #PTEs = 256K/4K = 64
=> #chunks = 64 / 8 = 8.

Let the PTE state within the chunks be as follows:

Chunk 0: < 5 filled    Chunk 1: 5 filled      Chunk 2: 5 filled      Chunk 3: 5 filled
Chunk 4: 5 filled      Chunk 5: < 5 filled    Chunk 6: < 5 filled    Chunk 7: < 5 filled

Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8, round down to 204,
which is < HPAGE_PMD_NR/2).
=> To collapse we need at least 60% of the PTEs filled.

Your algorithm marks a chunk in the bitmap if at least 60% of that chunk is
filled, and then collapses if the number of chunks set is greater than 60%.

Chunk 0 will be marked zero because less than 5 PTEs are filled
=> percentage filled <= 50%.

Right now the state is
0111 1000
where the indices are the chunk numbers.
Since #1s = 4 => percent filled = 4/8 * 100 = 50%, the 256K folio collapse
won't happen.

For the first 4 chunks, the percent filled is 75%. So the state becomes
1111 1000
after the 128K collapse, and now the 256K collapse will happen.

Either I got this correct, or I do not understand the utility of
maintaining chunks :) What you are doing is what I am doing, except that
my chunk size = 1.

>>
>>> constantly promote mTHPs to the next available size.
>>>
>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 2:     Refactor/rename hpage_collapse
>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>> Patch 6-9:   The mTHP patches
>>>
>>> ---------
>>>   Testing
>>> ---------
>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>> - selftests mm
>>> - I created a test script that I used to push khugepaged to its limits while
>>>     monitoring a number of stats and tracepoints. The code is available
>>>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>     The summary from my testings was that there was no significant regression
>>>     noticed through this test. In some cases my changes had better collapse
>>>     latencies, and was able to scan more pages in the same amount of time/work,
>>>     but for the most part the results were consistant.
>>> - redis testing. I tested these changes along with my defer changes
>>>    (see followup post for more details).
>>> - some basic testing on 64k page size.
>>> - lots of general use. These changes have been running in my VM for some time.
>>>
>>> Changes since V1 [2]:
>>> - Minor bug fixes discovered during review and testing
>>> - removed dynamic allocations for bitmaps, and made them stack based
>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>> - Updated trace events to include collapsing order info.
>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>     the same max_ptes_none scaling principle to achieve this.
>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>     some of the "creep" that was discovered in v1.
>>>
>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>
>>> Nico Pache (9):
>>>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>      madvise_collapse
>>>    khugepaged: rename hpage_collapse_* to khugepaged_*
>>>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>    khugepaged: generalize alloc_charge_folio for mTHP support
>>>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>    khugepaged: add mTHP support
>>>    khugepaged: improve tracepoints for mTHP orders
>>>    khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>   include/linux/khugepaged.h         |   4 +
>>>   include/trace/events/huge_memory.h |  34 ++-
>>>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>   3 files changed, 306 insertions(+), 154 deletions(-)
>>>
>>
> 
> 




* Re: [RFC v2 4/9] khugepaged: generalize alloc_charge_folio for mTHP support
  2025-02-11  0:30 ` [RFC v2 4/9] khugepaged: generalize alloc_charge_folio " Nico Pache
@ 2025-02-19 15:29   ` Ryan Roberts
  0 siblings, 0 replies; 55+ messages in thread
From: Ryan Roberts @ 2025-02-19 15:29 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> alloc_charge_folio allocates the new folio for the khugepaged collapse.
> Generalize the order of the folio allocations to support future mTHP
> collapsing.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c834ea842847..0cfcdc11cabd 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1074,14 +1074,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  }
>  
>  static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, int order)
>  {
>  	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>  		     GFP_TRANSHUGE);
>  	int node = khugepaged_find_target_node(cc);
>  	struct folio *folio;
>  
> -	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> +	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
>  	if (!folio) {
>  		*foliop = NULL;
>  		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);

Stats management is different for PMD-sized THP vs mTHP. All the PMD-sized THP
stats continue to be accumulated in /proc/meminfo (or whatever it's called).
Other THP sizes are not accounted there. All mTHP sizes (*including* PMD-sized)
should be accounted in
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/*. There is a file
for each stat.

We decided to do it this way for fear of breaking unenlightened user space that
only understands PMD-sized THP.

You can find the mTHP stats machinery at count_mthp_stat().
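
For illustration only, the failure path in alloc_charge_folio() might end up
looking something like the sketch below; MTHP_STAT_COLLAPSE_ALLOC_FAILED is a
made-up name here for a counter that would need to be added to the
count_mthp_stat() machinery:

	if (!folio) {
		*foliop = NULL;
		/* legacy event stays PMD-only so old tooling keeps working */
		if (order == HPAGE_PMD_ORDER)
			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
		/* hypothetical per-order counter via count_mthp_stat() */
		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	}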

> @@ -1125,7 +1125,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 */
>  	mmap_read_unlock(mm);
>  
> -	result = alloc_charge_folio(&folio, mm, cc);
> +	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
> @@ -1851,7 +1851,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>  	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>  
> -	result = alloc_charge_folio(&new_folio, mm, cc);
> +	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
>  




* Re: [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-02-11  0:30 ` [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* " Nico Pache
@ 2025-02-19 15:39   ` Ryan Roberts
  2025-02-19 16:02     ` Nico Pache
  0 siblings, 1 reply; 55+ messages in thread
From: Ryan Roberts @ 2025-02-19 15:39 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
> 
> mTHP collapse can suffer from inconsistent behavior, and memory waste
> "creep". Disable swapin and shared support for mTHP collapse.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 48 ++++++++++++++++++++++++++++--------------------
>  1 file changed, 28 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0cfcdc11cabd..3776055bd477 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long address,
>  					pte_t *pte,
>  					struct collapse_control *cc,
> -					struct list_head *compound_pagelist)
> +					struct list_head *compound_pagelist,
> +					u8 order)

nit: I think we are mostly standardised on order being int. Is there any reason
to make it u8 here?

>  {
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>  	bool writable = false;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>  	     _pte++, address += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			++none_or_zero;
>  			if (!userfaultfd_armed(vma) &&
>  			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			     none_or_zero <= scaled_none)) {
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		/* See khugepaged_scan_pmd(). */
>  		if (folio_likely_mapped_shared(folio)) {
>  			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared)) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);

Same comment about events; I think you will want to be careful to only count
events for PMD-sized THP using count_vm_event() and introduce equivalent MTHP
events to cover all sizes.

>  				goto out;
> @@ -711,14 +713,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  						struct vm_area_struct *vma,
>  						unsigned long address,
>  						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +						struct list_head *compound_pagelist,
> +						u8 order)
>  {
>  	struct folio *src, *tmp;
>  	pte_t *_pte;
>  	pte_t pteval;
>  
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	for (_pte = pte; _pte < pte + (1 << order);
> +		_pte++, address += PAGE_SIZE) {

nit: you changed the indentation here.

>  		pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  					     pmd_t *pmd,
>  					     pmd_t orig_pmd,
>  					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +					     struct list_head *compound_pagelist,
> +					     u8 order)
>  {
>  	spinlock_t *pmd_ptl;
>  
> @@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
>  }
>  
>  /*
> @@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>  		unsigned long address, spinlock_t *ptl,
> -		struct list_head *compound_pagelist)
> +		struct list_head *compound_pagelist, u8 order)
>  {
>  	unsigned int i;
>  	int result = SCAN_SUCCEED;
> @@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < (1 << order); i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    compound_pagelist, order);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 compound_pagelist, order);
>  
>  	return result;
>  }
> @@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>  static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  				       struct vm_area_struct *vma,
>  				       unsigned long haddr, pmd_t *pmd,
> -				       int referenced)
> +				       int referenced, u8 order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long address, end = haddr + (PAGE_SIZE << order);
>  	int result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  		if (!is_swap_pte(vmf.orig_pte))
>  			continue;
>  
> +		if (order != HPAGE_PMD_ORDER) {
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}

A comment to explain the rationale for this divergent behaviour based on order
would be helpful.
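
For example, something along these lines (wording is only a suggestion,
based on the rationale in the cover letter):

		/*
		 * Swap-in is only done for PMD-order collapse. For mTHP
		 * orders we bail out instead: handling swapped-out PTEs
		 * here makes the behaviour harder to reason about and
		 * contributes to the memory-waste "creep".
		 */
		if (order != HPAGE_PMD_ORDER) {
			result = SCAN_EXCEED_SWAP_PTE;
			goto out;
		}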

> +
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
>  		ret = do_swap_page(&vmf);
> @@ -1114,7 +1123,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> -

nit: no need for this whitespace change?

>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	/*
> @@ -1149,7 +1157,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
>  		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced);
> +				referenced, HPAGE_PMD_ORDER);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1196,7 +1204,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      &compound_pagelist);
> +					&compound_pagelist, HPAGE_PMD_ORDER);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
> @@ -1226,7 +1234,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>  					   vma, address, pte_ptl,
> -					   &compound_pagelist);
> +					   &compound_pagelist, HPAGE_PMD_ORDER);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;




* Re: [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-02-19 15:39   ` Ryan Roberts
@ 2025-02-19 16:02     ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-19 16:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, dev.jain, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai

On Wed, Feb 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > generalize the order of the __collapse_huge_page_* functions
> > to support future mTHP collapse.
> >
> > mTHP collapse can suffer from inconsistent behavior, and memory waste
> > "creep". Disable swapin and shared support for mTHP collapse.
> >
> > No functional changes in this patch.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 48 ++++++++++++++++++++++++++++--------------------
> >  1 file changed, 28 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 0cfcdc11cabd..3776055bd477 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                                       unsigned long address,
> >                                       pte_t *pte,
> >                                       struct collapse_control *cc,
> > -                                     struct list_head *compound_pagelist)
> > +                                     struct list_head *compound_pagelist,
> > +                                     u8 order)
>
> nit: I think we are mostly standardised on order being int. Is there any reason
> to make it u8 here?

The reasoning was that I didn't want to consume a lot of memory for the
mthp_bitmap_stack.
Originally the order and offset were both u8, but I had to convert the
offset to u16 to fit the max offset on 64k kernels.
So 64 * (8+16) bits = 192 bytes, as opposed to 1024 bytes if they were ints.

Not sure if using u8/u16 here is frowned upon. Let me know if I need to
convert these back to int or if they can stay!


>
> >  {
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> >       bool writable = false;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >
> > -     for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > +     for (_pte = pte; _pte < pte + (1 << order);
> >            _pte++, address += PAGE_SIZE) {
> >               pte_t pteval = ptep_get(_pte);
> >               if (pte_none(pteval) || (pte_present(pteval) &&
> > @@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                       ++none_or_zero;
> >                       if (!userfaultfd_armed(vma) &&
> >                           (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > +                          none_or_zero <= scaled_none)) {
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               /* See khugepaged_scan_pmd(). */
> >               if (folio_likely_mapped_shared(folio)) {
> >                       ++shared;
> > -                     if (cc->is_khugepaged &&
> > -                         shared > khugepaged_max_ptes_shared) {
> > +                     if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > +                         shared > khugepaged_max_ptes_shared)) {
> >                               result = SCAN_EXCEED_SHARED_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>
> Same comment about events; I think you will want to be careful to only count
> events for PMD-sized THP using count_vm_event() and introduce equivalent MTHP
> events to cover all sizes.

Makes sense, I'll work on adding the new counters for
THP_SCAN_EXCEED_(SWAP_PTE|NONE_PTE|SHARED_PTE). Thanks!

>
> >                               goto out;
> > @@ -711,14 +713,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >                                               struct vm_area_struct *vma,
> >                                               unsigned long address,
> >                                               spinlock_t *ptl,
> > -                                             struct list_head *compound_pagelist)
> > +                                             struct list_head *compound_pagelist,
> > +                                             u8 order)
> >  {
> >       struct folio *src, *tmp;
> >       pte_t *_pte;
> >       pte_t pteval;
> >
> > -     for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, address += PAGE_SIZE) {
> > +     for (_pte = pte; _pte < pte + (1 << order);
> > +             _pte++, address += PAGE_SIZE) {
>
> nit: you changed the indentation here.
>
> >               pteval = ptep_get(_pte);
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> >                       add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > @@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >                                            pmd_t *pmd,
> >                                            pmd_t orig_pmd,
> >                                            struct vm_area_struct *vma,
> > -                                          struct list_head *compound_pagelist)
> > +                                          struct list_head *compound_pagelist,
> > +                                          u8 order)
> >  {
> >       spinlock_t *pmd_ptl;
> >
> > @@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >        * Release both raw and compound pages isolated
> >        * in __collapse_huge_page_isolate.
> >        */
> > -     release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> > +     release_pte_pages(pte, pte + (1 << order), compound_pagelist);
> >  }
> >
> >  /*
> > @@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >  static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >               pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >               unsigned long address, spinlock_t *ptl,
> > -             struct list_head *compound_pagelist)
> > +             struct list_head *compound_pagelist, u8 order)
> >  {
> >       unsigned int i;
> >       int result = SCAN_SUCCEED;
> > @@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >       /*
> >        * Copying pages' contents is subject to memory poison at any iteration.
> >        */
> > -     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +     for (i = 0; i < (1 << order); i++) {
> >               pte_t pteval = ptep_get(pte + i);
> >               struct page *page = folio_page(folio, i);
> >               unsigned long src_addr = address + i * PAGE_SIZE;
> > @@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >
> >       if (likely(result == SCAN_SUCCEED))
> >               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> > -                                                 compound_pagelist);
> > +                                                 compound_pagelist, order);
> >       else
> >               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> > -                                              compound_pagelist);
> > +                                              compound_pagelist, order);
> >
> >       return result;
> >  }
> > @@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> >  static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >                                      struct vm_area_struct *vma,
> >                                      unsigned long haddr, pmd_t *pmd,
> > -                                    int referenced)
> > +                                    int referenced, u8 order)
> >  {
> >       int swapped_in = 0;
> >       vm_fault_t ret = 0;
> > -     unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> > +     unsigned long address, end = haddr + (PAGE_SIZE << order);
> >       int result;
> >       pte_t *pte = NULL;
> >       spinlock_t *ptl;
> > @@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >               if (!is_swap_pte(vmf.orig_pte))
> >                       continue;
> >
> > +             if (order != HPAGE_PMD_ORDER) {
> > +                     result = SCAN_EXCEED_SWAP_PTE;
> > +                     goto out;
> > +             }
>
> A comment to explain the rationale for this divergent behaviour based on order
> would be helpful.
>
> > +
> >               vmf.pte = pte;
> >               vmf.ptl = ptl;
> >               ret = do_swap_page(&vmf);
> > @@ -1114,7 +1123,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > -
>
> nit: no need for this whitespace change?

Thanks! I'll clean up the nits and add a comment to the swapin function
to describe skipping mTHP swapin.
>
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /*
> > @@ -1149,7 +1157,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> >               result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                                                  referenced);
> > +                             referenced, HPAGE_PMD_ORDER);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> > @@ -1196,7 +1204,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> >       if (pte) {
> >               result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                   &compound_pagelist);
> > +                                     &compound_pagelist, HPAGE_PMD_ORDER);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> > @@ -1226,7 +1234,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> >                                          vma, address, pte_ptl,
> > -                                        &compound_pagelist);
> > +                                        &compound_pagelist, HPAGE_PMD_ORDER);
> >       pte_unmap(pte);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
>




* Re: [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-02-11  0:30 ` [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
  2025-02-17  7:27   ` Dev Jain
  2025-02-17 19:12   ` Usama Arif
@ 2025-02-19 16:28   ` Ryan Roberts
  2025-02-20 18:48     ` Nico Pache
  2 siblings, 1 reply; 55+ messages in thread
From: Ryan Roberts @ 2025-02-19 16:28 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
> 
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> by default we will set this to order 3. The reasoning is that for 4K 512

I'm still a bit confused by this (hopefully to be resolved as I'm about to read
the code): does this imply that the smallest order you can collapse to is order
3? Because that would be different from the fault handler. For anon memory we
can support order-2 and above. I believe that these days files can support order-1.

There is a case for wanting to support order-2 for arm64. We have a (not yet
well deployed) technology called Hardware Page Aggregation (HPA) which can
automatically (transparent to SW) aggregate (usually) 4 contiguous pages into a
single TLB entry. I'd like the solution to be compatible with that.

> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> If an order is set to "always", let's always collapse to that order in a
> greedy manner.

This is not the policy that the fault handler uses, and I think we should use
the same policy in both places.

The fault handler gets a list of orders that are permitted for the VMA, then
prefers the highest orders in that list.

I don't think we should be preferring a smaller "always" order over a larger
"madvise" order if MADV_HUGEPAGE is set for the VMA (if that's what your
statement was suggesting).
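
For reference, the anonymous fault path effectively does the below
(hand-paraphrased from alloc_anon_folio() with the allocation details
elided; treat it as a sketch, where addr stands for the faulting address):

	unsigned long orders;
	int order;

	/* all orders currently allowed for this VMA by sysfs/prctl/madvise */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
					  TVA_IN_PF | TVA_ENFORCE_SYSFS,
					  BIT(PMD_ORDER) - 1);
	/* drop orders that don't fit or align within the VMA */
	orders = thp_vma_suitable_orders(vma, addr, orders);

	/* walk from the highest enabled order downwards */
	for (order = highest_order(orders); orders;
	     order = next_order(&orders, order)) {
		/*
		 * Try to allocate and map a folio of this order; stop on
		 * success, otherwise fall back to the next smaller enabled
		 * order.
		 */
	}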

> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/khugepaged.h |  4 ++
>  mm/khugepaged.c            | 89 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 86 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..1fe0c4fc9d37 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
> +#define MIN_MTHP_ORDER	3
> +#define MIN_MTHP_NR	(1<<MIN_MTHP_ORDER)
> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) - MIN_MTHP_ORDER))

I don't think you want "* PAGE_SIZE" here? I think MAX_MTHP_BITMAP_SIZE wants to
specify the maximum number of groups of MIN_MTHP_NR pte entries in a PTE table?

In that case, MAX_PTRS_PER_PTE will be 512 on x86_64. Your current formula will
give 262144 (which is 32KB!). I think you just need:

#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - MIN_MTHP_ORDER))
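
(With 4K pages MAX_PTRS_PER_PTE is 512, so this evaluates to
1 << (9 - 3) = 64, i.e. the 64-bit-per-PMD bitmap the commit message
describes: one bit per MIN_MTHP_NR (8) PTE chunk.)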

> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))

Perhaps all of these macros need a KHUGEPAGED_ prefix? Otherwise MIN_MTHP_ORDER,
especially, is misleading. The min MTHP order is not 3.
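
i.e. something like the below (names purely illustrative):

	#define KHUGEPAGED_MIN_MTHP_ORDER	3
	#define KHUGEPAGED_MIN_MTHP_NR		(1 << KHUGEPAGED_MIN_MTHP_ORDER)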

>  
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3776055bd477..c8048d9ec7fb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>  
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> @@ -102,6 +107,15 @@ struct collapse_control {
>  
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>  };
>  
>  /**
> @@ -851,10 +865,6 @@ static void khugepaged_alloc_sleep(void)
>  	remove_wait_queue(&khugepaged_wait, &wait);
>  }
>  
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
> @@ -1112,7 +1122,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  
>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  u8 order, u16 offset)
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> @@ -1130,8 +1141,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>  	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>  
>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
> @@ -1266,12 +1281,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  out_up_write:
>  	mmap_write_unlock(mm);
>  out_nolock:
> +	*mmap_locked = false;
>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>  	return result;
>  }
>  
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled
> +		if (!test_bit(order, &enabled_orders))
> +			goto next;
> +
> > +		// copy the relevant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> +				>> (HPAGE_PMD_ORDER - state.order);
> +
> +		//Check if the region is "almost full" based on the threshold
> +		if (bits_set > threshold_bits
> +			|| test_bit(order, &huge_anon_orders_always)) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order, offset * MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += (1 << order);
> +				continue;
> +			}
> +		}
> +
> +next:
> +		if (state.order > 0) {
> +			next_order = state.order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}

I'm struggling to understand the details of this function. I'll come back to it
when I have more time.

> +
>  static int khugepaged_scan_pmd(struct mm_struct *mm,
>  				   struct vm_area_struct *vma,
>  				   unsigned long address, bool *mmap_locked,
> @@ -1440,7 +1514,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*mmap_locked = false;

Given that collapse_huge_page() now takes mmap_locked and sets it to false on
return, I don't think we need this line here any longer?

>  	}
> @@ -2856,6 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	mmdrop(mm);
>  	kfree(cc);
>  
> +
>  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>  			: madvise_collapse_errno(last_fail);
>  }



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
                     ` (2 preceding siblings ...)
  2025-02-18  4:22   ` Dev Jain
@ 2025-02-19 16:52   ` Ryan Roberts
  2025-03-03 19:13     ` Nico Pache
  2025-03-05  9:07     ` Dev Jain
  2025-03-07  6:38   ` Dev Jain
  4 siblings, 2 replies; 55+ messages in thread
From: Ryan Roberts @ 2025-02-19 16:52 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential collapse candidates, keep track
> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> of max_ptes_none during the scan phase so we dont bailout early and miss
> potential mTHP candidates.
> 
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
> 
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 83 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c8048d9ec7fb..cd310989725b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;
>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	/*
> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		*mmap_locked = false;
>  	}
>  
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
>  
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>  	/*
>  	 * Prevent all access to pagetables with the exception of
>  	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	vma_start_write(vma);
>  	anon_vma_lock_write(vma->anon_vma);
>  
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>  	/*
>  	 * This removes any huge TLB entry from the CPU so we won't allow
>  	 * huge and small TLB entries for the same virtual address to
> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>  
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	anon_vma_unlock_write(vma->anon_vma);
>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { //mTHP
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);
> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		spin_lock(pte_ptl);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +		spin_unlock(pte_ptl);
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);

I've only stared at this briefly, but it feels like there might be some bugs:

 - Why are you taking the pmd ptl and calling pmd_populate()? Surely the pte
table already exists and is attached to the pmd? So we only need to update
the pte entries here? Or perhaps the whole pmd was previously isolated?

 - I think some arches use a single PTL for all levels of the pgtable? So in
this case it's probably not a good idea to nest the pmd and pte spin lock?

 - Given that the pte PTL is dropped then reacquired, is there any way that the
ptes could have changed under us? Is any revalidation required? Perhaps not if
the pte table was removed from the PMD.

 - I would have guessed the memory ordering you want from smp_wmb() would
already be handled by the spin_unlock()?
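
For reference, the guard used elsewhere in mm/khugepaged.c for the case where
the pte and pmd levels may share a single lock looks roughly like the below
(written from memory as a sketch, so treat the exact form as an assumption):

	spinlock_t *pml, *ptl;

	pml = pmd_lock(mm, pmd);
	ptl = pte_lockptr(mm, pmd);
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

	/* ... modify the pte entries ... */

	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);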


> +	}
>  
>  	folio = NULL;
>  
> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> +	int i;
>  	int result = SCAN_FAIL, referenced = 0;
>  	int none_or_zero = 0, shared = 0;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long _address;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>  	bool writable = false;
> -
> +	int chunk_none_count = 0;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
>  
> +	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, _address += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		if (i % MIN_MTHP_NR == 0)
> +			chunk_none_count = 0;
> +
> +		_pte = pte + i;
> +		_address = address + i * PAGE_SIZE;
>  		pte_t pteval = ptep_get(_pte);
>  		if (is_swap_pte(pteval)) {
>  			++unmapped;
> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			++chunk_none_count;
>  			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (userfaultfd_armed(vma)) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out_unmap;
>  			}
> +			continue;
>  		}
>  		if (pte_uffd_wp(pteval)) {
>  			/*
> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>  								     address)))
>  			referenced++;
> +
> +		/*
> +		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> +		 * pages keep track of it in the bitmap for mTHP collapsing.
> +		 */
> +		if (chunk_none_count < scaled_none &&
> +			(i + 1) % MIN_MTHP_NR == 0)
> +			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>  	}
> +
>  	if (!writable) {
>  		result = SCAN_PAGE_RO;
>  	} else if (cc->is_khugepaged &&
> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			tva_flags, THP_ORDERS_ALL_ANON);
> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> +			       mmap_locked, enabled_orders);
> +		if (result > 0)
> +			result = SCAN_SUCCEED;
> +		else
> +			result = SCAN_FAIL;
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>  			fput(file);
>  			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>  				mmap_read_lock(mm);
> +				*mmap_locked = true;
>  				if (khugepaged_test_exit_or_disable(mm))
>  					goto end;
>  				result = collapse_pte_mapped_thp(mm, addr,
>  								 !cc->is_khugepaged);
>  				mmap_read_unlock(mm);
> +				*mmap_locked = false;
>  			}
>  		} else {
>  			result = khugepaged_scan_pmd(mm, vma, addr,



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 9/9] khugepaged: skip collapsing mTHP to smaller orders
  2025-02-11  0:30 ` [RFC v2 9/9] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-02-19 16:57   ` Ryan Roberts
  0 siblings, 0 replies; 55+ messages in thread
From: Ryan Roberts @ 2025-02-19 16:57 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. Skip these cases until we have a way to check
> if its ok to collapse to a smaller mTHP size (like in the case of a
> partially mapped folio).
> 
> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
> 
> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e2ba18e57064..fc30698b8e6e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -622,6 +622,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		folio = page_folio(page);
>  		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>  
> +		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> +			result = SCAN_PTE_MAPPED_HUGEPAGE;
> +			goto out;
> +		}

One of the key areas where we want to benefit from khugepaged collapsing to mTHP
is when a COW event happens. If the original folio was large, then it becomes
partially mapped due to COW and we want to collapse it again. I think this will
prevent that?

I made some fairly detailed suggestions for what I think is the right approach
in the context of Dev's series. It would be good to get your thoughts on that:

https://lore.kernel.org/lkml/aa647830-cf55-48f0-98c2-8230796e35b3@arm.com/

Thanks,
Ryan

> +
>  		/* See khugepaged_scan_pmd(). */
>  		if (folio_likely_mapped_shared(folio)) {
>  			++shared;



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
                   ` (11 preceding siblings ...)
  2025-02-18 16:07 ` Ryan Roberts
@ 2025-02-19 17:00 ` Ryan Roberts
  12 siblings, 0 replies; 55+ messages in thread
From: Ryan Roberts @ 2025-02-19 17:00 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai

On 11/02/2025 00:30, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. 

Having been through the code, I don't think you can handle the case of a
partially covered PMD? That was never important before because by definition we
only cared about PMD-sized chunks, but it would be good to be able to collapse
memory within a VMA that doesn't cover an entire PMD to an appropriately sized
mTHP. Do you have plans to add that? Is the approach you've taken amenable to
extension in this way?

> This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its 
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and 
> constantly promote mTHPs to the next available size.
> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>  Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>    monitoring a number of stats and tracepoints. The code is available 
>    here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>    The summary from my testings was that there was no significant regression
>    noticed through this test. In some cases my changes had better collapse
>    latencies, and was able to scan more pages in the same amount of time/work,
>    but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>   (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>    the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>    some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
>     madvise_collapse
>   khugepaged: rename hpage_collapse_* to khugepaged_*
>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   khugepaged: generalize alloc_charge_folio for mTHP support
>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>   khugepaged: add mTHP support
>   khugepaged: improve tracepoints for mTHP orders
>   khugepaged: skip collapsing mTHP to smaller orders
> 
>  include/linux/khugepaged.h         |   4 +
>  include/trace/events/huge_memory.h |  34 ++-
>  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>  3 files changed, 306 insertions(+), 154 deletions(-)
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-02-19 16:28   ` Ryan Roberts
@ 2025-02-20 18:48     ` Nico Pache
  0 siblings, 0 replies; 55+ messages in thread
From: Nico Pache @ 2025-02-20 18:48 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, dev.jain, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai

On Wed, Feb 19, 2025 at 9:28 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> > mTHP support we use this scan to instead record chunks of fully utilized
> > sections of the PMD.
> >
> > create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> > by default we will set this to order 3. The reasoning is that for 4K 512
>
> I'm still a bit confused by this (hopefully to be resolved as I'm about to read
> the code); Does this imply that the smallest order you can collapse to is order
> 3? Because that would be different from the fault handler. For anon memory we
> can support order-2 and above. I believe that these days files can support order-1.

Yes, it may have been a premature optimization. I will test with
MTHP_MIN_ORDER=2 and compare!

>
> There is a case for wanting to support order-2 for arm64. We have a (not yet
> well deployed) technology called Hardware Page Aggregation (HPA) which can
> automatically (transparent to SW) aggregate (usually) 4 contiguous pages into a
> single TLB. I'd like the solution to be compatible with that.

Sounds reasonable, especially if the hardware support is going to give
that size a huge boost.

>
> > PMD size this results in a 64 bit bitmap which has some optimizations.
> > For other arches like ARM64 64K, we can set a larger order if needed.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the PMD.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > If a order is set to "always" lets always collapse to that order in a
> > greedy manner.
>
> This is not the policy that the fault handler uses, and I think we should use
> the same policy in both places.
>
> The fault handler gets a list of orders that are permitted for the VMA, then
> prefers the highest orders in that list.
>
> I don't think we should be preferring a smaller "always" order over a larger
> "madvise" order if MADV_HUGEPAGE is set for the VMA (if that's what your
> statement was suggesting).

It does start at the highest order. All this means that if you have
PMD=always
1024kB=madvise

the PMD collapse will always happen (if we don't want this behavior, let me know!)

>
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  include/linux/khugepaged.h |  4 ++
> >  mm/khugepaged.c            | 89 +++++++++++++++++++++++++++++++++++---
> >  2 files changed, 86 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 1f46046080f5..1fe0c4fc9d37 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> > +#define MIN_MTHP_ORDER       3
> > +#define MIN_MTHP_NR  (1<<MIN_MTHP_ORDER)
> > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) - MIN_MTHP_ORDER))
>
> I don't think you want "* PAGE_SIZE" here? I think MAX_MTHP_BITMAP_SIZE wants to
> specify the maximum number of groups of MIN_MTHP_NR pte entries in a PTE table?
>
> In that case, MAX_PTRS_PER_PTE will be 512 on x86_64. Your current formula will
> give 262144 (which is 32KB!). I think you just need:

Yes, that is correct! Thanks for pointing that out. The bitmap size is
supposed to be 64, not 262144! That should save some memory :P
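
For anyone following along, the arithmetic (assuming 4K pages, so
MAX_PTRS_PER_PTE = 512 and PAGE_SIZE = 4096):

  with "* PAGE_SIZE":  1 << (ilog2(512 * 4096) - 3) = 1 << (21 - 3) = 262144
  without:             1 << (ilog2(512) - 3)        = 1 << (9 - 3)  = 64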

>
> #define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - MIN_MTHP_ORDER))
>
> > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>
> Perhaps all of these macros need a KHUGEPAGED_ prefix? Otherwise MIN_MTHP_ORDER,
> especially, is misleading. The min MTHP order is not 3.


I will add the prefixes, thanks!

>
> >
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 3776055bd477..c8048d9ec7fb 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +     u8 order;
> > +     u16 offset;
> > +};
> > +
> >  struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -102,6 +107,15 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +     .is_khugepaged = true,
> >  };
> >
> >  /**
> > @@ -851,10 +865,6 @@ static void khugepaged_alloc_sleep(void)
> >       remove_wait_queue(&khugepaged_wait, &wait);
> >  }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -     .is_khugepaged = true,
> > -};
> > -
> >  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >  {
> >       int i;
> > @@ -1112,7 +1122,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                             int referenced, int unmapped,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, bool *mmap_locked,
> > +                               u8 order, u16 offset)
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > @@ -1130,8 +1141,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * The allocation can take potentially a long time if it involves
> >        * sync compaction, and we do not need to hold the mmap_lock during
> >        * that. We will recheck the vma after taking it again in write mode.
> > +      * If collapsing mTHPs we may have already released the read_lock.
> >        */
> > -     mmap_read_unlock(mm);
> > +     if (*mmap_locked) {
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> > +     }
> >
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> > @@ -1266,12 +1281,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  out_up_write:
> >       mmap_write_unlock(mm);
> >  out_nolock:
> > +     *mmap_locked = false;
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >       return result;
> >  }
> >
> > +// Recursive function to consume the bitmap
> > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +                     int referenced, int unmapped, struct collapse_control *cc,
> > +                     bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +     u8 order, next_order;
> > +     u16 offset, mid_offset;
> > +     int num_chunks;
> > +     int bits_set, threshold_bits;
> > +     int top = -1;
> > +     int collapsed = 0;
> > +     int ret;
> > +     struct scan_bit_state state;
> > +
> > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +             { HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> > +
> > +     while (top >= 0) {
> > +             state = cc->mthp_bitmap_stack[top--];
> > +             order = state.order + MIN_MTHP_ORDER;
> > +             offset = state.offset;
> > +             num_chunks = 1 << (state.order);
> > +             // Skip mTHP orders that are not enabled
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next;
> > +
> > +             // copy the relavant section to a new bitmap
> > +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > +                               MTHP_BITMAP_SIZE);
> > +
> > +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> > +                             >> (HPAGE_PMD_ORDER - state.order);
> > +
> > +             //Check if the region is "almost full" based on the threshold
> > +             if (bits_set > threshold_bits
> > +                     || test_bit(order, &huge_anon_orders_always)) {
> > +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +                                     mmap_locked, order, offset * MIN_MTHP_NR);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += (1 << order);
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next:
> > +             if (state.order > 0) {
> > +                     next_order = state.order - 1;
> > +                     mid_offset = offset + (num_chunks / 2);
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, mid_offset };
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, offset };
> > +                     }
> > +     }
> > +     return collapsed;
> > +}
>
> I'm struggling to understand the details of this function. I'll come back to it
> when I have more time.
Hopefully the presentation helped a little. This is a recursive
function that uses an explicit stack instead of function calls. This was
done to remove the recursion and to make the result handling much easier.
The basic idea is to start at the PMD order and work your way down until
you find a region whose bitmap satisfies the conditions for collapse. If a
region doesn't collapse, two new collapse attempts are pushed onto the
stack (order - 1, for the left and right halves). See the sketch below.
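
A standalone userspace sketch of just the traversal (my own illustration, not
kernel code; it assumes 4K pages so HPAGE_PMD_ORDER = 9, and stubs the collapse
decision to always fail so the full descent order is printed):

#include <stdio.h>

#define HPAGE_PMD_ORDER		9
#define MIN_MTHP_ORDER		3
#define MTHP_BITMAP_SIZE	(1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))

struct scan_bit_state {
	unsigned char order;	/* order relative to MIN_MTHP_ORDER */
	unsigned short offset;	/* offset into the bitmap, in chunks */
};

int main(void)
{
	struct scan_bit_state stack[MTHP_BITMAP_SIZE];
	int top = -1;

	/* Start with one region covering the whole PMD. */
	stack[++top] = (struct scan_bit_state)
		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };

	while (top >= 0) {
		struct scan_bit_state state = stack[top--];
		int num_chunks = 1 << state.order;

		printf("try order %d at pte offset %d\n",
		       state.order + MIN_MTHP_ORDER,
		       state.offset * (1 << MIN_MTHP_ORDER));

		/* Collapse "failed": subdivide into left and right halves. */
		if (state.order > 0) {
			unsigned char next_order = state.order - 1;
			unsigned short mid = state.offset + num_chunks / 2;

			stack[++top] = (struct scan_bit_state){ next_order, mid };
			stack[++top] = (struct scan_bit_state){ next_order, state.offset };
		}
	}
	return 0;
}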

>
> > +
> >  static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                                  struct vm_area_struct *vma,
> >                                  unsigned long address, bool *mmap_locked,
> > @@ -1440,7 +1514,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc);
> > +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >               /* collapse_huge_page will return with the mmap_lock released */
> >               *mmap_locked = false;
>
> Given that collapse_huge_page() now takes mmap_locked and sets it to false on
> return, I don't think we need this line here any longer?

I think that is correct! Thanks

>
> >       }
> > @@ -2856,6 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >       mmdrop(mm);
> >       kfree(cc);
> >
> > +
> >       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> >                       : madvise_collapse_errno(last_fail);
> >  }
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-19  9:01     ` Dev Jain
@ 2025-02-20 19:12       ` Nico Pache
  2025-02-21  4:57         ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-02-20 19:12 UTC (permalink / raw)
  To: Dev Jain
  Cc: Ryan Roberts, linux-kernel, linux-trace-kernel, linux-mm,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Wed, Feb 19, 2025 at 2:01 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 19/02/25 4:00 am, Nico Pache wrote:
> > On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 11/02/2025 00:30, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>> order to determine how full a THP must be to be eligible. If a mTHP collapse
> >>> is attempted, but contains swapped out, or shared pages, we dont perform the
> >>> collapse.
> >>>
> >>> With the default max_ptes_none=511, the code should keep its most of its
> >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>
> >> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
> >> This is making my head hurt, but I *think* I agree with you that if
> >> max_ptes_none is less than half of the number of ptes in a pmd, then creep
> >> doesn't happen.
> > Haha yea the compressed bitmap does not make the math super easy to
> > follow, but i'm glad we arrived at the same conclusion :)
> >>
> >> To make sure I've understood;
> >>
> >>   - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
> >>   - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
> >>   - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
> >>   - ...
> >>
> >> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
> >> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
> >>
> >> Sounds good to me!
> > Great! Another easy way to think about it is, with max_ptes_none =
> > HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
> > for it to collapse again. Each size is 2x the last, so if we hit one
> > collapse, it will be eligible again next round.
>
> Please someone correct me if I am wrong.
>

max_ptes_none = 204
scaled_none = 204 >> (9 - 3) = ~3.1
so 4 pages need to be available in each chunk for the bit to be set, not 5.

at 204 the bitmap check is
512 - 1 - 204 = 307
(PMD) 307 >> 3 = 38
(1024k) 307 >> 4 = 19
(512k) 307 >> 5 = 9
(256k) 307 >> 6 = 4
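
If it helps, a small userspace sketch (mine, not kernel code) that reproduces
these numbers from the scaled_none / threshold_bits formulas quoted above,
assuming 4K pages and max_ptes_none = 204:

#include <stdio.h>

#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)
#define MIN_MTHP_ORDER	3

int main(void)
{
	int max_ptes_none = 204;
	int scaled_none = max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
	int order;

	printf("scaled_none = %d (a chunk's bit is set if it has fewer than this many empty ptes)\n",
	       scaled_none);

	for (order = HPAGE_PMD_ORDER; order >= MIN_MTHP_ORDER; order--) {
		int state_order = order - MIN_MTHP_ORDER;
		int threshold_bits = (HPAGE_PMD_NR - max_ptes_none - 1)
				     >> (HPAGE_PMD_ORDER - state_order);

		printf("order %d: collapse if more than %d of %d chunk bits are set\n",
		       order, threshold_bits, 1 << state_order);
	}
	return 0;
}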

> Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64
> => #chunks = 64 / 8 = 8.
>
> Let the PTE state within the chunks be as follows:
>
> Chunk 0: < 5 filled   Chunk 1: 5 filled   Chunk 2: 5 filled   Chunk 3: 5
> filled
>
> Chunk 4: 5 filled   Chunk 5: < 5 filled   Chunk 6: < 5 filled   Chunk 7:
> < 5 filled
>
> Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204
> < HPAGE_PMD_NR/2).
> => To collapse we need at least 60% of the PTEs filled.
>
> Your algorithm marks chunks in the bitmap if 60% of the chunk is filled.
> Then, if the number of chunks set is greater than 60%, then we will
> collapse.
>
> Chunk 0 will be marked zero because less than 5 PTEs are filled =>
> percentage filled <= 50%
>
> Right now the state is
> 0111 1000
> where the indices are the chunk numbers.
> Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse
> won't happen.
>
> For the first 4 chunks, the percent filled is 75%.  So the state becomes
> 1111 1000
> after 128K collapse, and now 256K collapse will happen.
>
> Either I got this correct, or I do not understand the utility of
> maintaining chunks :) What you are doing is what I am doing except that
> my chunk size = 1.

Ignoring all the math and just going off the 0111 1000 example:
We do "creep", but it's not the same type of "creep" we've been
describing. The collapse in the first half allows the next order up to
collapse, but it stops there and doesn't keep getting promoted all the
way to PMD size. That is, unless the adjacent 256k also has some bits
set, in which case it can collapse to 512k. So I guess we can still
creep, but it's way less aggressive and only happens when there is
actual memory being utilized in the adjacent chunk, so it's not like we
are creating a huge amount of waste.


>
> >>
> >>> constantly promote mTHPs to the next available size.
> >>>
> >>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 2:     Refactor/rename hpage_collapse
> >>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>> Patch 6-9:   The mTHP patches
> >>>
> >>> ---------
> >>>   Testing
> >>> ---------
> >>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>> - selftests mm
> >>> - I created a test script that I used to push khugepaged to its limits while
> >>>     monitoring a number of stats and tracepoints. The code is available
> >>>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>     The summary from my testings was that there was no significant regression
> >>>     noticed through this test. In some cases my changes had better collapse
> >>>     latencies, and was able to scan more pages in the same amount of time/work,
> >>>     but for the most part the results were consistant.
> >>> - redis testing. I tested these changes along with my defer changes
> >>>    (see followup post for more details).
> >>> - some basic testing on 64k page size.
> >>> - lots of general use. These changes have been running in my VM for some time.
> >>>
> >>> Changes since V1 [2]:
> >>> - Minor bug fixes discovered during review and testing
> >>> - removed dynamic allocations for bitmaps, and made them stack based
> >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>> - Updated trace events to include collapsing order info.
> >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>     the same max_ptes_none scaling principle to achieve this.
> >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>     some of the "creep" that was discovered in v1.
> >>>
> >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>
> >>> Nico Pache (9):
> >>>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>      madvise_collapse
> >>>    khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>    khugepaged: generalize alloc_charge_folio for mTHP support
> >>>    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>    khugepaged: add mTHP support
> >>>    khugepaged: improve tracepoints for mTHP orders
> >>>    khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>   include/linux/khugepaged.h         |   4 +
> >>>   include/trace/events/huge_memory.h |  34 ++-
> >>>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>   3 files changed, 306 insertions(+), 154 deletions(-)
> >>>
> >>
> >
> >
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/9] khugepaged: mTHP support
  2025-02-20 19:12       ` Nico Pache
@ 2025-02-21  4:57         ` Dev Jain
  0 siblings, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-02-21  4:57 UTC (permalink / raw)
  To: Nico Pache
  Cc: Ryan Roberts, linux-kernel, linux-trace-kernel, linux-mm,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 21/02/25 12:42 am, Nico Pache wrote:
> On Wed, Feb 19, 2025 at 2:01 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 19/02/25 4:00 am, Nico Pache wrote:
>>> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 11/02/2025 00:30, Nico Pache wrote:
>>>>> The following series provides khugepaged and madvise collapse with the
>>>>> capability to collapse regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>>>> collapse.
>>>>>
>>>>> With the default max_ptes_none=511, the code should keep its most of its
>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>
>>>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
>>>> This is making my head hurt, but I *think* I agree with you that if
>>>> max_ptes_none is less than half of the number of ptes in a pmd, then creep
>>>> doesn't happen.
>>> Haha yea the compressed bitmap does not make the math super easy to
>>> follow, but i'm glad we arrived at the same conclusion :)
>>>>
>>>> To make sure I've understood;
>>>>
>>>>    - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>>>>    - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>>>>    - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>>>>    - ...
>>>>
>>>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
>>>> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>>>>
>>>> Sounds good to me!
>>> Great! Another easy way to think about it is, with max_ptes_none =
>>> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
>>> for it to collapse again. Each size is 2x the last, so if we hit one
>>> collapse, it will be eligible again next round.
>>
>> Please someone correct me if I am wrong.
>>
> 
> max_ptes_none = 204
> scaled_none = (204 >> 9 - 3) = ~3.1
> so 4 pages need to be available in each chunk for the bit to be set, not 5.
> 
> at 204 the bitmap check is
> 512 - 1 - 204 = 307
> (PMD) 307 >> 3 = 38
> (1024k) 307 >> 4 = 19
> (512k) 307 >> 5 = 9
> (256k) 307 >> 6 = 4
> 
>> Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64
>> => #chunks = 64 / 8 = 8.
>>
>> Let the PTE state within the chunks be as follows:
>>
>> Chunk 0: < 5 filled   Chunk 1: 5 filled   Chunk 2: 5 filled   Chunk 3: 5
>> filled
>>
>> Chunk 4: 5 filled   Chunk 5: < 5 filled   Chunk 6: < 5 filled   Chunk 7:
>> < 5 filled
>>
>> Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204
>> < HPAGE_PMD_NR/2).
>> => To collapse we need at least 60% of the PTEs filled.
>>
>> Your algorithm marks chunks in the bitmap if 60% of the chunk is filled.
>> Then, if the number of chunks set is greater than 60%, then we will
>> collapse.
>>
>> Chunk 0 will be marked zero because less than 5 PTEs are filled =>
>> percentage filled <= 50%
>>
>> Right now the state is
>> 0111 1000
>> where the indices are the chunk numbers.
>> Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse
>> won't happen.
>>
>> For the first 4 chunks, the percent filled is 75%.  So the state becomes
>> 1111 1000
>> after 128K collapse, and now 256K collapse will happen.
>>
>> Either I got this correct, or I do not understand the utility of
>> maintaining chunks :) What you are doing is what I am doing except that
>> my chunk size = 1.
> 
> Ignoring all the math, and just going off the 0111 1000
> We do "creep", but its not the same type of "creep" we've been
> describing. The collapse in the first half will allow the collapse in
> order++ to collapse but it stops there and doesnt keep getting
> promoted to a PMD size. That is unless the adjacent 256k also has some
> bits set, then it can collapse to 512k. So I guess we still can creep,
> but its way less aggressive and only when there is actual memory being
> utilized in the adjacent chunk, so it's not like we are creating a
> huge waste.

I get you. You will creep when the adjacent chunk has at least 1 bit 
set. I don't really have a strong opinion on this one.

> 
> 
>>
>>>>
>>>>> constantly promote mTHPs to the next available size.
>>>>>
>>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>> Patch 6-9:   The mTHP patches
>>>>>
>>>>> ---------
>>>>>    Testing
>>>>> ---------
>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>> - selftests mm
>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>      monitoring a number of stats and tracepoints. The code is available
>>>>>      here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>>>      The summary from my testings was that there was no significant regression
>>>>>      noticed through this test. In some cases my changes had better collapse
>>>>>      latencies, and was able to scan more pages in the same amount of time/work,
>>>>>      but for the most part the results were consistant.
>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>     (see followup post for more details).
>>>>> - some basic testing on 64k page size.
>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>
>>>>> Changes since V1 [2]:
>>>>> - Minor bug fixes discovered during review and testing
>>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>> - Updated trace events to include collapsing order info.
>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>>>      the same max_ptes_none scaling principle to achieve this.
>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>>>      some of the "creep" that was discovered in v1.
>>>>>
>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>
>>>>> Nico Pache (9):
>>>>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>       madvise_collapse
>>>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>     khugepaged: add mTHP support
>>>>>     khugepaged: improve tracepoints for mTHP orders
>>>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>>>
>>>>>    include/linux/khugepaged.h         |   4 +
>>>>>    include/trace/events/huge_memory.h |  34 ++-
>>>>>    mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>    3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>
>>>>
>>>
>>>
>>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-19 16:52   ` Ryan Roberts
@ 2025-03-03 19:13     ` Nico Pache
  2025-03-05  9:11       ` Dev Jain
  2025-03-05  9:07     ` Dev Jain
  1 sibling, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-03-03 19:13 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, dev.jain, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai

On Wed, Feb 19, 2025 at 9:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning a PMD range for potential collapse candidates, keep track
> > of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> > utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> > of max_ptes_none during the scan phase so we dont bailout early and miss
> > potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 83 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c8048d9ec7fb..cd310989725b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /*
> > @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
> > @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> > @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       pte_unmap(pte);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> > @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { //mTHP
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             spin_lock(pte_ptl);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +             spin_unlock(pte_ptl);
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
>
> I've only stared at this briefly, but it feels like there might be some bugs:

Sorry for the delayed response; I needed to catch up on some other
work and wanted to make sure I looked into your questions before
answering.
>
>  - Why are you taking the pmd ptl? and calling pmd_populate? Surely the pte
> table already exists and is attached to the pmd? So we are only need to update
> the pte entries here? Or perhaps the whole pmd was previously isolated?

The previous locking behavior is kept; however, because we are not
installing a NEW pmd entry, we need to repopulate the old PMD (like we
do in the failure path). The PMD entry was cleared earlier to avoid
GUP-fast races.
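Roughly, the sequence in the mTHP case is the sketch below (simplified,
not the exact diff; both helpers are the ones already used in
collapse_huge_page()):

	_pmd = pmdp_collapse_flush(vma, address, pmd);	/* PMD cleared, stops GUP-fast */
	/* ... isolate and copy pages, rewrite the affected PTEs ... */
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));	/* re-attach the original PTE table */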
>
>  - I think some arches use a single PTL for all levels of the pgtable? So in
> this case it's probably not a good idea to nest the pmd and pte spin lock?

Thanks for pointing that out; I corrected it by making sure they don't nest!
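For reference, a rough sketch of the non-nesting version (assuming the
pte_lockptr()/spin_lock_nested() helpers used elsewhere in mm/): only
take the PTE lock when it is actually a different spinlock from the PMD
lock, since configs without split PTE ptlocks share a single lock:

	pmd_ptl = pmd_lock(mm, pmd);
	pte_ptl = pte_lockptr(mm, pmd);
	if (pte_ptl != pmd_ptl)
		spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
	/* ... install the mTHP ptes ... */
	if (pte_ptl != pmd_ptl)
		spin_unlock(pte_ptl);
	spin_unlock(pmd_ptl);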

>
>  - Given the pte PTL is dropped then reacquired, is there any way that the ptes
> could have changed under us? Is any revalidation required? Perhaps not if pte
> table was removed from the PMD.

Correct. I believe we don't even need to take the PTL at all given the
write locks we already hold, but for now I'm trying to keep the locking
changes to a minimum. We can focus on locking optimizations later.

>
>  - I would have guessed the memory ordering you want from smp_wmb() would
> already be handled by the spin_unlock()?

Yes, I think that is correct. I noticed other callers doing this, but
on a second pass those are all lockless paths, so in this case we don't
need it.

>
>
> > +     }
> >
> >       folio = NULL;
> >
> > @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  {
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> > +     int i;
> >       int result = SCAN_FAIL, referenced = 0;
> >       int none_or_zero = 0, shared = 0;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long _address;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >       bool writable = false;
> > -
> > +     int chunk_none_count = 0;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, _address += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             if (i % MIN_MTHP_NR == 0)
> > +                     chunk_none_count = 0;
> > +
> > +             _pte = pte + i;
> > +             _address = address + i * PAGE_SIZE;
> >               pte_t pteval = ptep_get(_pte);
> >               if (is_swap_pte(pteval)) {
> >                       ++unmapped;
> > @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                     ++chunk_none_count;
> >                       ++none_or_zero;
> > -                     if (!userfaultfd_armed(vma) &&
> > -                         (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > -                             continue;
> > -                     } else {
> > +                     if (userfaultfd_armed(vma)) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                               goto out_unmap;
> >                       }
> > +                     continue;
> >               }
> >               if (pte_uffd_wp(pteval)) {
> >                       /*
> > @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                    folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
> >                                                                    address)))
> >                       referenced++;
> > +
> > +             /*
> > +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> > +              * pages keep track of it in the bitmap for mTHP collapsing.
> > +              */
> > +             if (chunk_none_count < scaled_none &&
> > +                     (i + 1) % MIN_MTHP_NR == 0)
> > +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
> >       }
> > +
> >       if (!writable) {
> >               result = SCAN_PAGE_RO;
> >       } else if (cc->is_khugepaged &&
> > @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > -             result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +                     tva_flags, THP_ORDERS_ALL_ANON);
> > +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > +                            mmap_locked, enabled_orders);
> > +             if (result > 0)
> > +                     result = SCAN_SUCCEED;
> > +             else
> > +                     result = SCAN_FAIL;
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
> >                       fput(file);
> >                       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >                               mmap_read_lock(mm);
> > +                             *mmap_locked = true;
> >                               if (khugepaged_test_exit_or_disable(mm))
> >                                       goto end;
> >                               result = collapse_pte_mapped_thp(mm, addr,
> >                                                                !cc->is_khugepaged);
> >                               mmap_read_unlock(mm);
> > +                             *mmap_locked = false;
> >                       }
> >               } else {
> >                       result = khugepaged_scan_pmd(mm, vma, addr,
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-18  4:22   ` Dev Jain
@ 2025-03-03 19:18     ` Nico Pache
  2025-03-04  5:10       ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-03-03 19:18 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, akpm, rostedt, mathieu.desnoyers, tiwai

On Mon, Feb 17, 2025 at 9:23 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/02/25 6:00 am, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning a PMD range for potential collapse candidates, keep track
> > of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> > utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> > of max_ptes_none during the scan phase so we dont bailout early and miss
> > potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
> >   1 file changed, 83 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c8048d9ec7fb..cd310989725b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >   {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /*
> > @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
>
> As I have mentioned before, since you are isolating the PTE table in
> both cases, you need to do the mmu_notifier_* stuff for HPAGE_PMD_SIZE
> in any case. Check out patch 8 from my patchset.

Why do we need to invalidate the whole PMD if we are only changing a
section of it and no one can touch this memory?
>
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
> > @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> > @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       pte_unmap(pte);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> > @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
>
> My personal opinion is that this if-else nesting looks really weird.
> This is why I separated out mTHP collapse into a separate function.
>
>
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { //mTHP
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             spin_lock(pte_ptl);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +             spin_unlock(pte_ptl);
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     }
> >
> >       folio = NULL;
> >
> > @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >   {
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> > +     int i;
> >       int result = SCAN_FAIL, referenced = 0;
> >       int none_or_zero = 0, shared = 0;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long _address;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >       bool writable = false;
> > -
> > +     int chunk_none_count = 0;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, _address += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             if (i % MIN_MTHP_NR == 0)
> > +                     chunk_none_count = 0;
> > +
> > +             _pte = pte + i;
> > +             _address = address + i * PAGE_SIZE;
> >               pte_t pteval = ptep_get(_pte);
> >               if (is_swap_pte(pteval)) {
> >                       ++unmapped;
> > @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                     ++chunk_none_count;
> >                       ++none_or_zero;
> > -                     if (!userfaultfd_armed(vma) &&
> > -                         (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > -                             continue;
> > -                     } else {
> > +                     if (userfaultfd_armed(vma)) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                               goto out_unmap;
> >                       }
> > +                     continue;
> >               }
> >               if (pte_uffd_wp(pteval)) {
> >                       /*
> > @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                    folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
> >                                                                    address)))
> >                       referenced++;
> > +
> > +             /*
> > +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> > +              * pages keep track of it in the bitmap for mTHP collapsing.
> > +              */
> > +             if (chunk_none_count < scaled_none &&
> > +                     (i + 1) % MIN_MTHP_NR == 0)
> > +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
> >       }
> > +
> >       if (!writable) {
> >               result = SCAN_PAGE_RO;
> >       } else if (cc->is_khugepaged &&
> > @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >   out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > -             result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +                     tva_flags, THP_ORDERS_ALL_ANON);
> > +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > +                            mmap_locked, enabled_orders);
> > +             if (result > 0)
> > +                     result = SCAN_SUCCEED;
> > +             else
> > +                     result = SCAN_FAIL;
> >       }
> >   out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
> >                       fput(file);
> >                       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >                               mmap_read_lock(mm);
> > +                             *mmap_locked = true;
> >                               if (khugepaged_test_exit_or_disable(mm))
> >                                       goto end;
> >                               result = collapse_pte_mapped_thp(mm, addr,
> >                                                                !cc->is_khugepaged);
> >                               mmap_read_unlock(mm);
> > +                             *mmap_locked = false;
> >                       }
> >               } else {
> >                       result = khugepaged_scan_pmd(mm, vma, addr,
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-03-03 19:18     ` Nico Pache
@ 2025-03-04  5:10       ` Dev Jain
  0 siblings, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-03-04  5:10 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, akpm, rostedt, mathieu.desnoyers, tiwai



On 04/03/25 12:48 am, Nico Pache wrote:
> On Mon, Feb 17, 2025 at 9:23 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 11/02/25 6:00 am, Nico Pache wrote:
>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>> While scanning a PMD range for potential collapse candidates, keep track
>>> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
>>> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
>>> of max_ptes_none during the scan phase so we dont bailout early and miss
>>> potential mTHP candidates.
>>>
>>> After the scan is complete we will perform binary recursion on the
>>> bitmap to determine which mTHP size would be most efficient to collapse
>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>> determine how full a THP must be to be eligible.
>>>
>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>> pages, we dont perform the collapse.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>    mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>>>    1 file changed, 83 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index c8048d9ec7fb..cd310989725b 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>    {
>>>        LIST_HEAD(compound_pagelist);
>>>        pmd_t *pmd, _pmd;
>>> -     pte_t *pte;
>>> +     pte_t *pte, mthp_pte;
>>>        pgtable_t pgtable;
>>>        struct folio *folio;
>>>        spinlock_t *pmd_ptl, *pte_ptl;
>>>        int result = SCAN_FAIL;
>>>        struct vm_area_struct *vma;
>>>        struct mmu_notifier_range range;
>>> +     unsigned long _address = address + offset * PAGE_SIZE;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>>        /*
>>> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                *mmap_locked = false;
>>>        }
>>>
>>> -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>> +     result = alloc_charge_folio(&folio, mm, cc, order);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out_nolock;
>>>
>>>        mmap_read_lock(mm);
>>> -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>>> +     *mmap_locked = true;
>>> +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>>        if (result != SCAN_SUCCEED) {
>>>                mmap_read_unlock(mm);
>>>                goto out_nolock;
>>> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                 * released when it fails. So we jump out_nolock directly in
>>>                 * that case.  Continuing to collapse causes inconsistency.
>>>                 */
>>> -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>>> -                             referenced, HPAGE_PMD_ORDER);
>>> +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
>>> +                             referenced, order);
>>>                if (result != SCAN_SUCCEED)
>>>                        goto out_nolock;
>>>        }
>>>
>>>        mmap_read_unlock(mm);
>>> +     *mmap_locked = false;
>>>        /*
>>>         * Prevent all access to pagetables with the exception of
>>>         * gup_fast later handled by the ptep_clear_flush and the VM
>>> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * mmap_lock.
>>>         */
>>>        mmap_write_lock(mm);
>>> -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>>> +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out_up_write;
>>>        /* check if the pmd is still valid */
>>> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        vma_start_write(vma);
>>>        anon_vma_lock_write(vma->anon_vma);
>>>
>>> -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -                             address + HPAGE_PMD_SIZE);
>>> +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>> +                             _address + (PAGE_SIZE << order));
>>
>> As I have mentioned before, since you are isolating the PTE table in
>> both cases, you need to do the mmu_notifier_* stuff for HPAGE_PMD_SIZE
>> in any case. Check out patch 8 from my patchset.
> 
> Why do we need to invalidate the whole PMD if we are only changing a
> section of it and no one can touch this memory?

I confess I do not understand mmu_notifiers properly, but on a closer
look I think you are correct; they are used to tell secondary MMUs/TLBs
that the entries for the corresponding PTEs are now stale. In that
respect your change makes sense. I had just assumed that we are, in a
sense, "invalidating" the whole PMD region since we are isolating the
PTE table.

>>
>>>        mmu_notifier_invalidate_range_start(&range);
>>>
>>>        pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> +
>>>        /*
>>>         * This removes any huge TLB entry from the CPU so we won't allow
>>>         * huge and small TLB entries for the same virtual address to
>>> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        mmu_notifier_invalidate_range_end(&range);
>>>        tlb_remove_table_sync_one();
>>>
>>> -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>>>        if (pte) {
>>> -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
>>> -                                     &compound_pagelist, HPAGE_PMD_ORDER);
>>> +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
>>> +                                     &compound_pagelist, order);
>>>                spin_unlock(pte_ptl);
>>>        } else {
>>>                result = SCAN_PMD_NULL;
>>> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        anon_vma_unlock_write(vma->anon_vma);
>>>
>>>        result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>> -                                        vma, address, pte_ptl,
>>> -                                        &compound_pagelist, HPAGE_PMD_ORDER);
>>> +                                        vma, _address, pte_ptl,
>>> +                                        &compound_pagelist, order);
>>>        pte_unmap(pte);
>>>        if (unlikely(result != SCAN_SUCCEED))
>>>                goto out_up_write;
>>> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * write.
>>>         */
>>>        __folio_mark_uptodate(folio);
>>> -     pgtable = pmd_pgtable(_pmd);
>>> -
>>> -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>> -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> -
>>> -     spin_lock(pmd_ptl);
>>> -     BUG_ON(!pmd_none(*pmd));
>>> -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>> -     folio_add_lru_vma(folio, vma);
>>> -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> -     set_pmd_at(mm, address, pmd, _pmd);
>>> -     update_mmu_cache_pmd(vma, address, pmd);
>>> -     deferred_split_folio(folio, false);
>>> -     spin_unlock(pmd_ptl);
>>
>> My personal opinion is that this if-else nesting looks really weird.
>> This is why I separated out mTHP collapse into a separate function.
>>
>>
>>> +     if (order == HPAGE_PMD_ORDER) {
>>> +             pgtable = pmd_pgtable(_pmd);
>>> +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>> +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> +
>>> +             spin_lock(pmd_ptl);
>>> +             BUG_ON(!pmd_none(*pmd));
>>> +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> +             folio_add_lru_vma(folio, vma);
>>> +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> +             set_pmd_at(mm, address, pmd, _pmd);
>>> +             update_mmu_cache_pmd(vma, address, pmd);
>>> +             deferred_split_folio(folio, false);
>>> +             spin_unlock(pmd_ptl);
>>> +     } else { //mTHP
>>> +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>>> +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>>> +
>>> +             spin_lock(pmd_ptl);
>>> +             folio_ref_add(folio, (1 << order) - 1);
>>> +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> +             folio_add_lru_vma(folio, vma);
>>> +             spin_lock(pte_ptl);
>>> +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
>>> +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
>>> +             spin_unlock(pte_ptl);
>>> +             smp_wmb(); /* make pte visible before pmd */
>>> +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> +             deferred_split_folio(folio, false);
>>> +             spin_unlock(pmd_ptl);
>>> +     }
>>>
>>>        folio = NULL;
>>>
>>> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    {
>>>        pmd_t *pmd;
>>>        pte_t *pte, *_pte;
>>> +     int i;
>>>        int result = SCAN_FAIL, referenced = 0;
>>>        int none_or_zero = 0, shared = 0;
>>>        struct page *page = NULL;
>>>        struct folio *folio = NULL;
>>>        unsigned long _address;
>>> +     unsigned long enabled_orders;
>>>        spinlock_t *ptl;
>>>        int node = NUMA_NO_NODE, unmapped = 0;
>>>        bool writable = false;
>>> -
>>> +     int chunk_none_count = 0;
>>> +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
>>> +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>>        result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out;
>>>
>>> +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>>> +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>>        memset(cc->node_load, 0, sizeof(cc->node_load));
>>>        nodes_clear(cc->alloc_nmask);
>>>        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                goto out;
>>>        }
>>>
>>> -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> -          _pte++, _address += PAGE_SIZE) {
>>> +     for (i = 0; i < HPAGE_PMD_NR; i++) {
>>> +             if (i % MIN_MTHP_NR == 0)
>>> +                     chunk_none_count = 0;
>>> +
>>> +             _pte = pte + i;
>>> +             _address = address + i * PAGE_SIZE;
>>>                pte_t pteval = ptep_get(_pte);
>>>                if (is_swap_pte(pteval)) {
>>>                        ++unmapped;
>>> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                        }
>>>                }
>>>                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> +                     ++chunk_none_count;
>>>                        ++none_or_zero;
>>> -                     if (!userfaultfd_armed(vma) &&
>>> -                         (!cc->is_khugepaged ||
>>> -                          none_or_zero <= khugepaged_max_ptes_none)) {
>>> -                             continue;
>>> -                     } else {
>>> +                     if (userfaultfd_armed(vma)) {
>>>                                result = SCAN_EXCEED_NONE_PTE;
>>>                                count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>>                                goto out_unmap;
>>>                        }
>>> +                     continue;
>>>                }
>>>                if (pte_uffd_wp(pteval)) {
>>>                        /*
>>> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>>>                                                                     address)))
>>>                        referenced++;
>>> +
>>> +             /*
>>> +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
>>> +              * pages keep track of it in the bitmap for mTHP collapsing.
>>> +              */
>>> +             if (chunk_none_count < scaled_none &&
>>> +                     (i + 1) % MIN_MTHP_NR == 0)
>>> +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>>>        }
>>> +
>>>        if (!writable) {
>>>                result = SCAN_PAGE_RO;
>>>        } else if (cc->is_khugepaged &&
>>> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    out_unmap:
>>>        pte_unmap_unlock(pte, ptl);
>>>        if (result == SCAN_SUCCEED) {
>>> -             result = collapse_huge_page(mm, address, referenced,
>>> -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>>> -             /* collapse_huge_page will return with the mmap_lock released */
>>> -             *mmap_locked = false;
>>> +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>>> +                     tva_flags, THP_ORDERS_ALL_ANON);
>>> +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
>>> +                            mmap_locked, enabled_orders);
>>> +             if (result > 0)
>>> +                     result = SCAN_SUCCEED;
>>> +             else
>>> +                     result = SCAN_FAIL;
>>>        }
>>>    out:
>>>        trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>>> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>>>                        fput(file);
>>>                        if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>>                                mmap_read_lock(mm);
>>> +                             *mmap_locked = true;
>>>                                if (khugepaged_test_exit_or_disable(mm))
>>>                                        goto end;
>>>                                result = collapse_pte_mapped_thp(mm, addr,
>>>                                                                 !cc->is_khugepaged);
>>>                                mmap_read_unlock(mm);
>>> +                             *mmap_locked = false;
>>>                        }
>>>                } else {
>>>                        result = khugepaged_scan_pmd(mm, vma, addr,
>>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-19 16:52   ` Ryan Roberts
  2025-03-03 19:13     ` Nico Pache
@ 2025-03-05  9:07     ` Dev Jain
  1 sibling, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-03-05  9:07 UTC (permalink / raw)
  To: Ryan Roberts, Nico Pache, linux-kernel, linux-trace-kernel,
	linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 19/02/25 10:22 pm, Ryan Roberts wrote:
> On 11/02/2025 00:30, Nico Pache wrote:
>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>> While scanning a PMD range for potential collapse candidates, keep track
>> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
>> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
>> of max_ptes_none during the scan phase so we dont bailout early and miss
>> potential mTHP candidates.
>>
>> After the scan is complete we will perform binary recursion on the
>> bitmap to determine which mTHP size would be most efficient to collapse
>> to. max_ptes_none will be scaled by the attempted collapse order to
>> determine how full a THP must be to be eligible.
>>
>> If a mTHP collapse is attempted, but contains swapped out, or shared
>> pages, we dont perform the collapse.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 83 insertions(+), 39 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index c8048d9ec7fb..cd310989725b 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   {
>>   	LIST_HEAD(compound_pagelist);
>>   	pmd_t *pmd, _pmd;
>> -	pte_t *pte;
>> +	pte_t *pte, mthp_pte;
>>   	pgtable_t pgtable;
>>   	struct folio *folio;
>>   	spinlock_t *pmd_ptl, *pte_ptl;
>>   	int result = SCAN_FAIL;
>>   	struct vm_area_struct *vma;
>>   	struct mmu_notifier_range range;
>> +	unsigned long _address = address + offset * PAGE_SIZE;
>>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>   
>>   	/*
>> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   		*mmap_locked = false;
>>   	}
>>   
>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>>   	if (result != SCAN_SUCCEED)
>>   		goto out_nolock;
>>   
>>   	mmap_read_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>> +	*mmap_locked = true;
>> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>   	if (result != SCAN_SUCCEED) {
>>   		mmap_read_unlock(mm);
>>   		goto out_nolock;
>> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   		 * released when it fails. So we jump out_nolock directly in
>>   		 * that case.  Continuing to collapse causes inconsistency.
>>   		 */
>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -				referenced, HPAGE_PMD_ORDER);
>> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
>> +				referenced, order);
>>   		if (result != SCAN_SUCCEED)
>>   			goto out_nolock;
>>   	}
>>   
>>   	mmap_read_unlock(mm);
>> +	*mmap_locked = false;
>>   	/*
>>   	 * Prevent all access to pagetables with the exception of
>>   	 * gup_fast later handled by the ptep_clear_flush and the VM
>> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   	 * mmap_lock.
>>   	 */
>>   	mmap_write_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>   	if (result != SCAN_SUCCEED)
>>   		goto out_up_write;
>>   	/* check if the pmd is still valid */
>> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   	vma_start_write(vma);
>>   	anon_vma_lock_write(vma->anon_vma);
>>   
>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -				address + HPAGE_PMD_SIZE);
>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>> +				_address + (PAGE_SIZE << order));
>>   	mmu_notifier_invalidate_range_start(&range);
>>   
>>   	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> +
>>   	/*
>>   	 * This removes any huge TLB entry from the CPU so we won't allow
>>   	 * huge and small TLB entries for the same virtual address to
>> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   	mmu_notifier_invalidate_range_end(&range);
>>   	tlb_remove_table_sync_one();
>>   
>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>>   	if (pte) {
>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -					&compound_pagelist, HPAGE_PMD_ORDER);
>> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
>> +					&compound_pagelist, order);
>>   		spin_unlock(pte_ptl);
>>   	} else {
>>   		result = SCAN_PMD_NULL;
>> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   	anon_vma_unlock_write(vma->anon_vma);
>>   
>>   	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -					   vma, address, pte_ptl,
>> -					   &compound_pagelist, HPAGE_PMD_ORDER);
>> +					   vma, _address, pte_ptl,
>> +					   &compound_pagelist, order);
>>   	pte_unmap(pte);
>>   	if (unlikely(result != SCAN_SUCCEED))
>>   		goto out_up_write;
>> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   	 * write.
>>   	 */
>>   	__folio_mark_uptodate(folio);
>> -	pgtable = pmd_pgtable(_pmd);
>> -
>> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>> -
>> -	spin_lock(pmd_ptl);
>> -	BUG_ON(!pmd_none(*pmd));
>> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>> -	folio_add_lru_vma(folio, vma);
>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -	set_pmd_at(mm, address, pmd, _pmd);
>> -	update_mmu_cache_pmd(vma, address, pmd);
>> -	deferred_split_folio(folio, false);
>> -	spin_unlock(pmd_ptl);
>> +	if (order == HPAGE_PMD_ORDER) {
>> +		pgtable = pmd_pgtable(_pmd);
>> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>> +
>> +		spin_lock(pmd_ptl);
>> +		BUG_ON(!pmd_none(*pmd));
>> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>> +		folio_add_lru_vma(folio, vma);
>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +		set_pmd_at(mm, address, pmd, _pmd);
>> +		update_mmu_cache_pmd(vma, address, pmd);
>> +		deferred_split_folio(folio, false);
>> +		spin_unlock(pmd_ptl);
>> +	} else { //mTHP
>> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>> +
>> +		spin_lock(pmd_ptl);
>> +		folio_ref_add(folio, (1 << order) - 1);
>> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>> +		folio_add_lru_vma(folio, vma);
>> +		spin_lock(pte_ptl);
>> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
>> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
>> +		spin_unlock(pte_ptl);
>> +		smp_wmb(); /* make pte visible before pmd */
>> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> +		deferred_split_folio(folio, false);
>> +		spin_unlock(pmd_ptl);
> 
> I've only stared at this briefly, but it feels like there might be some bugs:
> 
>   - Why are you taking the pmd ptl? and calling pmd_populate? Surely the pte
> table already exists and is attached to the pmd? So we are only need to update
> the pte entries here? Or perhaps the whole pmd was previously isolated?
> 
>   - I think some arches use a single PTL for all levels of the pgtable? So in
> this case it's probably not a good idea to nest the pmd and pte spin lock?
> 
>   - Given the pte PTL is dropped then reacquired, is there any way that the ptes
> could have changed under us? Is any revalidation required? Perhaps not if pte
> table was removed from the PMD.
> 
>   - I would have guessed the memory ordering you want from smp_wmb() would
> already be handled by the spin_unlock()?

Not sure I understand this; in pmd_install() we take the PMD lock and
do smp_wmb() before dropping the lock, so how is this case different?
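For context, the pmd_install() pattern I have in mind is roughly the
following (condensed and paraphrased from mm/memory.c):

	spinlock_t *ptl = pmd_lock(mm, pmd);

	if (likely(pmd_none(*pmd))) {
		mm_inc_nr_ptes(mm);
		smp_wmb(); /* pte table contents visible before the pmd entry */
		pmd_populate(mm, pmd, *pte);
		*pte = NULL;
	}
	spin_unlock(ptl);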

> 
> 
>> +	}
>>   
>>   	folio = NULL;
>>   
>> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   {
>>   	pmd_t *pmd;
>>   	pte_t *pte, *_pte;
>> +	int i;
>>   	int result = SCAN_FAIL, referenced = 0;
>>   	int none_or_zero = 0, shared = 0;
>>   	struct page *page = NULL;
>>   	struct folio *folio = NULL;
>>   	unsigned long _address;
>> +	unsigned long enabled_orders;
>>   	spinlock_t *ptl;
>>   	int node = NUMA_NO_NODE, unmapped = 0;
>>   	bool writable = false;
>> -
>> +	int chunk_none_count = 0;
>> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
>> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>   
>>   	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>   	if (result != SCAN_SUCCEED)
>>   		goto out;
>>   
>> +	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>> +	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>>   	nodes_clear(cc->alloc_nmask);
>>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   		goto out;
>>   	}
>>   
>> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> -	     _pte++, _address += PAGE_SIZE) {
>> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
>> +		if (i % MIN_MTHP_NR == 0)
>> +			chunk_none_count = 0;
>> +
>> +		_pte = pte + i;
>> +		_address = address + i * PAGE_SIZE;
>>   		pte_t pteval = ptep_get(_pte);
>>   		if (is_swap_pte(pteval)) {
>>   			++unmapped;
>> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   			}
>>   		}
>>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> +			++chunk_none_count;
>>   			++none_or_zero;
>> -			if (!userfaultfd_armed(vma) &&
>> -			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> -				continue;
>> -			} else {
>> +			if (userfaultfd_armed(vma)) {
>>   				result = SCAN_EXCEED_NONE_PTE;
>>   				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>   				goto out_unmap;
>>   			}
>> +			continue;
>>   		}
>>   		if (pte_uffd_wp(pteval)) {
>>   			/*
>> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>>   								     address)))
>>   			referenced++;
>> +
>> +		/*
>> +		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
>> +		 * pages keep track of it in the bitmap for mTHP collapsing.
>> +		 */
>> +		if (chunk_none_count < scaled_none &&
>> +			(i + 1) % MIN_MTHP_NR == 0)
>> +			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>>   	}
>> +
>>   	if (!writable) {
>>   		result = SCAN_PAGE_RO;
>>   	} else if (cc->is_khugepaged &&
>> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   out_unmap:
>>   	pte_unmap_unlock(pte, ptl);
>>   	if (result == SCAN_SUCCEED) {
>> -		result = collapse_huge_page(mm, address, referenced,
>> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>> -		/* collapse_huge_page will return with the mmap_lock released */
>> -		*mmap_locked = false;
>> +		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +			tva_flags, THP_ORDERS_ALL_ANON);
>> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
>> +			       mmap_locked, enabled_orders);
>> +		if (result > 0)
>> +			result = SCAN_SUCCEED;
>> +		else
>> +			result = SCAN_FAIL;
>>   	}
>>   out:
>>   	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>>   			fput(file);
>>   			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>   				mmap_read_lock(mm);
>> +				*mmap_locked = true;
>>   				if (khugepaged_test_exit_or_disable(mm))
>>   					goto end;
>>   				result = collapse_pte_mapped_thp(mm, addr,
>>   								 !cc->is_khugepaged);
>>   				mmap_read_unlock(mm);
>> +				*mmap_locked = false;
>>   			}
>>   		} else {
>>   			result = khugepaged_scan_pmd(mm, vma, addr,
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-03-03 19:13     ` Nico Pache
@ 2025-03-05  9:11       ` Dev Jain
  0 siblings, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-03-05  9:11 UTC (permalink / raw)
  To: Nico Pache, Ryan Roberts
  Cc: linux-kernel, linux-trace-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm, rostedt, mathieu.desnoyers, tiwai



On 04/03/25 12:43 am, Nico Pache wrote:
> On Wed, Feb 19, 2025 at 9:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 11/02/2025 00:30, Nico Pache wrote:
>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>> While scanning a PMD range for potential collapse candidates, keep track
>>> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
>>> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
>>> of max_ptes_none during the scan phase so we dont bailout early and miss
>>> potential mTHP candidates.
>>>
>>> After the scan is complete we will perform binary recursion on the
>>> bitmap to determine which mTHP size would be most efficient to collapse
>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>> determine how full a THP must be to be eligible.
>>>
>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>> pages, we dont perform the collapse.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>   mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>>>   1 file changed, 83 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index c8048d9ec7fb..cd310989725b 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>   {
>>>        LIST_HEAD(compound_pagelist);
>>>        pmd_t *pmd, _pmd;
>>> -     pte_t *pte;
>>> +     pte_t *pte, mthp_pte;
>>>        pgtable_t pgtable;
>>>        struct folio *folio;
>>>        spinlock_t *pmd_ptl, *pte_ptl;
>>>        int result = SCAN_FAIL;
>>>        struct vm_area_struct *vma;
>>>        struct mmu_notifier_range range;
>>> +     unsigned long _address = address + offset * PAGE_SIZE;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>>        /*
>>> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                *mmap_locked = false;
>>>        }
>>>
>>> -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>> +     result = alloc_charge_folio(&folio, mm, cc, order);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out_nolock;
>>>
>>>        mmap_read_lock(mm);
>>> -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>>> +     *mmap_locked = true;
>>> +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>>        if (result != SCAN_SUCCEED) {
>>>                mmap_read_unlock(mm);
>>>                goto out_nolock;
>>> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                 * released when it fails. So we jump out_nolock directly in
>>>                 * that case.  Continuing to collapse causes inconsistency.
>>>                 */
>>> -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>>> -                             referenced, HPAGE_PMD_ORDER);
>>> +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
>>> +                             referenced, order);
>>>                if (result != SCAN_SUCCEED)
>>>                        goto out_nolock;
>>>        }
>>>
>>>        mmap_read_unlock(mm);
>>> +     *mmap_locked = false;
>>>        /*
>>>         * Prevent all access to pagetables with the exception of
>>>         * gup_fast later handled by the ptep_clear_flush and the VM
>>> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * mmap_lock.
>>>         */
>>>        mmap_write_lock(mm);
>>> -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>>> +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out_up_write;
>>>        /* check if the pmd is still valid */
>>> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        vma_start_write(vma);
>>>        anon_vma_lock_write(vma->anon_vma);
>>>
>>> -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -                             address + HPAGE_PMD_SIZE);
>>> +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>> +                             _address + (PAGE_SIZE << order));
>>>        mmu_notifier_invalidate_range_start(&range);
>>>
>>>        pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> +
>>>        /*
>>>         * This removes any huge TLB entry from the CPU so we won't allow
>>>         * huge and small TLB entries for the same virtual address to
>>> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        mmu_notifier_invalidate_range_end(&range);
>>>        tlb_remove_table_sync_one();
>>>
>>> -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>>>        if (pte) {
>>> -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
>>> -                                     &compound_pagelist, HPAGE_PMD_ORDER);
>>> +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
>>> +                                     &compound_pagelist, order);
>>>                spin_unlock(pte_ptl);
>>>        } else {
>>>                result = SCAN_PMD_NULL;
>>> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        anon_vma_unlock_write(vma->anon_vma);
>>>
>>>        result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>> -                                        vma, address, pte_ptl,
>>> -                                        &compound_pagelist, HPAGE_PMD_ORDER);
>>> +                                        vma, _address, pte_ptl,
>>> +                                        &compound_pagelist, order);
>>>        pte_unmap(pte);
>>>        if (unlikely(result != SCAN_SUCCEED))
>>>                goto out_up_write;
>>> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * write.
>>>         */
>>>        __folio_mark_uptodate(folio);
>>> -     pgtable = pmd_pgtable(_pmd);
>>> -
>>> -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>> -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> -
>>> -     spin_lock(pmd_ptl);
>>> -     BUG_ON(!pmd_none(*pmd));
>>> -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>> -     folio_add_lru_vma(folio, vma);
>>> -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> -     set_pmd_at(mm, address, pmd, _pmd);
>>> -     update_mmu_cache_pmd(vma, address, pmd);
>>> -     deferred_split_folio(folio, false);
>>> -     spin_unlock(pmd_ptl);
>>> +     if (order == HPAGE_PMD_ORDER) {
>>> +             pgtable = pmd_pgtable(_pmd);
>>> +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>> +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> +
>>> +             spin_lock(pmd_ptl);
>>> +             BUG_ON(!pmd_none(*pmd));
>>> +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> +             folio_add_lru_vma(folio, vma);
>>> +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> +             set_pmd_at(mm, address, pmd, _pmd);
>>> +             update_mmu_cache_pmd(vma, address, pmd);
>>> +             deferred_split_folio(folio, false);
>>> +             spin_unlock(pmd_ptl);
>>> +     } else { //mTHP
>>> +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>>> +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>>> +
>>> +             spin_lock(pmd_ptl);
>>> +             folio_ref_add(folio, (1 << order) - 1);
>>> +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> +             folio_add_lru_vma(folio, vma);
>>> +             spin_lock(pte_ptl);
>>> +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
>>> +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
>>> +             spin_unlock(pte_ptl);
>>> +             smp_wmb(); /* make pte visible before pmd */
>>> +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> +             deferred_split_folio(folio, false);
>>> +             spin_unlock(pmd_ptl);
>>
>> I've only stared at this briefly, but it feels like there might be some bugs:
> 
> Sorry for the delayed response, I needed to catch up on some other
> work and wanted to make sure I looked into your questions before
> answering.
>>
>>   - Why are you taking the pmd ptl? and calling pmd_populate? Surely the pte
>> table already exists and is attached to the pmd? So we are only need to update
>> the pte entries here? Or perhaps the whole pmd was previously isolated?
> 
> The previous locking behavior is kept; however, because we are not
> installing a NEW pmd we need to repopulate the old PMD (like we do in
> the fail case). The PMD entry was cleared to avoid GUP-fast races.
>>
>>   - I think some arches use a single PTL for all levels of the pgtable? So in
>> this case it's probably not a good idea to nest the pmd and pte spin lock?
> 
> Thanks for pointing that out, I corrected it by making sure they don't nest!
> 
>>
>>   - Given the pte PTL is dropped then reacquired, is there any way that the ptes
>> could have changed under us? Is any revalidation required? Perhaps not if pte
>> table was removed from the PMD.
> 
> Correct, I believe we don't even need to take the PTL because of all
> the write locks we took, but for now I'm trying to keep the locking
> changes to a minimum. We can focus on locking optimizations later.

Even the current code is sprinkled with comments like "Probably 
unnecessary" when, for example, taking the PTL around set_ptes(). IMHO 
let us follow the same logic for now and think about dropping the 
spinlocks later, so I would prefer taking the PMD PTL around 
pmd_populate() and the PTE PTL around set_ptes().
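
Roughly, keeping both locks but, per the earlier point about arches with
a single page-table PTL, not nesting them (a sketch only, not the exact
code in the patch):

	/*
	 * Illustrative sketch, not the posted code: the PTE PTL covers
	 * installing the mTHP ptes, the PMD PTL covers re-installing
	 * the pte table, and the two locks are never held at the same
	 * time.
	 */
	spin_lock(pte_ptl);
	set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
	update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
	spin_unlock(pte_ptl);

	spin_lock(pmd_ptl);
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	spin_unlock(pmd_ptl);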

> 
>>
>>   - I would have guessed the memory ordering you want from smp_wmb() would
>> already be handled by the spin_unlock()?
> 
> Yes, I think that is correct. I noticed other callers doing this, but
> on a second pass those are all lockless, so in this case we don't need
> it.
> 
>>
>>
>>> +     }
>>>
>>>        folio = NULL;
>>>
>>> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>   {
>>>        pmd_t *pmd;
>>>        pte_t *pte, *_pte;
>>> +     int i;
>>>        int result = SCAN_FAIL, referenced = 0;
>>>        int none_or_zero = 0, shared = 0;
>>>        struct page *page = NULL;
>>>        struct folio *folio = NULL;
>>>        unsigned long _address;
>>> +     unsigned long enabled_orders;
>>>        spinlock_t *ptl;
>>>        int node = NUMA_NO_NODE, unmapped = 0;
>>>        bool writable = false;
>>> -
>>> +     int chunk_none_count = 0;
>>> +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
>>> +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>>        result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out;
>>>
>>> +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>>> +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>>        memset(cc->node_load, 0, sizeof(cc->node_load));
>>>        nodes_clear(cc->alloc_nmask);
>>>        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                goto out;
>>>        }
>>>
>>> -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> -          _pte++, _address += PAGE_SIZE) {
>>> +     for (i = 0; i < HPAGE_PMD_NR; i++) {
>>> +             if (i % MIN_MTHP_NR == 0)
>>> +                     chunk_none_count = 0;
>>> +
>>> +             _pte = pte + i;
>>> +             _address = address + i * PAGE_SIZE;
>>>                pte_t pteval = ptep_get(_pte);
>>>                if (is_swap_pte(pteval)) {
>>>                        ++unmapped;
>>> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                        }
>>>                }
>>>                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> +                     ++chunk_none_count;
>>>                        ++none_or_zero;
>>> -                     if (!userfaultfd_armed(vma) &&
>>> -                         (!cc->is_khugepaged ||
>>> -                          none_or_zero <= khugepaged_max_ptes_none)) {
>>> -                             continue;
>>> -                     } else {
>>> +                     if (userfaultfd_armed(vma)) {
>>>                                result = SCAN_EXCEED_NONE_PTE;
>>>                                count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>>                                goto out_unmap;
>>>                        }
>>> +                     continue;
>>>                }
>>>                if (pte_uffd_wp(pteval)) {
>>>                        /*
>>> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>>>                                                                     address)))
>>>                        referenced++;
>>> +
>>> +             /*
>>> +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
>>> +              * pages keep track of it in the bitmap for mTHP collapsing.
>>> +              */
>>> +             if (chunk_none_count < scaled_none &&
>>> +                     (i + 1) % MIN_MTHP_NR == 0)
>>> +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>>>        }
>>> +
>>>        if (!writable) {
>>>                result = SCAN_PAGE_RO;
>>>        } else if (cc->is_khugepaged &&
>>> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>   out_unmap:
>>>        pte_unmap_unlock(pte, ptl);
>>>        if (result == SCAN_SUCCEED) {
>>> -             result = collapse_huge_page(mm, address, referenced,
>>> -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>>> -             /* collapse_huge_page will return with the mmap_lock released */
>>> -             *mmap_locked = false;
>>> +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>>> +                     tva_flags, THP_ORDERS_ALL_ANON);
>>> +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
>>> +                            mmap_locked, enabled_orders);
>>> +             if (result > 0)
>>> +                     result = SCAN_SUCCEED;
>>> +             else
>>> +                     result = SCAN_FAIL;
>>>        }
>>>   out:
>>>        trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>>> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>>>                        fput(file);
>>>                        if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>>                                mmap_read_lock(mm);
>>> +                             *mmap_locked = true;
>>>                                if (khugepaged_test_exit_or_disable(mm))
>>>                                        goto end;
>>>                                result = collapse_pte_mapped_thp(mm, addr,
>>>                                                                 !cc->is_khugepaged);
>>>                                mmap_read_unlock(mm);
>>> +                             *mmap_locked = false;
>>>                        }
>>>                } else {
>>>                        result = khugepaged_scan_pmd(mm, vma, addr,
>>
> 




* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
                     ` (3 preceding siblings ...)
  2025-02-19 16:52   ` Ryan Roberts
@ 2025-03-07  6:38   ` Dev Jain
  2025-03-07 20:14     ` Nico Pache
  4 siblings, 1 reply; 55+ messages in thread
From: Dev Jain @ 2025-03-07  6:38 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-trace-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm, rostedt,
	mathieu.desnoyers, tiwai



On 11/02/25 6:00 am, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential collapse candidates, keep track
> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> of max_ptes_none during the scan phase so we dont bailout early and miss
> potential mTHP candidates.
> 
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
> 
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>   1 file changed, 83 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c8048d9ec7fb..cd310989725b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;
>   	pgtable_t pgtable;
>   	struct folio *folio;
>   	spinlock_t *pmd_ptl, *pte_ptl;
>   	int result = SCAN_FAIL;
>   	struct vm_area_struct *vma;
>   	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
>   	/*
> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   		*mmap_locked = false;
>   	}
>   
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>   	if (result != SCAN_SUCCEED)
>   		goto out_nolock;
>   
>   	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>   	if (result != SCAN_SUCCEED) {
>   		mmap_read_unlock(mm);
>   		goto out_nolock;
> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   		 * released when it fails. So we jump out_nolock directly in
>   		 * that case.  Continuing to collapse causes inconsistency.
>   		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>   		if (result != SCAN_SUCCEED)
>   			goto out_nolock;
>   	}
>   
>   	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>   	/*
>   	 * Prevent all access to pagetables with the exception of
>   	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * mmap_lock.
>   	 */
>   	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>   	if (result != SCAN_SUCCEED)
>   		goto out_up_write;
>   	/* check if the pmd is still valid */
> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	vma_start_write(vma);
>   	anon_vma_lock_write(vma->anon_vma);
>   
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));
>   	mmu_notifier_invalidate_range_start(&range);
>   
>   	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>   	/*
>   	 * This removes any huge TLB entry from the CPU so we won't allow
>   	 * huge and small TLB entries for the same virtual address to
> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	mmu_notifier_invalidate_range_end(&range);
>   	tlb_remove_table_sync_one();
>   
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>   	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>   		spin_unlock(pte_ptl);
>   	} else {
>   		result = SCAN_PMD_NULL;
> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	anon_vma_unlock_write(vma->anon_vma);
>   
>   	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>   	pte_unmap(pte);
>   	if (unlikely(result != SCAN_SUCCEED))
>   		goto out_up_write;
> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * write.
>   	 */
>   	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { //mTHP
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);

Please add a BUG_ON(!pmd_none(*pmd)) here. I hit this a lot of times 
when I was generalizing for VMA size.

> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		spin_lock(pte_ptl);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +		spin_unlock(pte_ptl);
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	}
>   
>   	folio = NULL;
>   
> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   {
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
> +	int i;
>   	int result = SCAN_FAIL, referenced = 0;
>   	int none_or_zero = 0, shared = 0;
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	unsigned long _address;
> +	unsigned long enabled_orders;
>   	spinlock_t *ptl;
>   	int node = NUMA_NO_NODE, unmapped = 0;
>   	bool writable = false;
> -
> +	int chunk_none_count = 0;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
>   	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>   	if (result != SCAN_SUCCEED)
>   		goto out;
>   
> +	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>   	nodes_clear(cc->alloc_nmask);
>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, _address += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		if (i % MIN_MTHP_NR == 0)
> +			chunk_none_count = 0;
> +
> +		_pte = pte + i;
> +		_address = address + i * PAGE_SIZE;
>   		pte_t pteval = ptep_get(_pte);
>   		if (is_swap_pte(pteval)) {
>   			++unmapped;
> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   			}
>   		}
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			++chunk_none_count;
>   			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (userfaultfd_armed(vma)) {
>   				result = SCAN_EXCEED_NONE_PTE;
>   				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>   				goto out_unmap;
>   			}
> +			continue;
>   		}
>   		if (pte_uffd_wp(pteval)) {
>   			/*
> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>   								     address)))
>   			referenced++;
> +
> +		/*
> +		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> +		 * pages keep track of it in the bitmap for mTHP collapsing.
> +		 */
> +		if (chunk_none_count < scaled_none &&
> +			(i + 1) % MIN_MTHP_NR == 0)
> +			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>   	}
> +
>   	if (!writable) {
>   		result = SCAN_PAGE_RO;
>   	} else if (cc->is_khugepaged &&
> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   out_unmap:
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			tva_flags, THP_ORDERS_ALL_ANON);
> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> +			       mmap_locked, enabled_orders);
> +		if (result > 0)
> +			result = SCAN_SUCCEED;
> +		else
> +			result = SCAN_FAIL;
>   	}
>   out:
>   	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>   			fput(file);
>   			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>   				mmap_read_lock(mm);
> +				*mmap_locked = true;
>   				if (khugepaged_test_exit_or_disable(mm))
>   					goto end;
>   				result = collapse_pte_mapped_thp(mm, addr,
>   								 !cc->is_khugepaged);
>   				mmap_read_unlock(mm);
> +				*mmap_locked = false;
>   			}
>   		} else {
>   			result = khugepaged_scan_pmd(mm, vma, addr,




* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-03-07  6:38   ` Dev Jain
@ 2025-03-07 20:14     ` Nico Pache
  2025-03-10  4:17       ` Dev Jain
  0 siblings, 1 reply; 55+ messages in thread
From: Nico Pache @ 2025-03-07 20:14 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai

On Fri, Mar 7, 2025 at 1:39 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/02/25 6:00 am, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning a PMD range for potential collapse candidates, keep track
> > of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
> > utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
> > of max_ptes_none during the scan phase so we dont bailout early and miss
> > potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
> >   1 file changed, 83 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c8048d9ec7fb..cd310989725b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >   {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /*
> > @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
> > @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> > @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       pte_unmap(pte);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> > @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { //mTHP
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
>
> Please add a BUG_ON(!pmd_none(*pmd)) here. I hit this a lot of times
> when I was generalizing for VMA size.

The PMD has been cleared and is not reinstalled until the
pmd_populate(). The BUG_ON will always be hit if we do this.
>
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             spin_lock(pte_ptl);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +             spin_unlock(pte_ptl);
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     }
> >
> >       folio = NULL;
> >
> > @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >   {
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> > +     int i;
> >       int result = SCAN_FAIL, referenced = 0;
> >       int none_or_zero = 0, shared = 0;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long _address;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >       bool writable = false;
> > -
> > +     int chunk_none_count = 0;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, _address += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             if (i % MIN_MTHP_NR == 0)
> > +                     chunk_none_count = 0;
> > +
> > +             _pte = pte + i;
> > +             _address = address + i * PAGE_SIZE;
> >               pte_t pteval = ptep_get(_pte);
> >               if (is_swap_pte(pteval)) {
> >                       ++unmapped;
> > @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                     ++chunk_none_count;
> >                       ++none_or_zero;
> > -                     if (!userfaultfd_armed(vma) &&
> > -                         (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > -                             continue;
> > -                     } else {
> > +                     if (userfaultfd_armed(vma)) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                               goto out_unmap;
> >                       }
> > +                     continue;
> >               }
> >               if (pte_uffd_wp(pteval)) {
> >                       /*
> > @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                    folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
> >                                                                    address)))
> >                       referenced++;
> > +
> > +             /*
> > +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> > +              * pages keep track of it in the bitmap for mTHP collapsing.
> > +              */
> > +             if (chunk_none_count < scaled_none &&
> > +                     (i + 1) % MIN_MTHP_NR == 0)
> > +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
> >       }
> > +
> >       if (!writable) {
> >               result = SCAN_PAGE_RO;
> >       } else if (cc->is_khugepaged &&
> > @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >   out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > -             result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +                     tva_flags, THP_ORDERS_ALL_ANON);
> > +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > +                            mmap_locked, enabled_orders);
> > +             if (result > 0)
> > +                     result = SCAN_SUCCEED;
> > +             else
> > +                     result = SCAN_FAIL;
> >       }
> >   out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
> >                       fput(file);
> >                       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >                               mmap_read_lock(mm);
> > +                             *mmap_locked = true;
> >                               if (khugepaged_test_exit_or_disable(mm))
> >                                       goto end;
> >                               result = collapse_pte_mapped_thp(mm, addr,
> >                                                                !cc->is_khugepaged);
> >                               mmap_read_unlock(mm);
> > +                             *mmap_locked = false;
> >                       }
> >               } else {
> >                       result = khugepaged_scan_pmd(mm, vma, addr,
>




* Re: [RFC v2 7/9] khugepaged: add mTHP support
  2025-03-07 20:14     ` Nico Pache
@ 2025-03-10  4:17       ` Dev Jain
  0 siblings, 0 replies; 55+ messages in thread
From: Dev Jain @ 2025-03-10  4:17 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-trace-kernel, linux-mm, ryan.roberts,
	anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	sunnanyong, usamaarif642, audra, akpm, rostedt, mathieu.desnoyers,
	tiwai



On 08/03/25 1:44 am, Nico Pache wrote:
> On Fri, Mar 7, 2025 at 1:39 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 11/02/25 6:00 am, Nico Pache wrote:
>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>> While scanning a PMD range for potential collapse candidates, keep track
>>> of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
>>> utilized region of order MIN_MTHP_ORDER ptes. We remove the restriction
>>> of max_ptes_none during the scan phase so we dont bailout early and miss
>>> potential mTHP candidates.
>>>
>>> After the scan is complete we will perform binary recursion on the
>>> bitmap to determine which mTHP size would be most efficient to collapse
>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>> determine how full a THP must be to be eligible.
>>>
>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>> pages, we dont perform the collapse.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>    mm/khugepaged.c | 122 ++++++++++++++++++++++++++++++++----------------
>>>    1 file changed, 83 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index c8048d9ec7fb..cd310989725b 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1127,13 +1127,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>    {
>>>        LIST_HEAD(compound_pagelist);
>>>        pmd_t *pmd, _pmd;
>>> -     pte_t *pte;
>>> +     pte_t *pte, mthp_pte;
>>>        pgtable_t pgtable;
>>>        struct folio *folio;
>>>        spinlock_t *pmd_ptl, *pte_ptl;
>>>        int result = SCAN_FAIL;
>>>        struct vm_area_struct *vma;
>>>        struct mmu_notifier_range range;
>>> +     unsigned long _address = address + offset * PAGE_SIZE;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>>        /*
>>> @@ -1148,12 +1149,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                *mmap_locked = false;
>>>        }
>>>
>>> -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>> +     result = alloc_charge_folio(&folio, mm, cc, order);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out_nolock;
>>>
>>>        mmap_read_lock(mm);
>>> -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>>> +     *mmap_locked = true;
>>> +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>>        if (result != SCAN_SUCCEED) {
>>>                mmap_read_unlock(mm);
>>>                goto out_nolock;
>>> @@ -1171,13 +1173,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                 * released when it fails. So we jump out_nolock directly in
>>>                 * that case.  Continuing to collapse causes inconsistency.
>>>                 */
>>> -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>>> -                             referenced, HPAGE_PMD_ORDER);
>>> +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
>>> +                             referenced, order);
>>>                if (result != SCAN_SUCCEED)
>>>                        goto out_nolock;
>>>        }
>>>
>>>        mmap_read_unlock(mm);
>>> +     *mmap_locked = false;
>>>        /*
>>>         * Prevent all access to pagetables with the exception of
>>>         * gup_fast later handled by the ptep_clear_flush and the VM
>>> @@ -1187,7 +1190,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * mmap_lock.
>>>         */
>>>        mmap_write_lock(mm);
>>> -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
>>> +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out_up_write;
>>>        /* check if the pmd is still valid */
>>> @@ -1198,11 +1201,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        vma_start_write(vma);
>>>        anon_vma_lock_write(vma->anon_vma);
>>>
>>> -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -                             address + HPAGE_PMD_SIZE);
>>> +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>> +                             _address + (PAGE_SIZE << order));
>>>        mmu_notifier_invalidate_range_start(&range);
>>>
>>>        pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> +
>>>        /*
>>>         * This removes any huge TLB entry from the CPU so we won't allow
>>>         * huge and small TLB entries for the same virtual address to
>>> @@ -1216,10 +1220,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        mmu_notifier_invalidate_range_end(&range);
>>>        tlb_remove_table_sync_one();
>>>
>>> -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>>>        if (pte) {
>>> -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
>>> -                                     &compound_pagelist, HPAGE_PMD_ORDER);
>>> +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
>>> +                                     &compound_pagelist, order);
>>>                spin_unlock(pte_ptl);
>>>        } else {
>>>                result = SCAN_PMD_NULL;
>>> @@ -1248,8 +1252,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        anon_vma_unlock_write(vma->anon_vma);
>>>
>>>        result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>> -                                        vma, address, pte_ptl,
>>> -                                        &compound_pagelist, HPAGE_PMD_ORDER);
>>> +                                        vma, _address, pte_ptl,
>>> +                                        &compound_pagelist, order);
>>>        pte_unmap(pte);
>>>        if (unlikely(result != SCAN_SUCCEED))
>>>                goto out_up_write;
>>> @@ -1260,20 +1264,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * write.
>>>         */
>>>        __folio_mark_uptodate(folio);
>>> -     pgtable = pmd_pgtable(_pmd);
>>> -
>>> -     _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>> -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> -
>>> -     spin_lock(pmd_ptl);
>>> -     BUG_ON(!pmd_none(*pmd));
>>> -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>> -     folio_add_lru_vma(folio, vma);
>>> -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> -     set_pmd_at(mm, address, pmd, _pmd);
>>> -     update_mmu_cache_pmd(vma, address, pmd);
>>> -     deferred_split_folio(folio, false);
>>> -     spin_unlock(pmd_ptl);
>>> +     if (order == HPAGE_PMD_ORDER) {
>>> +             pgtable = pmd_pgtable(_pmd);
>>> +             _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>> +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> +
>>> +             spin_lock(pmd_ptl);
>>> +             BUG_ON(!pmd_none(*pmd));
>>> +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> +             folio_add_lru_vma(folio, vma);
>>> +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> +             set_pmd_at(mm, address, pmd, _pmd);
>>> +             update_mmu_cache_pmd(vma, address, pmd);
>>> +             deferred_split_folio(folio, false);
>>> +             spin_unlock(pmd_ptl);
>>> +     } else { //mTHP
>>> +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>>> +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>>> +
>>> +             spin_lock(pmd_ptl);
>>
>> Please add a BUG_ON(!pmd_none(*pmd)) here. I hit this a lot of times
>> when I was generalizing for VMA size.
> 
> The PMD has been cleared and is not reinstalled until the
> pmd_populate(). The BUG_ON will always be hit if we do this.

That's why it is !pmd_none(*pmd).
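
I.e. (sketch only), the check asserts the cleared state that should
still hold at this point:

	spin_lock(pmd_ptl);
	/*
	 * The PMD entry was cleared earlier (pmdp_collapse_flush())
	 * and is only re-installed by the pmd_populate() further
	 * down, so it must still be none here; the assertion catches
	 * anything that re-populated it behind our back.
	 */
	BUG_ON(!pmd_none(*pmd));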

>>
>>> +             folio_ref_add(folio, (1 << order) - 1);
>>> +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> +             folio_add_lru_vma(folio, vma);
>>> +             spin_lock(pte_ptl);
>>> +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
>>> +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
>>> +             spin_unlock(pte_ptl);
>>> +             smp_wmb(); /* make pte visible before pmd */
>>> +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> +             deferred_split_folio(folio, false);
>>> +             spin_unlock(pmd_ptl);
>>> +     }
>>>
>>>        folio = NULL;
>>>
>>> @@ -1353,21 +1374,27 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    {
>>>        pmd_t *pmd;
>>>        pte_t *pte, *_pte;
>>> +     int i;
>>>        int result = SCAN_FAIL, referenced = 0;
>>>        int none_or_zero = 0, shared = 0;
>>>        struct page *page = NULL;
>>>        struct folio *folio = NULL;
>>>        unsigned long _address;
>>> +     unsigned long enabled_orders;
>>>        spinlock_t *ptl;
>>>        int node = NUMA_NO_NODE, unmapped = 0;
>>>        bool writable = false;
>>> -
>>> +     int chunk_none_count = 0;
>>> +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - MIN_MTHP_ORDER);
>>> +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>>        result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>>        if (result != SCAN_SUCCEED)
>>>                goto out;
>>>
>>> +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>>> +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>>        memset(cc->node_load, 0, sizeof(cc->node_load));
>>>        nodes_clear(cc->alloc_nmask);
>>>        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>> @@ -1376,8 +1403,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                goto out;
>>>        }
>>>
>>> -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> -          _pte++, _address += PAGE_SIZE) {
>>> +     for (i = 0; i < HPAGE_PMD_NR; i++) {
>>> +             if (i % MIN_MTHP_NR == 0)
>>> +                     chunk_none_count = 0;
>>> +
>>> +             _pte = pte + i;
>>> +             _address = address + i * PAGE_SIZE;
>>>                pte_t pteval = ptep_get(_pte);
>>>                if (is_swap_pte(pteval)) {
>>>                        ++unmapped;
>>> @@ -1400,16 +1431,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                        }
>>>                }
>>>                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> +                     ++chunk_none_count;
>>>                        ++none_or_zero;
>>> -                     if (!userfaultfd_armed(vma) &&
>>> -                         (!cc->is_khugepaged ||
>>> -                          none_or_zero <= khugepaged_max_ptes_none)) {
>>> -                             continue;
>>> -                     } else {
>>> +                     if (userfaultfd_armed(vma)) {
>>>                                result = SCAN_EXCEED_NONE_PTE;
>>>                                count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>>                                goto out_unmap;
>>>                        }
>>> +                     continue;
>>>                }
>>>                if (pte_uffd_wp(pteval)) {
>>>                        /*
>>> @@ -1500,7 +1529,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>>>                                                                     address)))
>>>                        referenced++;
>>> +
>>> +             /*
>>> +              * we are reading in MIN_MTHP_NR page chunks. if there are no empty
>>> +              * pages keep track of it in the bitmap for mTHP collapsing.
>>> +              */
>>> +             if (chunk_none_count < scaled_none &&
>>> +                     (i + 1) % MIN_MTHP_NR == 0)
>>> +                     bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>>>        }
>>> +
>>>        if (!writable) {
>>>                result = SCAN_PAGE_RO;
>>>        } else if (cc->is_khugepaged &&
>>> @@ -1513,10 +1551,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    out_unmap:
>>>        pte_unmap_unlock(pte, ptl);
>>>        if (result == SCAN_SUCCEED) {
>>> -             result = collapse_huge_page(mm, address, referenced,
>>> -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>>> -             /* collapse_huge_page will return with the mmap_lock released */
>>> -             *mmap_locked = false;
>>> +             enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>>> +                     tva_flags, THP_ORDERS_ALL_ANON);
>>> +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
>>> +                            mmap_locked, enabled_orders);
>>> +             if (result > 0)
>>> +                     result = SCAN_SUCCEED;
>>> +             else
>>> +                     result = SCAN_FAIL;
>>>        }
>>>    out:
>>>        trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>>> @@ -2476,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>>>                        fput(file);
>>>                        if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>>                                mmap_read_lock(mm);
>>> +                             *mmap_locked = true;
>>>                                if (khugepaged_test_exit_or_disable(mm))
>>>                                        goto end;
>>>                                result = collapse_pte_mapped_thp(mm, addr,
>>>                                                                 !cc->is_khugepaged);
>>>                                mmap_read_unlock(mm);
>>> +                             *mmap_locked = false;
>>>                        }
>>>                } else {
>>>                        result = khugepaged_scan_pmd(mm, vma, addr,
>>
> 




end of thread, other threads:[~2025-03-10  4:18 UTC | newest]

Thread overview: 55+ messages
2025-02-11  0:30 [RFC v2 0/9] khugepaged: mTHP support Nico Pache
2025-02-11  0:30 ` [RFC v2 1/9] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-02-17 17:11   ` Usama Arif
2025-02-17 19:56     ` Nico Pache
2025-02-18 16:26   ` Ryan Roberts
2025-02-18 22:24     ` Nico Pache
2025-02-11  0:30 ` [RFC v2 2/9] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-02-18 16:29   ` Ryan Roberts
2025-02-11  0:30 ` [RFC v2 3/9] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-02-11  0:30 ` [RFC v2 4/9] khugepaged: generalize alloc_charge_folio " Nico Pache
2025-02-19 15:29   ` Ryan Roberts
2025-02-11  0:30 ` [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* " Nico Pache
2025-02-19 15:39   ` Ryan Roberts
2025-02-19 16:02     ` Nico Pache
2025-02-11  0:30 ` [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-02-17  7:27   ` Dev Jain
2025-02-17 19:12   ` Usama Arif
2025-02-19 16:28   ` Ryan Roberts
2025-02-20 18:48     ` Nico Pache
2025-02-11  0:30 ` [RFC v2 7/9] khugepaged: add " Nico Pache
2025-02-12 17:04   ` Usama Arif
2025-02-12 18:16     ` Nico Pache
2025-02-17 20:55   ` Usama Arif
2025-02-17 21:22     ` Nico Pache
2025-02-18  4:22   ` Dev Jain
2025-03-03 19:18     ` Nico Pache
2025-03-04  5:10       ` Dev Jain
2025-02-19 16:52   ` Ryan Roberts
2025-03-03 19:13     ` Nico Pache
2025-03-05  9:11       ` Dev Jain
2025-03-05  9:07     ` Dev Jain
2025-03-07  6:38   ` Dev Jain
2025-03-07 20:14     ` Nico Pache
2025-03-10  4:17       ` Dev Jain
2025-02-11  0:30 ` [RFC v2 8/9] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-02-11  0:30 ` [RFC v2 9/9] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-02-19 16:57   ` Ryan Roberts
2025-02-11 12:49 ` [RFC v2 0/9] khugepaged: mTHP support Dev Jain
2025-02-12 16:49   ` Nico Pache
2025-02-13  8:26     ` Dev Jain
2025-02-13 11:21       ` Dev Jain
2025-02-13 19:39       ` Nico Pache
2025-02-14  2:01         ` Dev Jain
2025-02-15  0:52           ` Nico Pache
2025-02-15  6:38             ` Dev Jain
2025-02-17  8:05               ` Dev Jain
2025-02-17 19:19                 ` Nico Pache
2025-02-17  6:39 ` Dev Jain
2025-02-17 19:15   ` Nico Pache
2025-02-18 16:07 ` Ryan Roberts
2025-02-18 22:30   ` Nico Pache
2025-02-19  9:01     ` Dev Jain
2025-02-20 19:12       ` Nico Pache
2025-02-21  4:57         ` Dev Jain
2025-02-19 17:00 ` Ryan Roberts
