* [PATCH v7 00/12] khugepaged: mTHP support
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
The following series provides khugepaged and madvise collapse with the
capability to collapse anonymous memory regions to mTHPs.
To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then, during the PMD scan, we keep track of which chunks of
pages (defined by KHUGEPAGED_MIN_MTHP_ORDER) are utilized. This info is
tracked using a bitmap. After the PMD scan is done, we do binary recursion
on the bitmap to find the optimal mTHP sizes for the PMD range. The
restriction on max_ptes_none is removed during the scan to make sure we
account for the whole PMD range. When no mTHP size is enabled, the legacy
behavior of khugepaged is maintained. max_ptes_none is scaled by the
attempted collapse order to determine how full a THP must be to be
eligible. If an mTHP collapse is attempted but the range contains
swapped-out or shared pages, we don't perform the collapse.
With the default max_ptes_none=511, the code keeps most of its original
behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep"
and constantly promote mTHPs to the next available size. This is due to
the fact that a collapse introduces at least 2x the number of utilized
pages, so a future scan will satisfy the threshold for the next order
once again.
Patch 1: Refactor/rename hpage_collapse
Patch 2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5: Generalize khugepaged functions for arbitrary orders
Patch 6-9: The mTHP patches
Patch 10-11: Tracing/stats
Patch 12: Documentation
---------
Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
while monitoring a number of stats and tracepoints. The code is
available here[1] (run in legacy mode for these changes and set mTHP
sizes to inherit)
The summary from my testing was that there was no significant
regression noticed through this test. In some cases my changes had
better collapse latencies, and were able to scan more pages in the same
amount of time/work, but for the most part the results were consistent.
- Redis testing. I tested these changes along with my defer changes
(see followup post for more details).
- some basic testing on 64k page size.
- lots of general use.
V7 Changes:
- Don't release the anon_vma_lock early (like in the PMD case), as not all
pages are isolated.
- Define the PTE as NULL to avoid an uninitialized condition
- minor nits and newline cleanup
- make sure to unmap and unlock the pte for the swapin case
- change the revalidation to always check the PMD order (as this will make
sure that no other VMA spans it)
V5 Changes [2]:
- switched the order of patches 1 and 2
- fixed some edge cases on the unified madvise_collapse and khugepaged
- Explained the "creep" some more in the docs
- fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
- fix potential highmem issue caused by an early unmap of the PTE
V4 Changes:
- Rebased onto mm-unstable
- small changes to Documentation
V3 Changes:
- corrected legacy behavior for khugepaged and madvise_collapse
- added proper mTHP stat tracking
- Minor changes to prevent a nested lock on non-split-lock arches
- Took Devs version of alloc_charge_folio as it has the proper stats
- Skip cases where trying to collapse to a lower order would still fail
- Fixed cases where the bitmap was not being updated properly
- Moved Documentation update to this series instead of the defer set
- Minor bugs discovered during testing and review
- Minor "nit" cleanup
V2 Changes:
- Minor bug fixes discovered during review and testing
- removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
- Updated trace events to include collapsing order info.
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
- No longer require a chunk to be fully utilized before setting the bit.
Use the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps
prevent some of the "creep" that was discovered in v1.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/all/20250428181218.85925-1-npache@redhat.com/
Dev Jain (1):
khugepaged: generalize alloc_charge_folio()
Nico Pache (11):
khugepaged: rename hpage_collapse_* to khugepaged_*
introduce khugepaged_collapse_single_pmd to unify khugepaged and
madvise_collapse
khugepaged: generalize hugepage_vma_revalidate for mTHP support
khugepaged: generalize __collapse_huge_page_* for mTHP support
khugepaged: introduce khugepaged_scan_bitmap for mTHP support
khugepaged: add mTHP support
khugepaged: skip collapsing mTHP to smaller orders
khugepaged: avoid unnecessary mTHP collapse attempts
khugepaged: improve tracepoints for mTHP orders
khugepaged: add per-order mTHP khugepaged stats
Documentation: mm: update the admin guide for mTHP collapse
Documentation/admin-guide/mm/transhuge.rst | 14 +-
include/linux/huge_mm.h | 5 +
include/linux/khugepaged.h | 4 +
include/trace/events/huge_memory.h | 34 +-
mm/huge_memory.c | 11 +
mm/khugepaged.c | 472 ++++++++++++++-------
6 files changed, 382 insertions(+), 158 deletions(-)
--
2.49.0
* [PATCH v7 01/12] khugepaged: rename hpage_collapse_* to khugepaged_*
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
Functions in khugepaged.c use a mix of hpage_collapse and khugepaged
as the function prefix.
Rename all of them to khugepaged to keep things consistent and to
slightly shorten the function names.
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 42 +++++++++++++++++++++---------------------
1 file changed, 21 insertions(+), 21 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdf5a581368b..806bcd8c5185 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
kmem_cache_destroy(mm_slot_cache);
}
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int khugepaged_test_exit(struct mm_struct *mm)
{
return atomic_read(&mm->mm_users) == 0;
}
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
{
- return hpage_collapse_test_exit(mm) ||
+ return khugepaged_test_exit(mm) ||
test_bit(MMF_DISABLE_THP, &mm->flags);
}
@@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
int wakeup;
/* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+ VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
return;
@@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
} else if (mm_slot) {
/*
* This is required to serialize against
- * hpage_collapse_test_exit() (which is guaranteed to run
+ * khugepaged_test_exit() (which is guaranteed to run
* under mmap sem read mode). Stop here (after we return all
* pagetables will be destroyed) until khugepaged has finished
* working on the pagetables under the mmap_lock.
@@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
.is_khugepaged = true,
};
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
{
int i;
@@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
}
#ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
{
int nid, target_node = 0, max_value = 0;
@@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
return target_node;
}
#else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
{
return 0;
}
@@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
struct vm_area_struct *vma;
unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(khugepaged_test_exit_or_disable(mm)))
return SCAN_ANY_PROCESS;
*vmap = vma = find_vma(mm, address);
@@ -992,7 +992,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
/*
* Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1078,7 +1078,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
- int node = hpage_collapse_find_target_node(cc);
+ int node = khugepaged_find_target_node(cc);
struct folio *folio;
folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1264,7 +1264,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
return result;
}
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+static int khugepaged_scan_pmd(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, bool *mmap_locked,
struct collapse_control *cc)
@@ -1378,7 +1378,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
+ if (khugepaged_scan_abort(node, cc)) {
result = SCAN_SCAN_ABORT;
goto out_unmap;
}
@@ -1447,7 +1447,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit(mm)) {
+ if (khugepaged_test_exit(mm)) {
/* free mm_slot */
hash_del(&slot->hash);
list_del(&slot->mm_node);
@@ -1740,7 +1740,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
continue;
- if (hpage_collapse_test_exit(mm))
+ if (khugepaged_test_exit(mm))
continue;
/*
* When a vma is registered with uffd-wp, we cannot recycle
@@ -2262,7 +2262,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
struct file *file, pgoff_t start,
struct collapse_control *cc)
{
@@ -2307,7 +2307,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
}
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
+ if (khugepaged_scan_abort(node, cc)) {
result = SCAN_SCAN_ABORT;
break;
}
@@ -2391,7 +2391,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
goto breakouterloop_mmap_lock;
progress++;
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(khugepaged_test_exit_or_disable(mm)))
goto breakouterloop;
vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2399,7 +2399,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
unsigned long hstart, hend;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ if (unlikely(khugepaged_test_exit_or_disable(mm))) {
progress++;
break;
}
@@ -2421,7 +2421,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
bool mmap_locked = true;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(khugepaged_test_exit_or_disable(mm)))
goto breakouterloop;
VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2481,7 +2481,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
* Release the current mm_slot if this mm is about to die, or
* if we scanned all vmas of this mm.
*/
- if (hpage_collapse_test_exit(mm) || !vma) {
+ if (khugepaged_test_exit(mm) || !vma) {
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
--
2.49.0
* [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.
Create khugepaged_collapse_single_pmd to increase code
reuse and create an entry point for future khugepaged changes.
Refactor madvise_collapse and khugepaged_scan_mm_slot to use
the new khugepaged_collapse_single_pmd function.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
1 file changed, 49 insertions(+), 47 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 806bcd8c5185..5457571d505a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
return result;
}
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int khugepaged_collapse_single_pmd(unsigned long addr,
+ struct vm_area_struct *vma, bool *mmap_locked,
+ struct collapse_control *cc)
+{
+ int result = SCAN_FAIL;
+ struct mm_struct *mm = vma->vm_mm;
+
+ if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
+ struct file *file = get_file(vma->vm_file);
+ pgoff_t pgoff = linear_page_index(vma, addr);
+
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ result = khugepaged_scan_file(mm, addr, file, pgoff, cc);
+ fput(file);
+ if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+ mmap_read_lock(mm);
+ *mmap_locked = true;
+ if (khugepaged_test_exit_or_disable(mm)) {
+ result = SCAN_ANY_PROCESS;
+ goto end;
+ }
+ result = collapse_pte_mapped_thp(mm, addr,
+ !cc->is_khugepaged);
+ if (result == SCAN_PMD_MAPPED)
+ result = SCAN_SUCCEED;
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ }
+ } else {
+ result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc);
+ }
+ if (cc->is_khugepaged && result == SCAN_SUCCEED)
+ ++khugepaged_pages_collapsed;
+end:
+ return result;
+}
+
static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
struct collapse_control *cc)
__releases(&khugepaged_mm_lock)
@@ -2427,34 +2469,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
- if (!vma_is_anonymous(vma)) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma,
- khugepaged_scan.address);
-
- mmap_read_unlock(mm);
- mmap_locked = false;
- *result = hpage_collapse_scan_file(mm,
- khugepaged_scan.address, file, pgoff, cc);
- fput(file);
- if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
- mmap_read_lock(mm);
- if (hpage_collapse_test_exit_or_disable(mm))
- goto breakouterloop;
- *result = collapse_pte_mapped_thp(mm,
- khugepaged_scan.address, false);
- if (*result == SCAN_PMD_MAPPED)
- *result = SCAN_SUCCEED;
- mmap_read_unlock(mm);
- }
- } else {
- *result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
- }
-
- if (*result == SCAN_SUCCEED)
- ++khugepaged_pages_collapsed;
+ *result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
+ vma, &mmap_locked, cc);
+ /* If we return SCAN_ANY_PROCESS we are holding the mmap_lock */
+ if (*result == SCAN_ANY_PROCESS)
+ goto breakouterloop;
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
progress += HPAGE_PMD_NR;
@@ -2773,36 +2793,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
mmap_assert_locked(mm);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
- if (!vma_is_anonymous(vma)) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma, addr);
- mmap_read_unlock(mm);
- mmap_locked = false;
- result = hpage_collapse_scan_file(mm, addr, file, pgoff,
- cc);
- fput(file);
- } else {
- result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
- }
+ result = khugepaged_collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
if (!mmap_locked)
*prev = NULL; /* Tell caller we dropped mmap_lock */
-handle_result:
switch (result) {
case SCAN_SUCCEED:
case SCAN_PMD_MAPPED:
++thps;
break;
case SCAN_PTE_MAPPED_HUGEPAGE:
- BUG_ON(mmap_locked);
- BUG_ON(*prev);
- mmap_read_lock(mm);
- result = collapse_pte_mapped_thp(mm, addr, true);
- mmap_read_unlock(mm);
- goto handle_result;
- /* Whitelisted set of results where continuing OK */
case SCAN_PMD_NULL:
case SCAN_PTE_NON_PRESENT:
case SCAN_PTE_UFFD_WP:
--
2.49.0
* [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
For khugepaged to support different mTHP orders, we must generalize
hugepage_vma_revalidate() to check that the PMD is not shared by another
VMA and that the requested order is enabled.
No functional change in this patch.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5457571d505a..0c4d6a02d59c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
bool expect_anon,
struct vm_area_struct **vmap,
- struct collapse_control *cc)
+ struct collapse_control *cc, int order)
{
struct vm_area_struct *vma;
unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
@@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -2782,7 +2782,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
mmap_read_lock(mm);
mmap_locked = true;
result = hugepage_vma_revalidate(mm, addr, false, &vma,
- cc);
+ cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED) {
last_fail = result;
goto out_nolock;
--
2.49.0
* [PATCH v7 04/12] khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
From: Dev Jain <dev.jain@arm.com>
Pass order to alloc_charge_folio() and update mTHP statistics.
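Assuming the existing per-size mTHP sysfs layout, the new counters should
then appear for each enabled size, e.g.:

    /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_alloc
    /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_alloc_failed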
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/khugepaged.c | 17 +++++++++++------
3 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..0bb65bd4e6dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -123,6 +123,8 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_COLLAPSE_ALLOC,
+ MTHP_STAT_COLLAPSE_ALLOC_FAILED,
MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPIN,
MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..177f0a78666a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -680,6 +682,8 @@ static struct attribute *any_stats_attrs[] = {
#endif
&split_attr.attr,
&split_failed_attr.attr,
+ &collapse_alloc_attr.attr,
+ &collapse_alloc_failed_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0c4d6a02d59c..cf94ccdfe751 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,21 +1074,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
- struct collapse_control *cc)
+ struct collapse_control *cc, u8 order)
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
int node = khugepaged_find_target_node(cc);
struct folio *folio;
- folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+ folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
if (!folio) {
*foliop = NULL;
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
return SCAN_ALLOC_HUGE_PAGE_FAIL;
}
- count_vm_event(THP_COLLAPSE_ALLOC);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
folio_put(folio);
*foliop = NULL;
@@ -1125,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*/
mmap_read_unlock(mm);
- result = alloc_charge_folio(&folio, mm, cc);
+ result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
@@ -1847,7 +1852,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
- result = alloc_charge_folio(&new_folio, mm, cc);
+ result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out;
--
2.49.0
* [PATCH v7 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
Generalize the order of the __collapse_huge_page_* functions to
support future mTHP collapse.
mTHP collapse can suffer from inconsistent behavior and memory waste
"creep". Disable swapin and shared support for mTHP collapse.
No functional changes in this patch.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 48 ++++++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 18 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cf94ccdfe751..2af8f50855d4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long address,
pte_t *pte,
struct collapse_control *cc,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ u8 order)
{
struct page *page = NULL;
struct folio *folio = NULL;
pte_t *_pte;
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
bool writable = false;
+ int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + (1 << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (pte_none(pteval) || (pte_present(pteval) &&
@@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= scaled_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
/* See hpage_collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
- if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
+ shared > khugepaged_max_ptes_shared)) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out;
@@ -711,13 +713,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
struct vm_area_struct *vma,
unsigned long address,
spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ u8 order)
{
struct folio *src, *tmp;
pte_t *_pte;
pte_t pteval;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + (1 << order);
_pte++, address += PAGE_SIZE) {
pteval = ptep_get(_pte);
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
@@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
pmd_t *pmd,
pmd_t orig_pmd,
struct vm_area_struct *vma,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ u8 order)
{
spinlock_t *pmd_ptl;
@@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
* Release both raw and compound pages isolated
* in __collapse_huge_page_isolate.
*/
- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+ release_pte_pages(pte, pte + (1 << order), compound_pagelist);
}
/*
@@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
unsigned long address, spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, u8 order)
{
unsigned int i;
int result = SCAN_SUCCEED;
@@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
- for (i = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0; i < (1 << order); i++) {
pte_t pteval = ptep_get(pte + i);
struct page *page = folio_page(folio, i);
unsigned long src_addr = address + i * PAGE_SIZE;
@@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
if (likely(result == SCAN_SUCCEED))
__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
- compound_pagelist);
+ compound_pagelist, order);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
- compound_pagelist);
+ compound_pagelist, order);
return result;
}
@@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
- int referenced)
+ int referenced, u8 order)
{
int swapped_in = 0;
vm_fault_t ret = 0;
- unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ unsigned long address, end = haddr + (PAGE_SIZE << order);
int result;
pte_t *pte = NULL;
spinlock_t *ptl;
@@ -1035,6 +1039,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
if (!is_swap_pte(vmf.orig_pte))
continue;
+ /* Don't swap in for mTHP collapse */
+ if (order != HPAGE_PMD_ORDER) {
+ pte_unmap(pte);
+ mmap_read_unlock(mm);
+ result = SCAN_EXCEED_SWAP_PTE;
+ goto out;
+ }
+
vmf.pte = pte;
vmf.ptl = ptl;
ret = do_swap_page(&vmf);
@@ -1154,7 +1166,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* that case. Continuing to collapse causes inconsistency.
*/
result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced);
+ referenced, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
@@ -1201,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
+ &compound_pagelist, HPAGE_PMD_ORDER);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
@@ -1231,7 +1243,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, address, pte_ptl,
- &compound_pagelist);
+ &compound_pagelist, HPAGE_PMD_ORDER);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
goto out_up_write;
--
2.49.0
* [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
khugepaged scans an anon PMD range for potential collapse to a hugepage.
To add mTHP support, we use this scan to instead record which chunks of
the PMD are utilized.
khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
that represents chunks of utilized regions. We can then determine what
mTHP size fits best and, in the following patch, we set this bitmap while
scanning the anon PMD. A minimum collapse order of 2 is used, as this is
the lowest order supported by anon memory.
max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.
When attempting to collapse an order that is set to "always", always
collapse to that order in a greedy manner, without considering the
number of bits set.
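As a sketch of the threshold math (mirroring the computation in
khugepaged_scan_bitmap() below; the 4K page geometry is an assumption):
each bit covers one order-KHUGEPAGED_MIN_MTHP_ORDER chunk, so a full PMD
window holds 512 / 4 = 128 bits, and a window is collapsed only when

    bits_set > (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
                    >> (HPAGE_PMD_ORDER - state.order)

With max_ptes_none = 255, the full-PMD window (state.order = 7) needs
more than (512 - 255 - 1) >> 2 = 64 of its 128 bits set; with the
PMD-only default of 511 the threshold is 0, so any utilized chunk
qualifies.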
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/linux/khugepaged.h | 4 ++
mm/khugepaged.c | 94 ++++++++++++++++++++++++++++++++++----
2 files changed, 89 insertions(+), 9 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index b8d69cfbb58b..b6e5ba1fae58 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,6 +1,10 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_KHUGEPAGED_H
#define _LINUX_KHUGEPAGED_H
+#define KHUGEPAGED_MIN_MTHP_ORDER 2
+#define KHUGEPAGED_MIN_MTHP_NR (1<<KHUGEPAGED_MIN_MTHP_ORDER)
+#define MAX_MTHP_BITMAP_SIZE (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+#define MTHP_BITMAP_SIZE (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
extern unsigned int khugepaged_max_ptes_none __read_mostly;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2af8f50855d4..044fec869b50 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
+struct scan_bit_state {
+ u8 order;
+ u16 offset;
+};
+
struct collapse_control {
bool is_khugepaged;
@@ -102,6 +107,18 @@ struct collapse_control {
/* nodemask for allocation fallback */
nodemask_t alloc_nmask;
+
+ /*
+ * bitmap used to collapse mTHP sizes.
+ * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
+ */
+ DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+ DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
+ struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
+};
+
+struct collapse_control khugepaged_collapse_control = {
+ .is_khugepaged = true,
};
/**
@@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
remove_wait_queue(&khugepaged_wait, &wait);
}
-struct collapse_control khugepaged_collapse_control = {
- .is_khugepaged = true,
-};
-
static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
{
int i;
@@ -1120,7 +1133,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
int referenced, int unmapped,
- struct collapse_control *cc)
+ struct collapse_control *cc, bool *mmap_locked,
+ u8 order, u16 offset)
{
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
@@ -1139,8 +1153,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* The allocation can take potentially a long time if it involves
* sync compaction, and we do not need to hold the mmap_lock during
* that. We will recheck the vma after taking it again in write mode.
+ * If collapsing mTHPs we may have already released the read_lock.
*/
- mmap_read_unlock(mm);
+ if (*mmap_locked) {
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ }
result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
@@ -1275,12 +1293,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
out_up_write:
mmap_write_unlock(mm);
out_nolock:
+ *mmap_locked = false;
if (folio)
folio_put(folio);
trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
return result;
}
+/* Recursive function to consume the bitmap */
+static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
+ int referenced, int unmapped, struct collapse_control *cc,
+ bool *mmap_locked, unsigned long enabled_orders)
+{
+ u8 order, next_order;
+ u16 offset, mid_offset;
+ int num_chunks;
+ int bits_set, threshold_bits;
+ int top = -1;
+ int collapsed = 0;
+ int ret;
+ struct scan_bit_state state;
+ bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
+ cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+ { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
+
+ while (top >= 0) {
+ state = cc->mthp_bitmap_stack[top--];
+ order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
+ offset = state.offset;
+ num_chunks = 1 << (state.order);
+ // Skip mTHP orders that are not enabled
+ if (!test_bit(order, &enabled_orders))
+ goto next;
+
+ // copy the relevant section to a new bitmap
+ bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
+ MTHP_BITMAP_SIZE);
+
+ bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
+ threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
+ >> (HPAGE_PMD_ORDER - state.order);
+
+ // Check if the region is "almost full" based on the threshold
+ if (bits_set > threshold_bits || is_pmd_only
+ || test_bit(order, &huge_anon_orders_always)) {
+ ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
+ mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
+ if (ret == SCAN_SUCCEED) {
+ collapsed += (1 << order);
+ continue;
+ }
+ }
+
+next:
+ if (state.order > 0) {
+ next_order = state.order - 1;
+ mid_offset = offset + (num_chunks / 2);
+ cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+ { next_order, mid_offset };
+ cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+ { next_order, offset };
+ }
+ }
+ return collapsed;
+}
+
static int khugepaged_scan_pmd(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, bool *mmap_locked,
@@ -1447,9 +1525,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, address, referenced,
- unmapped, cc);
- /* collapse_huge_page will return with the mmap_lock released */
- *mmap_locked = false;
+ unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
}
out:
trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
--
2.49.0
* [PATCH v7 07/12] khugepaged: add mTHP support
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
mTHPs are enabled we remove the restriction of max_ptes_none during the
scan phase so we don't bail out early and miss potential mTHP candidates.
After the scan is complete we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.
If an mTHP collapse is attempted, but the range contains swapped out or
shared pages, we don't perform the collapse.
For non-PMD collapse we must leave the anon VMA write-locked until after
we collapse the mTHP.
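As a worked example of the geometry (4K pages assumed): the scan walks
the 512 PTEs in chunks of KHUGEPAGED_MIN_MTHP_NR = 4 and sets one bit per
sufficiently utilized chunk, giving 128 bits per PMD. A later order-4
collapse attempt at bitmap offset 4 then translates to:

    pte offset = 4 * KHUGEPAGED_MIN_MTHP_NR = 16
    _address   = address + 16 * PAGE_SIZE
    set_ptes(vma->vm_mm, _address, pte, mthp_pte, 1 << 4)  /* 16 PTEs */

so 16 PTEs are installed and the PMD is repopulated with the existing
page table rather than a huge PMD entry.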
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 136 +++++++++++++++++++++++++++++++++---------------
1 file changed, 95 insertions(+), 41 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 044fec869b50..afad75fc01b7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1138,13 +1138,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
{
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
- pte_t *pte;
+ pte_t *pte = NULL, mthp_pte;
pgtable_t pgtable;
struct folio *folio;
spinlock_t *pmd_ptl, *pte_ptl;
int result = SCAN_FAIL;
struct vm_area_struct *vma;
struct mmu_notifier_range range;
+ unsigned long _address = address + offset * PAGE_SIZE;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -1160,12 +1161,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*mmap_locked = false;
}
- result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+ result = alloc_charge_folio(&folio, mm, cc, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+ *mmap_locked = true;
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1183,13 +1185,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* released when it fails. So we jump out_nolock directly in
* that case. Continuing to collapse causes inconsistency.
*/
- result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced, HPAGE_PMD_ORDER);
+ result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+ referenced, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
mmap_read_unlock(mm);
+ *mmap_locked = false;
/*
* Prevent all access to pagetables with the exception of
* gup_fast later handled by the ptep_clear_flush and the VM
@@ -1199,7 +1202,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -1210,11 +1213,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
- address + HPAGE_PMD_SIZE);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+ _address + (PAGE_SIZE << order));
mmu_notifier_invalidate_range_start(&range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
/*
* This removes any huge TLB entry from the CPU so we won't allow
* huge and small TLB entries for the same virtual address to
@@ -1228,18 +1232,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();
- pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
if (pte) {
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist, HPAGE_PMD_ORDER);
+ result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+ &compound_pagelist, order);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
}
if (unlikely(result != SCAN_SUCCEED)) {
- if (pte)
- pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1254,17 +1256,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
}
/*
- * All pages are isolated and locked so anon_vma rmap
- * can't run anymore.
+ * For PMD collapse all pages are isolated and locked so anon_vma
+ * rmap can't run anymore
*/
- anon_vma_unlock_write(vma->anon_vma);
+ if (order == HPAGE_PMD_ORDER)
+ anon_vma_unlock_write(vma->anon_vma);
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
- vma, address, pte_ptl,
- &compound_pagelist, HPAGE_PMD_ORDER);
- pte_unmap(pte);
+ vma, _address, pte_ptl,
+ &compound_pagelist, order);
if (unlikely(result != SCAN_SUCCEED))
- goto out_up_write;
+ goto out_unlock_anon_vma;
/*
* The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1272,25 +1274,45 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* write.
*/
__folio_mark_uptodate(folio);
- pgtable = pmd_pgtable(_pmd);
-
- _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
- _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
- folio_add_lru_vma(folio, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, address, pmd, _pmd);
- update_mmu_cache_pmd(vma, address, pmd);
- deferred_split_folio(folio, false);
- spin_unlock(pmd_ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ pgtable = pmd_pgtable(_pmd);
+ _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
+ spin_unlock(pmd_ptl);
+ } else { /* mTHP collapse */
+ mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+ mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+ spin_lock(pmd_ptl);
+ folio_ref_add(folio, (1 << order) - 1);
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+ update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ spin_unlock(pmd_ptl);
+ }
folio = NULL;
result = SCAN_SUCCEED;
+out_unlock_anon_vma:
+ if (order != HPAGE_PMD_ORDER)
+ anon_vma_unlock_write(vma->anon_vma);
out_up_write:
+ if (pte)
+ pte_unmap(pte);
mmap_write_unlock(mm);
out_nolock:
*mmap_locked = false;
@@ -1366,31 +1388,57 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
+ int i;
int result = SCAN_FAIL, referenced = 0;
int none_or_zero = 0, shared = 0;
struct page *page = NULL;
struct folio *folio = NULL;
unsigned long _address;
+ unsigned long enabled_orders;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
+ bool is_pmd_only;
bool writable = false;
-
+ int chunk_none_count = 0;
+ int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
+ unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, address, &pmd);
if (result != SCAN_SUCCEED)
goto out;
+ bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+ bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
+ enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ tva_flags, THP_ORDERS_ALL_ANON);
+
+ is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte) {
result = SCAN_PMD_NULL;
goto out;
}
- for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
- _pte++, _address += PAGE_SIZE) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ /*
+ * We are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. If
+ * there are pages in this chunk, keep track of it in the bitmap
+ * for mTHP collapsing.
+ */
+ if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
+ if (chunk_none_count <= scaled_none)
+ bitmap_set(cc->mthp_bitmap,
+ i / KHUGEPAGED_MIN_MTHP_NR, 1);
+ chunk_none_count = 0;
+ }
+
+ _pte = pte + i;
+ _address = address + i * PAGE_SIZE;
pte_t pteval = ptep_get(_pte);
if (is_swap_pte(pteval)) {
++unmapped;
@@ -1413,10 +1461,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
}
}
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ ++chunk_none_count;
++none_or_zero;
if (!userfaultfd_armed(vma) &&
- (!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ (!cc->is_khugepaged || !is_pmd_only ||
+ none_or_zero <= khugepaged_max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -1512,6 +1561,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
address)))
referenced++;
}
+
if (!writable) {
result = SCAN_PAGE_RO;
} else if (cc->is_khugepaged &&
@@ -1524,8 +1574,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
- result = collapse_huge_page(mm, address, referenced,
- unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
+ result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
+ mmap_locked, enabled_orders);
+ if (result > 0)
+ result = SCAN_SUCCEED;
+ else
+ result = SCAN_FAIL;
}
out:
trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
--
2.49.0
* [PATCH v7 08/12] khugepaged: skip collapsing mTHP to smaller orders
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
khugepaged may try to collapse an mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
if it's OK to collapse to a smaller mTHP size (like in the case of a
partially mapped folio).
This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index afad75fc01b7..5920d4715a11 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -625,7 +625,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
- /* See hpage_collapse_scan_pmd(). */
+ if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+ result = SCAN_PTE_MAPPED_HUGEPAGE;
+ goto out;
+ }
+
+ /* See khugepaged_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
--
2.49.0
* [PATCH v7 09/12] khugepaged: avoid unnecessary mTHP collapse attempts
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5920d4715a11..517cf2b271d7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1371,6 +1371,23 @@ static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
collapsed += (1 << order);
continue;
}
+ /*
+ * Some ret values indicate that all lower orders will
+ * also fail, so don't try to collapse smaller orders.
+ */
+ if (ret == SCAN_EXCEED_NONE_PTE ||
+ ret == SCAN_EXCEED_SWAP_PTE ||
+ ret == SCAN_EXCEED_SHARED_PTE ||
+ ret == SCAN_PTE_NON_PRESENT ||
+ ret == SCAN_PTE_UFFD_WP ||
+ ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
+ ret == SCAN_CGROUP_CHARGE_FAIL ||
+ ret == SCAN_COPY_MC ||
+ ret == SCAN_PAGE_LOCK ||
+ ret == SCAN_PAGE_COUNT)
+ goto next;
+ else
+ break;
}
next:
--
2.49.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH v7 10/12] khugepaged: improve tracepoints for mTHP orders
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
` (8 preceding siblings ...)
2025-05-15 3:22 ` [PATCH v7 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-05-15 3:22 ` Nico Pache
2025-05-15 3:22 ` [PATCH v7 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
` (4 subsequent siblings)
14 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
Add the order to the tracepoints to give better insight into what order
khugepaged is operating on.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
mm/khugepaged.c | 10 +++++----
2 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 2305df6cb485..70661bbf676f 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -92,34 +92,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
TRACE_EVENT(mm_collapse_huge_page,
- TP_PROTO(struct mm_struct *mm, int isolated, int status),
+ TP_PROTO(struct mm_struct *mm, int isolated, int status, int order),
- TP_ARGS(mm, isolated, status),
+ TP_ARGS(mm, isolated, status, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, isolated)
__field(int, status)
+ __field(int, order)
),
TP_fast_assign(
__entry->mm = mm;
__entry->isolated = isolated;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("mm=%p, isolated=%d, status=%s",
+ TP_printk("mm=%p, isolated=%d, status=%s order=%d",
__entry->mm,
__entry->isolated,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_isolate,
TP_PROTO(struct folio *folio, int none_or_zero,
- int referenced, bool writable, int status),
+ int referenced, bool writable, int status, int order),
- TP_ARGS(folio, none_or_zero, referenced, writable, status),
+ TP_ARGS(folio, none_or_zero, referenced, writable, status, order),
TP_STRUCT__entry(
__field(unsigned long, pfn)
@@ -127,6 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__field(int, referenced)
__field(bool, writable)
__field(int, status)
+ __field(int, order)
),
TP_fast_assign(
@@ -135,27 +139,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__entry->referenced = referenced;
__entry->writable = writable;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
+ TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s order=%d",
__entry->pfn,
__entry->none_or_zero,
__entry->referenced,
__entry->writable,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_swapin,
- TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+ TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+ int order),
- TP_ARGS(mm, swapped_in, referenced, ret),
+ TP_ARGS(mm, swapped_in, referenced, ret, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, swapped_in)
__field(int, referenced)
__field(int, ret)
+ __field(int, order)
),
TP_fast_assign(
@@ -163,13 +171,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
__entry->swapped_in = swapped_in;
__entry->referenced = referenced;
__entry->ret = ret;
+ __entry->order = order;
),
- TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+ TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%d",
__entry->mm,
__entry->swapped_in,
__entry->referenced,
- __entry->ret)
+ __entry->ret,
+ __entry->order)
);
TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 517cf2b271d7..951c44778f56 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -721,13 +721,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
} else {
result = SCAN_SUCCEED;
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, writable, result);
+ referenced, writable, result,
+ order);
return result;
}
out:
release_pte_pages(pte, _pte, compound_pagelist);
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, writable, result);
+ referenced, writable, result, order);
return result;
}
@@ -1099,7 +1100,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
result = SCAN_SUCCEED;
out:
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+ order);
return result;
}
@@ -1323,7 +1325,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*mmap_locked = false;
if (folio)
folio_put(folio);
- trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+ trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
return result;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
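As a usage note for the above: the extended tracepoints can be watched
through the standard tracefs interface (these are the usual tracefs
paths, not something added by this series):

[root]# echo 1 > /sys/kernel/tracing/events/huge_memory/mm_collapse_huge_page/enable
[root]# cat /sys/kernel/tracing/trace_pipe

Each mm_collapse_huge_page event should then carry the new order= field
alongside mm=, isolated=, and status=.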
* [PATCH v7 11/12] khugepaged: add per-order mTHP khugepaged stats
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
` (9 preceding siblings ...)
2025-05-15 3:22 ` [PATCH v7 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-05-15 3:22 ` Nico Pache
2025-05-15 3:22 ` [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
` (3 subsequent siblings)
14 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
With mTHP support in place, let's add the per-order mTHP stats for
exceeding NONE, SWAP, and SHARED.
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 7 +++++++
mm/khugepaged.c | 15 ++++++++++++---
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0bb65bd4e6dd..e3d15c737008 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -139,6 +139,9 @@ enum mthp_stat_item {
MTHP_STAT_SPLIT_DEFERRED,
MTHP_STAT_NR_ANON,
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+ MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+ MTHP_STAT_COLLAPSE_EXCEED_NONE,
+ MTHP_STAT_COLLAPSE_EXCEED_SHARED,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 177f0a78666a..700988a0d5cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,6 +633,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
static struct attribute *anon_stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -649,6 +653,9 @@ static struct attribute *anon_stats_attrs[] = {
&split_deferred_attr.attr,
&nr_anon_attr.attr,
&nr_anon_partially_mapped_attr.attr,
+ &collapse_exceed_swap_pte_attr.attr,
+ &collapse_exceed_none_pte_attr.attr,
+ &collapse_exceed_shared_pte_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 951c44778f56..0723b184c7a4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -604,7 +604,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ else
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
goto out;
}
}
@@ -633,8 +636,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
/* See khugepaged_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
- if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared)) {
+ if (order != HPAGE_PMD_ORDER) {
+ result = SCAN_EXCEED_SHARED_PTE;
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+ goto out;
+ }
+
+ if (cc->is_khugepaged &&
+ shared > khugepaged_max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out;
--
2.49.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
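Usage note: with this patch applied, the three new counters should show
up next to the existing per-size mTHP stats in sysfs; the 64kB directory
below is just an example, every mTHP size directory gets the same files:

[root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_none_pte
0
[root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_shared_pte
0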
* [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
` (10 preceding siblings ...)
2025-05-15 3:22 ` [PATCH v7 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
@ 2025-05-15 3:22 ` Nico Pache
2025-05-15 4:40 ` Randy Dunlap
[not found] ` <bc8f72f3-01d9-43db-a632-1f4b9a1d5276@arm.com>
2025-05-28 12:31 ` [PATCH 1/2] mm: khugepaged: allow khugepaged to check all anonymous mTHP orders Baolin Wang
` (2 subsequent siblings)
14 siblings, 2 replies; 57+ messages in thread
From: Nico Pache @ 2025-05-15 3:22 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, Bagas Sanjaya
Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to utilize it.
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..5c63fe51b3ad 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -144,6 +144,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.
+khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size
+to determine collapses. When using mTHPs it's recommended to set
+max_ptes_none low -- ideally less than HPAGE_PMD_NR / 2 (255 on 4k page
+size). This will prevent undesired "creep" behavior that leads to
+continuously collapsing to a larger mTHP size; When we collapse, we are
+bringing in new non-zero pages that will, on a subsequent scan, cause the
+max_ptes_none check of the +1 order to always be satisfied. By limiting
+this to less than half the current order, we make sure we don't cause this
+feedback loop. max_ptes_shared and max_ptes_swap have no effect when
+collapsing to a mTHP, and mTHP collapse will fail on shared or swapped out
+pages.
+
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular
--
2.49.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
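To make the recommended threshold concrete, here is a small userspace
sketch; the proportional (shift-based) scaling below is an assumption
modeled on the admin-guide text above, not code lifted from the series:

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 base pages with 4k pages */

int main(void)
{
	int tunings[] = { 255, 511 };	/* recommended vs. default */

	for (int t = 0; t < 2; t++) {
		int max_ptes_none = tunings[t];

		printf("max_ptes_none = %d\n", max_ptes_none);
		for (int order = 2; order <= HPAGE_PMD_ORDER; order++) {
			int nr_ptes = 1 << order;
			/* assumed: threshold scaled down to this order */
			int allowed_none = max_ptes_none >>
					(HPAGE_PMD_ORDER - order);

			/*
			 * A collapse may fill up to allowed_none previously
			 * empty PTEs; once that is at least half of the
			 * order's range, the next order's scaled check will
			 * pass on a later scan -- the "creep" loop.
			 */
			printf("  order %d: %3d PTEs, %3d may be none%s\n",
			       order, nr_ptes, allowed_none,
			       allowed_none >= nr_ptes / 2 ? " (creep)" : "");
		}
	}
	return 0;
}

With 255 no order trips the creep condition; with the default 511 every
order does, which is exactly the behavior the text above warns about.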
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-05-15 3:22 ` [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-05-15 4:40 ` Randy Dunlap
[not found] ` <bc8f72f3-01d9-43db-a632-1f4b9a1d5276@arm.com>
1 sibling, 0 replies; 57+ messages in thread
From: Randy Dunlap @ 2025-05-15 4:40 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, Bagas Sanjaya
On 5/14/25 8:22 PM, Nico Pache wrote:
> Now that we can collapse to mTHPs, let's update the admin guide to
> reflect these changes and provide proper guidance on how to utilize it.
>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index dff8d5985f0f..5c63fe51b3ad 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,7 @@ often.
> THP can be enabled system wide or restricted to certain tasks or even
> memory ranges inside task's address space. Unless THP is completely
> disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages.
>
> The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
> interface and using madvise(2) and prctl(2) system calls.
> @@ -144,6 +144,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
> sizes, the kernel will select the most appropriate enabled size for a
> given allocation.
>
> +khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size
> +to determine collapses. When using mTHPs it's recommended to set
> +max_ptes_none low -- ideally less than HPAGE_PMD_NR / 2 (255 on 4k page
> +size). This will prevent undesired "creep" behavior that leads to
> +continuously collapsing to a larger mTHP size; When we collapse, we are
either size. When
or size; when
> +bringing in new non-zero pages that will, on a subsequent scan, cause the
> +max_ptes_none check of the +1 order to always be satisfied. By limiting
> +this to less than half the current order, we make sure we don't cause this
> +feedback loop. max_ptes_shared and max_ptes_swap have no effect when
> +collapsing to a mTHP, and mTHP collapse will fail on shared or swapped out
> +pages.
> +
> It's also possible to limit defrag efforts in the VM to generate
> anonymous hugepages in case they're not immediately free to madvise
> regions or to never try to defrag memory and simply fallback to regular
--
~Randy
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
2025-05-15 3:22 ` [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-05-15 5:50 ` Baolin Wang
2025-05-16 11:59 ` Nico Pache
2025-05-16 17:12 ` Liam R. Howlett
1 sibling, 1 reply; 57+ messages in thread
From: Baolin Wang @ 2025-05-15 5:50 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 2025/5/15 11:22, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
>
> Create khugepaged_collapse_single_pmd to increase code
> reuse and create an entry point for future khugepaged changes.
>
> Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> the new khugepaged_collapse_single_pmd function.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
> 1 file changed, 49 insertions(+), 47 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 806bcd8c5185..5457571d505a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int khugepaged_collapse_single_pmd(unsigned long addr,
> + struct vm_area_struct *vma, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + int result = SCAN_FAIL;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
I've removed the CONFIG_SHMEM dependency[1], please do not add it again.
[1]
https://lore.kernel.org/all/ce5c2314e0368cf34bda26f9bacf01c982d4da17.1747119309.git.baolin.wang@linux.alibaba.com/
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-15 3:22 ` [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-05-16 3:20 ` Baolin Wang
2025-05-17 6:47 ` Nico Pache
0 siblings, 1 reply; 57+ messages in thread
From: Baolin Wang @ 2025-05-16 3:20 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 2025/5/15 11:22, Nico Pache wrote:
> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> To add mTHP support we use this scan to instead record chunks of utilized
> sections of the PMD.
>
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the anon PMD. A minimum collapse order of 2 is used as this is
> the lowest order supported by anon memory.
>
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
>
> When attempting to collapse an order that has its order set to "always"
> let's always collapse to that order in a greedy manner without
> considering the number of bits set.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
Sigh. You still haven't addressed or explained the issues I previously
raised [1], so I don't know how to review this patch again...
[1]
https://lore.kernel.org/all/83a66442-b7c7-42e7-af4e-fd211d8ed6f8@linux.alibaba.com/
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
2025-05-15 5:50 ` Baolin Wang
@ 2025-05-16 11:59 ` Nico Pache
0 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-05-16 11:59 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Wed, May 14, 2025 at 11:50 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/5/15 11:22, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create khugepaged_collapse_single_pmd to increase code
> > reuse and create an entry point for future khugepaged changes.
> >
> > Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> > the new khugepaged_collapse_single_pmd function.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
> > 1 file changed, 49 insertions(+), 47 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 806bcd8c5185..5457571d505a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int khugepaged_collapse_single_pmd(unsigned long addr,
> > + struct vm_area_struct *vma, bool *mmap_locked,
> > + struct collapse_control *cc)
> > +{
> > + int result = SCAN_FAIL;
> > + struct mm_struct *mm = vma->vm_mm;
> > +
> > + if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
>
> I've removed the CONFIG_SHMEM dependency[1], please do not add it again.
Sorry I handled the conflict on the removal parts, forgot to handle
the addition part... my bad.
>
> [1]
> https://lore.kernel.org/all/ce5c2314e0368cf34bda26f9bacf01c982d4da17.1747119309.git.baolin.wang@linux.alibaba.com/
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
2025-05-15 3:22 ` [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-05-15 5:50 ` Baolin Wang
@ 2025-05-16 17:12 ` Liam R. Howlett
2025-07-02 0:00 ` Nico Pache
1 sibling, 1 reply; 57+ messages in thread
From: Liam R. Howlett @ 2025-05-16 17:12 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
* Nico Pache <npache@redhat.com> [250514 23:23]:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
>
> Create khugepaged_collapse_single_pmd to increase code
> reuse and create an entry point for future khugepaged changes.
>
> Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> the new khugepaged_collapse_single_pmd function.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
> 1 file changed, 49 insertions(+), 47 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 806bcd8c5185..5457571d505a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int khugepaged_collapse_single_pmd(unsigned long addr,
> + struct vm_area_struct *vma, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + int result = SCAN_FAIL;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
why IS_ENABLED(CONFIG_SHMEM) here, it seems new?
> + struct file *file = get_file(vma->vm_file);
> + pgoff_t pgoff = linear_page_index(vma, addr);
> +
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + result = khugepaged_scan_file(mm, addr, file, pgoff, cc);
> + fput(file);
> + if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> + mmap_read_lock(mm);
> + *mmap_locked = true;
> + if (khugepaged_test_exit_or_disable(mm)) {
> + result = SCAN_ANY_PROCESS;
> + goto end;
> + }
> + result = collapse_pte_mapped_thp(mm, addr,
> + !cc->is_khugepaged);
> + if (result == SCAN_PMD_MAPPED)
> + result = SCAN_SUCCEED;
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + }
> + } else {
> + result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc);
> + }
> + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> + ++khugepaged_pages_collapsed;
> +end:
> + return result;
This function can return with the mmap read lock held or unlocked.
> +}
> +
> static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> struct collapse_control *cc)
> __releases(&khugepaged_mm_lock)
> @@ -2427,34 +2469,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> VM_BUG_ON(khugepaged_scan.address < hstart ||
> khugepaged_scan.address + HPAGE_PMD_SIZE >
> hend);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma,
> - khugepaged_scan.address);
> -
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> - *result = hpage_collapse_scan_file(mm,
> - khugepaged_scan.address, file, pgoff, cc);
> - fput(file);
> - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> - mmap_read_lock(mm);
> - if (hpage_collapse_test_exit_or_disable(mm))
> - goto breakouterloop;
> - *result = collapse_pte_mapped_thp(mm,
> - khugepaged_scan.address, false);
> - if (*result == SCAN_PMD_MAPPED)
> - *result = SCAN_SUCCEED;
> - mmap_read_unlock(mm);
> - }
> - } else {
> - *result = hpage_collapse_scan_pmd(mm, vma,
> - khugepaged_scan.address, &mmap_locked, cc);
> - }
> -
> - if (*result == SCAN_SUCCEED)
> - ++khugepaged_pages_collapsed;
>
> + *result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
> + vma, &mmap_locked, cc);
> + /* If we return SCAN_ANY_PROCESS we are holding the mmap_lock */
But this comment makes it obvious that you know that...
> + if (*result == SCAN_ANY_PROCESS)
> + goto breakouterloop;
But later..
breakouterloop:
mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
breakouterloop_mmap_lock:
So if you return with SCAN_ANY_PROCESS, we are holding the lock only to
go and immediately drop it. This seems unnecessarily complicated and
involves a needless lock operation.
That would leave just the khugepaged_scan_pmd() path with the
unfortunate locking mess - which is a static function and called in one
location.
Looking at what happens after the return seems to indicate we could
clean that up as well, sometime later.
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> progress += HPAGE_PMD_NR;
> @@ -2773,36 +2793,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> mmap_assert_locked(mm);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
>
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> - result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> - cc);
> - fput(file);
> - } else {
> - result = hpage_collapse_scan_pmd(mm, vma, addr,
> - &mmap_locked, cc);
> - }
> + result = khugepaged_collapse_single_pmd(addr, vma, &mmap_locked, cc);
> +
> if (!mmap_locked)
> *prev = NULL; /* Tell caller we dropped mmap_lock */
>
> -handle_result:
> switch (result) {
> case SCAN_SUCCEED:
> case SCAN_PMD_MAPPED:
> ++thps;
> break;
> case SCAN_PTE_MAPPED_HUGEPAGE:
> - BUG_ON(mmap_locked);
> - BUG_ON(*prev);
> - mmap_read_lock(mm);
> - result = collapse_pte_mapped_thp(mm, addr, true);
> - mmap_read_unlock(mm);
> - goto handle_result;
All of the above should probably be replaced with a BUG_ON(1) since it's
not expected now? Or at least WARN_ON_ONCE(), but it should be safe to
continue if that's the case.
It looks like the mmap_locked boolean is used to ensure that *prev is
safe, but we are now dropping the lock and re-acquiring it (and
potentially returning here) with it set to true, so *prev will not be set
to NULL like it should.
I think you can handle this by ensuring that
khugepaged_collapse_single_pmd() returns with mmap_locked false in the
SCAN_ANY_PROCESS case.
> - /* Whitelisted set of results where continuing OK */
This seems worth keeping?
> case SCAN_PMD_NULL:
> case SCAN_PTE_NON_PRESENT:
> case SCAN_PTE_UFFD_WP:
I guess SCAN_ANY_PROCESS should be handled by the default case
statement? It should probably be added to the switch?
That is to say, before your change the result would come from either
hpage_collapse_scan_file(), then lead to collapse_pte_mapped_thp()
above.
Now, you can have khugepaged_test_exit_or_disable() happen to return
SCAN_ANY_PROCESS and it will fall through to the default in this switch
statement, which seems like new behaviour?
At the very least, this information should be added to the git log on
what this patch does - if it's expected?
Thanks,
Liam
^ permalink raw reply [flat|nested] 57+ messages in thread
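A possible shape for the fix Liam suggests, sketched here for
discussion rather than taken from any posted patch, is to normalize the
lock state inside khugepaged_collapse_single_pmd() before bailing out:

	if (khugepaged_test_exit_or_disable(mm)) {
		result = SCAN_ANY_PROCESS;
		/*
		 * Drop the lock here so every return path leaves
		 * *mmap_locked accurate and callers need no
		 * special-case unlock.
		 */
		mmap_read_unlock(mm);
		*mmap_locked = false;
		goto end;
	}

This matches the suggestion that the function return with *mmap_locked
set to false in the SCAN_ANY_PROCESS case.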
* Re: [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-05-15 3:22 ` [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-05-16 17:14 ` Liam R. Howlett
2025-06-29 6:52 ` Nico Pache
2025-05-23 6:55 ` Baolin Wang
1 sibling, 1 reply; 57+ messages in thread
From: Liam R. Howlett @ 2025-05-16 17:14 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
* Nico Pache <npache@redhat.com> [250514 23:23]:
> For khugepaged to support different mTHP orders, we must generalize this
> to check if the PMD is not shared by another VMA and the order is
> enabled.
>
> No functional change in this patch.
This patch needs to be with the functional change for git blame and
reviewing the changes.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5457571d505a..0c4d6a02d59c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> bool expect_anon,
> struct vm_area_struct **vmap,
> - struct collapse_control *cc)
> + struct collapse_control *cc, int order)
> {
> struct vm_area_struct *vma;
> unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> @@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> return SCAN_ADDRESS_RANGE;
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
> @@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -2782,7 +2782,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> mmap_read_lock(mm);
> mmap_locked = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
> - cc);
> + cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED) {
> last_fail = result;
> goto out_nolock;
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 01/12] khugepaged: rename hpage_collapse_* to khugepaged_*
2025-05-15 3:22 ` [PATCH v7 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
@ 2025-05-16 17:30 ` Liam R. Howlett
2025-06-29 6:48 ` Nico Pache
0 siblings, 1 reply; 57+ messages in thread
From: Liam R. Howlett @ 2025-05-16 17:30 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
* Nico Pache <npache@redhat.com> [250514 23:23]:
> Functions in khugepaged.c use a mix of hpage_collapse and khugepaged
> as the function prefix.
>
> Rename all of them to khugepaged to keep things consistent and slightly
> shorten the function names.
I don't like what was done here; we've lost the context of what these
functions are used for (collapse). Are they used for other things
besides collapse?
I'd rather drop the prefix entirely than drop collapse from them all.
They are all static, so do we really need khugepaged_ at the start of
every static function in khugepaged.c?
>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cdf5a581368b..806bcd8c5185 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
> kmem_cache_destroy(mm_slot_cache);
> }
>
> -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> +static inline int khugepaged_test_exit(struct mm_struct *mm)
> {
> return atomic_read(&mm->mm_users) == 0;
> }
>
> -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> +static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
> {
> - return hpage_collapse_test_exit(mm) ||
> + return khugepaged_test_exit(mm) ||
> test_bit(MMF_DISABLE_THP, &mm->flags);
> }
>
> @@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
> int wakeup;
>
> /* __khugepaged_exit() must not run from under us */
> - VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> + VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
> if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
> return;
>
> @@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
> } else if (mm_slot) {
> /*
> * This is required to serialize against
> - * hpage_collapse_test_exit() (which is guaranteed to run
> + * khugepaged_test_exit() (which is guaranteed to run
> * under mmap sem read mode). Stop here (after we return all
> * pagetables will be destroyed) until khugepaged has finished
> * working on the pagetables under the mmap_lock.
> @@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
> .is_khugepaged = true,
> };
>
> -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> +static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> {
> int i;
>
> @@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> }
>
> #ifdef CONFIG_NUMA
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
> {
> int nid, target_node = 0, max_value = 0;
>
> @@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> return target_node;
> }
> #else
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
> {
> return 0;
> }
> @@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> struct vm_area_struct *vma;
> unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> + if (unlikely(khugepaged_test_exit_or_disable(mm)))
> return SCAN_ANY_PROCESS;
>
> *vmap = vma = find_vma(mm, address);
> @@ -992,7 +992,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>
> /*
> * Bring missing pages in from swap, to complete THP collapse.
> - * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> *
> * Called and returns without pte mapped or spinlocks held.
> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> @@ -1078,7 +1078,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> {
> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> GFP_TRANSHUGE);
> - int node = hpage_collapse_find_target_node(cc);
> + int node = khugepaged_find_target_node(cc);
> struct folio *folio;
>
> folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> @@ -1264,7 +1264,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> return result;
> }
>
> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> +static int khugepaged_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long address, bool *mmap_locked,
> struct collapse_control *cc)
> @@ -1378,7 +1378,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> * hit record.
> */
> node = folio_nid(folio);
> - if (hpage_collapse_scan_abort(node, cc)) {
> + if (khugepaged_scan_abort(node, cc)) {
> result = SCAN_SCAN_ABORT;
> goto out_unmap;
> }
> @@ -1447,7 +1447,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>
> lockdep_assert_held(&khugepaged_mm_lock);
>
> - if (hpage_collapse_test_exit(mm)) {
> + if (khugepaged_test_exit(mm)) {
> /* free mm_slot */
> hash_del(&slot->hash);
> list_del(&slot->mm_node);
> @@ -1740,7 +1740,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> continue;
>
> - if (hpage_collapse_test_exit(mm))
> + if (khugepaged_test_exit(mm))
> continue;
> /*
> * When a vma is registered with uffd-wp, we cannot recycle
> @@ -2262,7 +2262,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> +static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
> struct file *file, pgoff_t start,
> struct collapse_control *cc)
> {
> @@ -2307,7 +2307,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> }
>
> node = folio_nid(folio);
> - if (hpage_collapse_scan_abort(node, cc)) {
> + if (khugepaged_scan_abort(node, cc)) {
> result = SCAN_SCAN_ABORT;
> break;
> }
> @@ -2391,7 +2391,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> goto breakouterloop_mmap_lock;
>
> progress++;
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> + if (unlikely(khugepaged_test_exit_or_disable(mm)))
> goto breakouterloop;
>
> vma_iter_init(&vmi, mm, khugepaged_scan.address);
> @@ -2399,7 +2399,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> unsigned long hstart, hend;
>
> cond_resched();
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> + if (unlikely(khugepaged_test_exit_or_disable(mm))) {
> progress++;
> break;
> }
> @@ -2421,7 +2421,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> bool mmap_locked = true;
>
> cond_resched();
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> + if (unlikely(khugepaged_test_exit_or_disable(mm)))
> goto breakouterloop;
>
> VM_BUG_ON(khugepaged_scan.address < hstart ||
> @@ -2481,7 +2481,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * Release the current mm_slot if this mm is about to die, or
> * if we scanned all vmas of this mm.
> */
> - if (hpage_collapse_test_exit(mm) || !vma) {
> + if (khugepaged_test_exit(mm) || !vma) {
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-16 3:20 ` Baolin Wang
@ 2025-05-17 6:47 ` Nico Pache
2025-05-18 3:04 ` Liam R. Howlett
2025-05-20 10:09 ` Baolin Wang
0 siblings, 2 replies; 57+ messages in thread
From: Nico Pache @ 2025-05-17 6:47 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Thu, May 15, 2025 at 9:20 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/5/15 11:22, Nico Pache wrote:
> > khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> > To add mTHP support we use this scan to instead record chunks of utilized
> > sections of the PMD.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the anon PMD. A minimum collapse order of 2 is used as this is
> > the lowest order supported by anon memory.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > When attempting to collapse an order that has its order set to "always"
> > let's always collapse to that order in a greedy manner without
> > considering the number of bits set.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> Sigh. You still haven't addressed or explained the issues I previously
> raised [1], so I don't know how to review this patch again...
Can you still reproduce this issue?
I can no longer reproduce this issue, that's why I posted... although
I should have followed up, and looked into what the original issue
was. Nothing really sticks out so perhaps something in mm-new was
broken and pulled out... not sure.
It should now follow the expected behavior, which is that no mTHP
collapse occurs because if the PMD size is disabled so is khugepaged
collapse.
Lmk if you are still experiencing this issue please.
Cheers,
-- Nico
>
> [1]
> https://lore.kernel.org/all/83a66442-b7c7-42e7-af4e-fd211d8ed6f8@linux.alibaba.com/
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-17 6:47 ` Nico Pache
@ 2025-05-18 3:04 ` Liam R. Howlett
2025-05-20 10:09 ` Baolin Wang
1 sibling, 0 replies; 57+ messages in thread
From: Liam R. Howlett @ 2025-05-18 3:04 UTC (permalink / raw)
To: Nico Pache
Cc: Baolin Wang, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, ziy, lorenzo.stoakes, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap
* Nico Pache <npache@redhat.com> [250517 02:48]:
> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2025/5/15 11:22, Nico Pache wrote:
> > > khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> > > To add mTHP support we use this scan to instead record chunks of utilized
> > > sections of the PMD.
> > >
> > > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > > that represents chunks of utilized regions. We can then determine what
> > > mTHP size fits best and in the following patch, we set this bitmap while
> > > scanning the anon PMD. A minimum collapse order of 2 is used as this is
> > > the lowest order supported by anon memory.
> > >
> > > max_ptes_none is used as a scale to determine how "full" an order must
> > > be before being considered for collapse.
> > >
> > > When attempting to collapse an order that has its order set to "always"
> > > let's always collapse to that order in a greedy manner without
> > > considering the number of bits set.
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> >
> > Sigh. You still haven't addressed or explained the issues I previously
> > raised [1], so I don't know how to review this patch again...
Thanks Baolin for highlighting this.
> Can you still reproduce this issue?
> I can no longer reproduce this issue, that's why I posted... although
> I should have followed up, and looked into what the original issue
> was. Nothing really sticks out so perhaps something in mm-new was
> broken and pulled out... not sure.
This was verified as an issue by you and you didn't mention it in the
cover letter or the patches? It's one thing if you need help testing
changes when you can't recreate the issue, but this is not the case
here. What's worse is that you said you would look into it further, but
then dropped any mention of the issue.
Every comment on your patches should be treated as a bug report and
failure to address them (in code or reply) means your patches should not
be taken. Otherwise things fall through the cracks (and then get picked
up by syzbot, followed by a CVE and a bounty collection - if we are
lucky).
>
> It should now follow the expected behavior, which is that no mTHP
> collapse occurs because if the PMD size is disabled so is khugepaged
> collapse.
>
> Lmk if you are still experiencing this issue please.
>
I'm sorry, but that's not how this process works.
Let us know if it's still an issue and detail the fix and/or cause of
the issue so that we are confident that it's handled diligently, and
know what to check to make sure the fix make sense.
Everyone misses things, but the fact that you are saying you should have
followed up, but are still requesting someone else figure it out is
troubling. It points to an incomplete understanding of the issue(s).
Does it happen in mm-stable or mm-unstable with your patches? Were
there other fixes that went in between the spins that may have affected
it? Please take the time to fully understand what happened and why it
is no longer happening.
If you cannot get to the bottom of it, then don't send out another
revision - follow up to the initial comments and get help to dig deeper.
Maybe there is a bug here that has nothing to do with your patches, but
you have triggered it for some reason and we need more investigation.
I'm not willing to accept patches that now work for an unknown reason,
and neither should you.
> Cheers,
> -- Nico
> >
> > [1]
> > https://lore.kernel.org/all/83a66442-b7c7-42e7-af4e-fd211d8ed6f8@linux.alibaba.com/
> >
>
Thanks,
Liam
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-17 6:47 ` Nico Pache
2025-05-18 3:04 ` Liam R. Howlett
@ 2025-05-20 10:09 ` Baolin Wang
2025-05-20 10:26 ` David Hildenbrand
2025-05-21 10:23 ` Nico Pache
1 sibling, 2 replies; 57+ messages in thread
From: Baolin Wang @ 2025-05-20 10:09 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
Sorry for the late reply.
On 2025/5/17 14:47, Nico Pache wrote:
> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/5/15 11:22, Nico Pache wrote:
>>> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
>>> To add mTHP support we use this scan to instead record chunks of utilized
>>> sections of the PMD.
>>>
>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>> that represents chunks of utilized regions. We can then determine what
>>> mTHP size fits best and in the following patch, we set this bitmap while
>>> scanning the anon PMD. A minimum collapse order of 2 is used as this is
>>> the lowest order supported by anon memory.
>>>
>>> max_ptes_none is used as a scale to determine how "full" an order must
>>> be before being considered for collapse.
>>>
>>> When attempting to collapse an order that has its order set to "always"
>>> let's always collapse to that order in a greedy manner without
>>> considering the number of bits set.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>
>> Sigh. You still haven't addressed or explained the issues I previously
>> raised [1], so I don't know how to review this patch again...
> Can you still reproduce this issue?
Yes, I can still reproduce this issue with today's (5/20) mm-new branch.
I've disabled PMD-sized THP in my system:
[root]# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
[root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
always inherit madvise [never]
And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
and I can still see it collapsing to a PMD-sized THP.
> I can no longer reproduce this issue, that's why I posted... although
> I should have followed up, and looked into what the original issue
> was. Nothing really sticks out so perhaps something in mm-new was
> broken and pulled out... not sure.
>
> It should now follow the expected behavior, which is that no mTHP
> collapse occurs because if the PMD size is disabled so is khugepaged
> collapse.
>
> Lmk if you are still experiencing this issue please.
>
> Cheers,
> -- Nico
>>
>> [1]
>> https://lore.kernel.org/all/83a66442-b7c7-42e7-af4e-fd211d8ed6f8@linux.alibaba.com/
>>
^ permalink raw reply [flat|nested] 57+ messages in thread
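The reproducer is easy to restate as a small program (a sketch based on
Baolin's description; the mapping size, the alignment handling, and the
output are illustrative assumptions):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* uapi value, if libc headers lack it */
#endif

#define PMD_SZ (2UL << 20)	/* one PMD range: 2M with 4k base pages */

int main(void)
{
	/* over-map so a PMD-aligned 2M window can be carved out */
	char *raw = mmap(NULL, 2 * PMD_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p = (char *)(((unsigned long)raw + PMD_SZ - 1) & ~(PMD_SZ - 1));
	memset(p, 1, PMD_SZ);	/* fault in all 512 base pages */

	/*
	 * With every THP size set to "never" (the sysfs state shown
	 * above), this should fail; the reported bug is that it still
	 * collapses to a PMD-sized THP.
	 */
	if (madvise(p, PMD_SZ, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");
	else
		printf("collapse succeeded\n");
	return 0;
}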
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-20 10:09 ` Baolin Wang
@ 2025-05-20 10:26 ` David Hildenbrand
2025-05-21 1:03 ` Baolin Wang
2025-05-21 10:23 ` Nico Pache
1 sibling, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2025-05-20 10:26 UTC (permalink / raw)
To: Baolin Wang, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 20.05.25 12:09, Baolin Wang wrote:
> Sorry for the late reply.
>
> On 2025/5/17 14:47, Nico Pache wrote:
>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
>>>> To add mTHP support we use this scan to instead record chunks of utilized
>>>> sections of the PMD.
>>>>
>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>>> that represents chunks of utilized regions. We can then determine what
>>>> mTHP size fits best and in the following patch, we set this bitmap while
>>>> scanning the anon PMD. A minimum collapse order of 2 is used as this is
>>>> the lowest order supported by anon memory.
>>>>
>>>> max_ptes_none is used as a scale to determine how "full" an order must
>>>> be before being considered for collapse.
>>>>
>>>> When attempting to collapse an order that has its order set to "always"
>>>> let's always collapse to that order in a greedy manner without
>>>> considering the number of bits set.
>>>>
>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>
>>> Sigh. You still haven't addressed or explained the issues I previously
>>> raised [1], so I don't know how to review this patch again...
>> Can you still reproduce this issue?
>
> Yes, I can still reproduce this issue with today's (5/20) mm-new branch.
>
> I've disabled PMD-sized THP in my system:
> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> [root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> always inherit madvise [never]
Thanks for the easy reproducer, Baolin! It's certainly something that
must be fixed.
>
> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
> and I can still see it collapsing to a PMD-sized THP.
This almost sounds like it could be converted into an easy selftest.
Baolin, do you have other ideas for easy selftests? It might be good to
include some in the next version.
I can think of: enable only a single size, then MADV_COLLAPSE X times
and see if it worked. etc.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-20 10:26 ` David Hildenbrand
@ 2025-05-21 1:03 ` Baolin Wang
0 siblings, 0 replies; 57+ messages in thread
From: Baolin Wang @ 2025-05-21 1:03 UTC (permalink / raw)
To: David Hildenbrand, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 2025/5/20 18:26, David Hildenbrand wrote:
> On 20.05.25 12:09, Baolin Wang wrote:
>> Sorry for late reply.
>>
>> On 2025/5/17 14:47, Nico Pache wrote:
>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>>> khugepaged scans anons PMD ranges for potential collapse to a
>>>>> hugepage.
>>>>> To add mTHP support we use this scan to instead record chunks of
>>>>> utilized
>>>>> sections of the PMD.
>>>>>
>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a
>>>>> bitmap
>>>>> that represents chunks of utilized regions. We can then determine what
>>>>> mTHP size fits best and in the following patch, we set this bitmap
>>>>> while
>>>>> scanning the anon PMD. A minimum collapse order of 2 is used as
>>>>> this is
>>>>> the lowest order supported by anon memory.
>>>>>
>>>>> max_ptes_none is used as a scale to determine how "full" an order must
>>>>> be before being considered for collapse.
>>>>>
>>>>> When attempting to collapse an order that has its order set to
>>>>> "always"
>>>>> lets always collapse to that order in a greedy manner without
>>>>> considering the number of bits set.
>>>>>
>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>
>>>> Sigh. You still haven't addressed or explained the issues I previously
>>>> raised [1], so I don't know how to review this patch again...
>>> Can you still reproduce this issue?
>>
>> Yes, I can still reproduce this issue with today's (5/20) mm-new branch.
>>
>> I've disabled PMD-sized THP in my system:
>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
>> always madvise [never]
>> [root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> always inherit madvise [never]
>
> Thanks for the easy reproducer, Baolin! It's certainly something that
> must be fixed.
>
>>
>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
>> and I can still see it collapsing to a PMD-sized THP.
>
> This almost sounds like it could be converted into an easy selftest.
>
> Baolin, do you have other ideas for easy selftests? It might be good to
> include some in the next version.
> I can think of: enable only a single size, then MADV_COLLAPSE X times
> and see if it worked. etc.
Yes. And some easy test cases I want to write are:
(1) Enable all mTHP sizes, tuning the 'max_ptes_none' parameter to check
whether the suitably sized mTHP is collapsed (covering both
MADV_COLLAPSE and khugepaged).
(2) Enable only a single size, then use MADV_COLLAPSE madvise() or
khugepaged to check whether the suitably sized mTHP is collapsed.
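For case (2), the setup step could be sketched in C as below (a sketch,
assuming a 4K base page size so that the hugepages-64kB and
hugepages-2048kB knobs exist; it must run as root, and error handling
is trimmed):

#include <stdio.h>

static void thp_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	/* Disable the global policy and the PMD size; enable only 64K. */
	thp_write("/sys/kernel/mm/transparent_hugepage/enabled", "never");
	thp_write("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
		  "never");
	thp_write("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled",
		  "always");
	return 0;
}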
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-20 10:09 ` Baolin Wang
2025-05-20 10:26 ` David Hildenbrand
@ 2025-05-21 10:23 ` Nico Pache
2025-05-22 9:39 ` Baolin Wang
1 sibling, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-05-21 10:23 UTC (permalink / raw)
To: Baolin Wang, David Rientjes, zokeefe
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, hannes, mhocko, rdunlap
On Tue, May 20, 2025 at 4:09 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Sorry for late reply.
>
> On 2025/5/17 14:47, Nico Pache wrote:
> > On Thu, May 15, 2025 at 9:20 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2025/5/15 11:22, Nico Pache wrote:
> >>> khugepaged scans anons PMD ranges for potential collapse to a hugepage.
> >>> To add mTHP support we use this scan to instead record chunks of utilized
> >>> sections of the PMD.
> >>>
> >>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> >>> that represents chunks of utilized regions. We can then determine what
> >>> mTHP size fits best and in the following patch, we set this bitmap while
> >>> scanning the anon PMD. A minimum collapse order of 2 is used as this is
> >>> the lowest order supported by anon memory.
> >>>
> >>> max_ptes_none is used as a scale to determine how "full" an order must
> >>> be before being considered for collapse.
> >>>
> >>> When attempting to collapse an order that has its order set to "always"
> >>> lets always collapse to that order in a greedy manner without
> >>> considering the number of bits set.
> >>>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>
> >> Sigh. You still haven't addressed or explained the issues I previously
> >> raised [1], so I don't know how to review this patch again...
> > Can you still reproduce this issue?
>
> Yes, I can still reproduce this issue with today's (5/20) mm-new branch.
>
> I've disabled PMD-sized THP in my system:
> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> [root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> always inherit madvise [never]
>
> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
> and I can still see it collapsing to a PMD-sized THP.
Hi Baolin! Thank you for your reply and willingness to test again :)
I didn't realize we were talking about madvise collapse-- this makes
sense now. I also figured out why I could "reproduce" it before. My
script was always enabling the THP settings in two places, and I only
commented out one to test this. But this time I was doing more manual
testing.
The original design of madvise_collapse ignores the sysfs settings and
collapses even if you have an order disabled. I believe this behavior
is wrong, but it is by design. I spent some time playing around with
madvise collapses with and w/o my changes. This is not a new thing: I
reproduced the issue in 6.11 (Fedora 41), and I think it's been
possible since the inception of madvise collapse 3 years ago. I
noticed a similar behavior on one of my RFCs since it was "breaking"
selftests, and the fix was to reincorporate this broken sysfs
behavior.
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
"This call is independent of the system-wide THP sysfs settings, but
will fail for memory marked VM_NOHUGEPAGE."
The second condition holds true (it fails for VM_NOHUGEPAGE), but I
don't know if we actually want madvise_collapse to be independent of
the system-wide settings.
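For reference, a minimal reproducer along these lines (a sketch, not
the test script; assumes a 2M PMD size and that every THP sysfs knob
was set to "never" beforehand):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD */
	char *buf = aligned_alloc(len, len);
	char line[256];
	FILE *f;

	if (!buf)
		return 1;
	memset(buf, 1, len);
	if (madvise(buf, len, MADV_COLLAPSE))
		perror("madvise");	/* expected once "never" is honored */

	/* On current kernels this still reports a 2048 kB THP mapping. */
	f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "AnonHugePages:") && !strstr(line, " 0 kB"))
			fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}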
So I'll ask the authors:
+David Rientjes +zokeefe@google.com
Was this brought up as a concern when this feature was first
introduced? Was there any pushback, and if so, what was the outcome of
the discussion?
I can easily fix this, and it would further simplify the code (by
removing is_khugepaged and friends). As David H. has brought up in
other discussions around similar topics, never should mean never; is
this the only exception we should allow?
Thanks!
>
> > I can no longer reproduce this issue, that's why I posted... although
> > I should have followed up, and looked into what the original issue
> > was. Nothing really sticks out so perhaps something in mm-new was
> > broken and pulled out... not sure.
> >
> > It should now follow the expected behavior: no mTHP collapse occurs,
> > because if the PMD size is disabled, khugepaged collapse is disabled
> > as well.
> >
> > Lmk if you are still experiencing this issue please.
> >
> > Cheers,
> > -- Nico
> >>
> >> [1]
> >> https://lore.kernel.org/all/83a66442-b7c7-42e7-af4e-fd211d8ed6f8@linux.alibaba.com/
> >>
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-21 10:23 ` Nico Pache
@ 2025-05-22 9:39 ` Baolin Wang
2025-05-28 9:26 ` David Hildenbrand
0 siblings, 1 reply; 57+ messages in thread
From: Baolin Wang @ 2025-05-22 9:39 UTC (permalink / raw)
To: Nico Pache, David Rientjes, zokeefe
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, hannes, mhocko, rdunlap
On 2025/5/21 18:23, Nico Pache wrote:
> On Tue, May 20, 2025 at 4:09 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Sorry for late reply.
>>
>> On 2025/5/17 14:47, Nico Pache wrote:
>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>>> khugepaged scans anons PMD ranges for potential collapse to a hugepage.
>>>>> To add mTHP support we use this scan to instead record chunks of utilized
>>>>> sections of the PMD.
>>>>>
>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>>>> that represents chunks of utilized regions. We can then determine what
>>>>> mTHP size fits best and in the following patch, we set this bitmap while
>>>>> scanning the anon PMD. A minimum collapse order of 2 is used as this is
>>>>> the lowest order supported by anon memory.
>>>>>
>>>>> max_ptes_none is used as a scale to determine how "full" an order must
>>>>> be before being considered for collapse.
>>>>>
>>>>> When attempting to collapse an order that has its order set to "always"
>>>>> lets always collapse to that order in a greedy manner without
>>>>> considering the number of bits set.
>>>>>
>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>
>>>> Sigh. You still haven't addressed or explained the issues I previously
>>>> raised [1], so I don't know how to review this patch again...
>>> Can you still reproduce this issue?
>>
>> Yes, I can still reproduce this issue with today's (5/20) mm-new branch.
>>
>> I've disabled PMD-sized THP in my system:
>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
>> always madvise [never]
>> [root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> always inherit madvise [never]
>>
>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
>> and I can still see it collapsing to a PMD-sized THP.
> Hi Baolin ! Thank you for your reply and willingness to test again :)
>
> I didn't realize we were talking about madvise collapse-- this makes
> sense now. I also figured out why I could "reproduce" it before. My
> script was always enabling the THP settings in two places, and I only
> commented out one to test this. But this time I was doing more manual
> testing.
>
> The original design of madvise_collapse ignores the sysfs and
> collapses even if you have an order disabled. I believe this behavior
> is wrong, but by design. I spent some time playing around with madvise
> collapses with and w/o my changes. This is not a new thing, I
> reproduced the issue in 6.11 (Fedora 41), and I think its been
> possible since the inception of madvise collapse 3 years ago. I
> noticed a similar behavior on one of my RFC since it was "breaking"
> selftests, and the fix was to reincorporate this broken sysfs
> behavior.
OK. Thanks for the explanation.
> 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
> "This call is independent of the system-wide THP sysfs settings, but
> will fail for memory marked VM_NOHUGEPAGE."
>
> The second condition holds true (and fails for VM_NOHUGEPAGE), but I
> dont know if we actually want madvise_collapse to be independent of
> the system-wide.
This design principle surprised me a bit, and I failed to find the
reason in the commit log. I agree that "never should mean never," and we
should respect the THP/mTHP sysfs settings. Additionally, for the
'shmem_enabled' sysfs interface that controls shmem/tmpfs, THP collapse
can still be prohibited through the 'deny' configuration. The rules here
are somewhat confusing.
> So I'll ask the authors
> +David Rientjes +zokeefe@google.com
> Was this brought up as a concern when this feature was first
> introduced, was there any pushback, what was the outcome of the
> discussion if so?
> I can easily fix this and it would further simplify the code (by
> removing the is_khugepaged and friends). As David H. has brought up in
> other discussions around similar topics, never should mean never, is
> this the only exception we should allow?
I don't think we need this exception, unless there is some solid reason.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-05-15 3:22 ` [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-05-16 17:14 ` Liam R. Howlett
@ 2025-05-23 6:55 ` Baolin Wang
2025-05-28 6:57 ` Dev Jain
2025-05-29 4:00 ` Nico Pache
1 sibling, 2 replies; 57+ messages in thread
From: Baolin Wang @ 2025-05-23 6:55 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 2025/5/15 11:22, Nico Pache wrote:
> For khugepaged to support different mTHP orders, we must generalize this
> to check if the PMD is not shared by another VMA and the order is
> enabled.
>
> No functional change in this patch.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5457571d505a..0c4d6a02d59c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> bool expect_anon,
> struct vm_area_struct **vmap,
> - struct collapse_control *cc)
> + struct collapse_control *cc, int order)
> {
> struct vm_area_struct *vma;
> unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> @@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
Sorry, I missed this before. Should we also change 'PMD_ORDER' to
'order' for the thp_vma_suitable_order()?
> return SCAN_ADDRESS_RANGE;
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
> @@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -2782,7 +2782,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> mmap_read_lock(mm);
> mmap_locked = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
> - cc);
> + cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED) {
> last_fail = result;
> goto out_nolock;
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-05-23 6:55 ` Baolin Wang
@ 2025-05-28 6:57 ` Dev Jain
2025-05-29 4:00 ` Nico Pache
1 sibling, 0 replies; 57+ messages in thread
From: Dev Jain @ 2025-05-28 6:57 UTC (permalink / raw)
To: Baolin Wang, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel
Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 23/05/25 12:25 pm, Baolin Wang wrote:
>
>
> On 2025/5/15 11:22, Nico Pache wrote:
>> For khugepaged to support different mTHP orders, we must generalize this
>> to check if the PMD is not shared by another VMA and the order is
>> enabled.
>>
>> No functional change in this patch.
>>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 5457571d505a..0c4d6a02d59c 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct
>> collapse_control *cc)
>> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned
>> long address,
>> bool expect_anon,
>> struct vm_area_struct **vmap,
>> - struct collapse_control *cc)
>> + struct collapse_control *cc, int order)
>> {
>> struct vm_area_struct *vma;
>> unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS
>> : 0;
>> @@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct
>> mm_struct *mm, unsigned long address,
>> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>
> Sorry, I missed this before. Should we also change 'PMD_ORDER' to
> 'order' for the thp_vma_suitable_order()?
You are right, I did that in my patch:
https://lore.kernel.org/all/20250211111326.14295-3-dev.jain@arm.com/
>
>> return SCAN_ADDRESS_RANGE;
>> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags,
>> PMD_ORDER))
>> + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
>> return SCAN_VMA_CHECK;
>> /*
>> * Anon VMA expected, the address may be unmapped then
>> @@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct
>> *mm, unsigned long address,
>> goto out_nolock;
>> mmap_read_lock(mm);
>> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> HPAGE_PMD_ORDER);
>> if (result != SCAN_SUCCEED) {
>> mmap_read_unlock(mm);
>> goto out_nolock;
>> @@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct
>> *mm, unsigned long address,
>> * mmap_lock.
>> */
>> mmap_write_lock(mm);
>> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> HPAGE_PMD_ORDER);
>> if (result != SCAN_SUCCEED)
>> goto out_up_write;
>> /* check if the pmd is still valid */
>> @@ -2782,7 +2782,7 @@ int madvise_collapse(struct vm_area_struct
>> *vma, struct vm_area_struct **prev,
>> mmap_read_lock(mm);
>> mmap_locked = true;
>> result = hugepage_vma_revalidate(mm, addr, false, &vma,
>> - cc);
>> + cc, HPAGE_PMD_ORDER);
>> if (result != SCAN_SUCCEED) {
>> last_fail = result;
>> goto out_nolock;
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-22 9:39 ` Baolin Wang
@ 2025-05-28 9:26 ` David Hildenbrand
2025-05-28 14:04 ` Baolin Wang
0 siblings, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2025-05-28 9:26 UTC (permalink / raw)
To: Baolin Wang, Nico Pache, David Rientjes, zokeefe
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, hannes, mhocko, rdunlap
On 22.05.25 11:39, Baolin Wang wrote:
>
>
> On 2025/5/21 18:23, Nico Pache wrote:
>> On Tue, May 20, 2025 at 4:09 AM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>> Sorry for late reply.
>>>
>>> On 2025/5/17 14:47, Nico Pache wrote:
>>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>>>> khugepaged scans anons PMD ranges for potential collapse to a hugepage.
>>>>>> To add mTHP support we use this scan to instead record chunks of utilized
>>>>>> sections of the PMD.
>>>>>>
>>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>>>>> that represents chunks of utilized regions. We can then determine what
>>>>>> mTHP size fits best and in the following patch, we set this bitmap while
>>>>>> scanning the anon PMD. A minimum collapse order of 2 is used as this is
>>>>>> the lowest order supported by anon memory.
>>>>>>
>>>>>> max_ptes_none is used as a scale to determine how "full" an order must
>>>>>> be before being considered for collapse.
>>>>>>
>>>>>> When attempting to collapse an order that has its order set to "always"
>>>>>> lets always collapse to that order in a greedy manner without
>>>>>> considering the number of bits set.
>>>>>>
>>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>>
>>>>> Sigh. You still haven't addressed or explained the issues I previously
>>>>> raised [1], so I don't know how to review this patch again...
>>>> Can you still reproduce this issue?
>>>
>>> Yes, I can still reproduce this issue with today's (5/20) mm-new branch.
>>>
>>> I've disabled PMD-sized THP in my system:
>>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
>>> always madvise [never]
>>> [root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>> always inherit madvise [never]
>>>
>>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
>>> and I can still see it collapsing to a PMD-sized THP.
>> Hi Baolin ! Thank you for your reply and willingness to test again :)
>>
>> I didn't realize we were talking about madvise collapse-- this makes
>> sense now. I also figured out why I could "reproduce" it before. My
>> script was always enabling the THP settings in two places, and I only
>> commented out one to test this. But this time I was doing more manual
>> testing.
>>
>> The original design of madvise_collapse ignores the sysfs and
>> collapses even if you have an order disabled. I believe this behavior
>> is wrong, but by design. I spent some time playing around with madvise
>> collapses with and w/o my changes. This is not a new thing, I
>> reproduced the issue in 6.11 (Fedora 41), and I think its been
>> possible since the inception of madvise collapse 3 years ago. I
>> noticed a similar behavior on one of my RFC since it was "breaking"
>> selftests, and the fix was to reincorporate this broken sysfs
>> behavior.
>
> OK. Thanks for the explanation.
>
>> 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
>> "This call is independent of the system-wide THP sysfs settings, but
>> will fail for memory marked VM_NOHUGEPAGE."
>>
>> The second condition holds true (and fails for VM_NOHUGEPAGE), but I
>> dont know if we actually want madvise_collapse to be independent of
>> the system-wide.
>
> This design principle surprised me a bit, and I failed to find the
> reason in the commit log. I agree that "never should mean never," and we
> should respect the THP/mTHP sysfs setting. Additionally, for the
> 'shmem_enabled' sysfs interface controlled for shmem/tmpfs, THP collapse
> can still be prohibited through the 'deny' configuration. The rules here
> are somewhat confusing.
I recall that we decided to overwrite "VM_NOHUGEPAGE", because the
assumption is that the same app that triggered MADV_NOHUGEPAGE triggers
the collapse. So the app decides on its own behavior.
Similarly, allowing collapsing in a VMA without VM_HUGEPAGE in the
"madvise" mode would be fine.
But in the "never" case, we should just "never" collapse.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 1/2] mm: khugepaged: allow khugepaged to check all anonymous mTHP orders
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
` (11 preceding siblings ...)
2025-05-15 3:22 ` [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-05-28 12:31 ` Baolin Wang
2025-05-28 12:31 ` [PATCH 2/2] mm: khugepaged: kick khugepaged for enabling non-PMD-sized mTHPs Baolin Wang
2025-05-28 12:39 ` [PATCH v7 00/12] khugepaged: mTHP support Baolin Wang
2025-06-16 3:51 ` Dev Jain
14 siblings, 1 reply; 57+ messages in thread
From: Baolin Wang @ 2025-05-28 12:31 UTC (permalink / raw)
To: npache
Cc: Liam.Howlett, aarcange, akpm, anshuman.khandual, baohua,
baolin.wang, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, hannes, jack, jglisse, kirill.shutemov, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, lorenzo.stoakes,
mathieu.desnoyers, mhiramat, mhocko, peterx, raquini, rdunlap,
rientjes, rostedt, ryan.roberts, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vishal.moola,
wangkefeng.wang, will, willy, yang, ziy, zokeefe
We have now allowed mTHP collapse, but thp_vma_allowable_order() still only
checks if the PMD-sized mTHP is allowed to collapse. This prevents scanning
and collapsing of 64K mTHP when only 64K mTHP is enabled. Thus, we should
modify the checks to allow all large orders of anonymous mTHP.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/khugepaged.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0723b184c7a4..16542ecf02dc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -491,8 +491,11 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
- PMD_ORDER))
+ unsigned long orders = vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
+
+ if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
+ orders))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -2618,6 +2621,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
vma_iter_init(&vmi, mm, khugepaged_scan.address);
for_each_vma(vmi, vma) {
+ unsigned long orders = vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
unsigned long hstart, hend;
cond_resched();
@@ -2625,8 +2630,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_ENFORCE_SYSFS, orders)) {
skip:
progress++;
continue;
--
2.43.5
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 2/2] mm: khugepaged: kick khugepaged for enabling non-PMD-sized mTHPs
2025-05-28 12:31 ` [PATCH 1/2] mm: khugepaged: allow khugepaged to check all anonymous mTHP orders Baolin Wang
@ 2025-05-28 12:31 ` Baolin Wang
0 siblings, 0 replies; 57+ messages in thread
From: Baolin Wang @ 2025-05-28 12:31 UTC (permalink / raw)
To: npache
Cc: Liam.Howlett, aarcange, akpm, anshuman.khandual, baohua,
baolin.wang, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, hannes, jack, jglisse, kirill.shutemov, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, lorenzo.stoakes,
mathieu.desnoyers, mhiramat, mhocko, peterx, raquini, rdunlap,
rientjes, rostedt, ryan.roberts, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vishal.moola,
wangkefeng.wang, will, willy, yang, ziy, zokeefe
When only non-PMD-sized mTHP is enabled (for example, when only 64K mTHP
is enabled), we should also allow kicking khugepaged to attempt scanning
and collapsing 64K mTHP. Modify hugepage_pmd_enabled() to support mTHP
collapse, and while we are at it, rename it to make the function name clearer.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/khugepaged.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 16542ecf02dc..155ef8d286e2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -430,7 +430,7 @@ static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
test_bit(MMF_DISABLE_THP, &mm->flags);
}
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
@@ -442,11 +442,11 @@ static bool hugepage_pmd_enabled(void)
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+ if (READ_ONCE(huge_anon_orders_always))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+ if (READ_ONCE(huge_anon_orders_madvise))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ if (READ_ONCE(huge_anon_orders_inherit) &&
hugepage_global_enabled())
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -490,7 +490,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
- hugepage_pmd_enabled()) {
+ hugepage_enabled()) {
unsigned long orders = vma_is_anonymous(vma) ?
THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
@@ -2711,7 +2711,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
}
static int khugepaged_wait_event(void)
@@ -2784,7 +2784,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (hugepage_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2815,7 +2815,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!hugepage_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2865,7 +2865,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (hugepage_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2891,7 +2891,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (hugepage_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH v7 00/12] khugepaged: mTHP support
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
` (12 preceding siblings ...)
2025-05-28 12:31 ` [PATCH 1/2] mm: khugepaged: allow khugepaged to check all anonymous mTHP orders Baolin Wang
@ 2025-05-28 12:39 ` Baolin Wang
2025-05-29 3:52 ` Nico Pache
2025-06-16 3:51 ` Dev Jain
14 siblings, 1 reply; 57+ messages in thread
From: Baolin Wang @ 2025-05-28 12:39 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 2025/5/15 11:22, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse anonymous memory regions to mTHPs.
>
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. This info is
> tracked using a bitmap. After the PMD scan is done, we do binary recursion
> on the bitmap to find the optimal mTHP sizes for the PMD range. The
> restriction on max_ptes_none is removed during the scan, to make sure we
> account for the whole PMD range. When no mTHP size is enabled, the legacy
> behavior of khugepaged is maintained. max_ptes_none will be scaled by the
> attempted collapse order to determine how full a THP must be to be
> eligible. If a mTHP collapse is attempted, but contains swapped out, or
> shared pages, we dont perform the collapse.
>
> With the default max_ptes_none=511, the code should keep its most of its
> original behavior. To exercise mTHP collapse we need to set
> max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
> experience collapse "creep" and constantly promote mTHPs to the next
> available size. This is due the fact that it will introduce at least 2x
> the number of pages, and on a future scan will satisfy that condition once
> again.
>
> Patch 1: Refactor/rename hpage_collapse
> Patch 2: Some refactoring to combine madvise_collapse and khugepaged
> Patch 3-5: Generalize khugepaged functions for arbitrary orders
> Patch 6-9: The mTHP patches
> Patch 10-11: Tracing/stats
> Patch 12: Documentation
When I tested 64K mTHP collapse and disabled PMD-sized THP, I found that
khugepaged couldn't scan and collapse 64K mTHP. I sent out two fix
patches[1], and with these patches applied, 64K mTHP collapse works
well. I hope my two patches can be folded into the next version of your
series if you think there are no issues. Thanks.
[1]
https://lore.kernel.org/all/ac9ed6d71b439611f9c94b3506a8ce975d4636e9.1748435162.git.baolin.wang@linux.alibaba.com/
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-28 9:26 ` David Hildenbrand
@ 2025-05-28 14:04 ` Baolin Wang
2025-05-29 4:02 ` Nico Pache
0 siblings, 1 reply; 57+ messages in thread
From: Baolin Wang @ 2025-05-28 14:04 UTC (permalink / raw)
To: David Hildenbrand, Nico Pache, David Rientjes, zokeefe
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, hannes, mhocko, rdunlap
On 2025/5/28 17:26, David Hildenbrand wrote:
> On 22.05.25 11:39, Baolin Wang wrote:
>>
>>
>> On 2025/5/21 18:23, Nico Pache wrote:
>>> On Tue, May 20, 2025 at 4:09 AM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>> Sorry for late reply.
>>>>
>>>> On 2025/5/17 14:47, Nico Pache wrote:
>>>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>>>>> khugepaged scans anons PMD ranges for potential collapse to a
>>>>>>> hugepage.
>>>>>>> To add mTHP support we use this scan to instead record chunks of
>>>>>>> utilized
>>>>>>> sections of the PMD.
>>>>>>>
>>>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a
>>>>>>> bitmap
>>>>>>> that represents chunks of utilized regions. We can then determine
>>>>>>> what
>>>>>>> mTHP size fits best and in the following patch, we set this
>>>>>>> bitmap while
>>>>>>> scanning the anon PMD. A minimum collapse order of 2 is used as
>>>>>>> this is
>>>>>>> the lowest order supported by anon memory.
>>>>>>>
>>>>>>> max_ptes_none is used as a scale to determine how "full" an order
>>>>>>> must
>>>>>>> be before being considered for collapse.
>>>>>>>
>>>>>>> When attempting to collapse an order that has its order set to
>>>>>>> "always"
>>>>>>> lets always collapse to that order in a greedy manner without
>>>>>>> considering the number of bits set.
>>>>>>>
>>>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>>>
>>>>>> Sigh. You still haven't addressed or explained the issues I
>>>>>> previously
>>>>>> raised [1], so I don't know how to review this patch again...
>>>>> Can you still reproduce this issue?
>>>>
>>>> Yes, I can still reproduce this issue with today's (5/20) mm-new
>>>> branch.
>>>>
>>>> I've disabled PMD-sized THP in my system:
>>>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
>>>> always madvise [never]
>>>> [root]# cat
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>> always inherit madvise [never]
>>>>
>>>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
>>>> and I can still see it collapsing to a PMD-sized THP.
>>> Hi Baolin ! Thank you for your reply and willingness to test again :)
>>>
>>> I didn't realize we were talking about madvise collapse-- this makes
>>> sense now. I also figured out why I could "reproduce" it before. My
>>> script was always enabling the THP settings in two places, and I only
>>> commented out one to test this. But this time I was doing more manual
>>> testing.
>>>
>>> The original design of madvise_collapse ignores the sysfs and
>>> collapses even if you have an order disabled. I believe this behavior
>>> is wrong, but by design. I spent some time playing around with madvise
>>> collapses with and w/o my changes. This is not a new thing, I
>>> reproduced the issue in 6.11 (Fedora 41), and I think its been
>>> possible since the inception of madvise collapse 3 years ago. I
>>> noticed a similar behavior on one of my RFC since it was "breaking"
>>> selftests, and the fix was to reincorporate this broken sysfs
>>> behavior.
>>
>> OK. Thanks for the explanation.
>>
>>> 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage
>>> collapse")
>>> "This call is independent of the system-wide THP sysfs settings, but
>>> will fail for memory marked VM_NOHUGEPAGE."
>>>
>>> The second condition holds true (and fails for VM_NOHUGEPAGE), but I
>>> dont know if we actually want madvise_collapse to be independent of
>>> the system-wide.
>>
>> This design principle surprised me a bit, and I failed to find the
>> reason in the commit log. I agree that "never should mean never," and we
>> should respect the THP/mTHP sysfs setting. Additionally, for the
>> 'shmem_enabled' sysfs interface controlled for shmem/tmpfs, THP collapse
>> can still be prohibited through the 'deny' configuration. The rules here
>> are somewhat confusing.
>
> I recall that we decided to overwrite "VM_NOHUGEPAGE", because the
> assumption is that the same app that triggered MADV_NOHUGEPAGE triggers
> the collapse. So the app decides on its own behavior.
>
> Similarly, allowing for collapsing in a VM without VM_HUGEPAGE in the
> "madvise" mode would be fine.
>
> But in the "never" case, we should just "never" collapse.
OK. Let's fix the "never" case first. Thanks.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 00/12] khugepaged: mTHP support
2025-05-28 12:39 ` [PATCH v7 00/12] khugepaged: mTHP support Baolin Wang
@ 2025-05-29 3:52 ` Nico Pache
0 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-05-29 3:52 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Wed, May 28, 2025 at 6:39 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/5/15 11:22, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse anonymous memory regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. This info is
> > tracked using a bitmap. After the PMD scan is done, we do binary recursion
> > on the bitmap to find the optimal mTHP sizes for the PMD range. The
> > restriction on max_ptes_none is removed during the scan, to make sure we
> > account for the whole PMD range. When no mTHP size is enabled, the legacy
> > behavior of khugepaged is maintained. max_ptes_none will be scaled by the
> > attempted collapse order to determine how full a THP must be to be
> > eligible. If a mTHP collapse is attempted, but contains swapped out, or
> > shared pages, we dont perform the collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set
> > max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
> > experience collapse "creep" and constantly promote mTHPs to the next
> > available size. This is due the fact that it will introduce at least 2x
> > the number of pages, and on a future scan will satisfy that condition once
> > again.
> >
> > Patch 1: Refactor/rename hpage_collapse
> > Patch 2: Some refactoring to combine madvise_collapse and khugepaged
> > Patch 3-5: Generalize khugepaged functions for arbitrary orders
> > Patch 6-9: The mTHP patches
> > Patch 10-11: Tracing/stats
> > Patch 12: Documentation
>
> When I tested 64K mTHP collapse and disabled PMD-sized THP, I found that
> khugepaged couldn't scan and collapse 64K mTHP. I send out two fix
> patches[1], and with these patches applied, 64K mTHP collapse works
> well. I hope my two patches can be folded into your next version series
> if you think there are no issues. Thanks.
Thank you for looking into that and fixing it. I had originally
decided to only allow khugepaged to collapse to mTHP if the PMD size
was enabled as well; it was on my todo list :) I'll work on adding
your patches to my set and do some proper testing again!
>
> [1]
> https://lore.kernel.org/all/ac9ed6d71b439611f9c94b3506a8ce975d4636e9.1748435162.git.baolin.wang@linux.alibaba.com/
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-05-23 6:55 ` Baolin Wang
2025-05-28 6:57 ` Dev Jain
@ 2025-05-29 4:00 ` Nico Pache
2025-05-30 3:02 ` Baolin Wang
1 sibling, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-05-29 4:00 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Fri, May 23, 2025 at 12:55 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/5/15 11:22, Nico Pache wrote:
> > For khugepaged to support different mTHP orders, we must generalize this
> > to check if the PMD is not shared by another VMA and the order is
> > enabled.
> >
> > No functional change in this patch.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5457571d505a..0c4d6a02d59c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
> > static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > bool expect_anon,
> > struct vm_area_struct **vmap,
> > - struct collapse_control *cc)
> > + struct collapse_control *cc, int order)
> > {
> > struct vm_area_struct *vma;
> > unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> > @@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>
> Sorry, I missed this before. Should we also change 'PMD_ORDER' to
> 'order' for the thp_vma_suitable_order()?
This was changed since the last version (v5) due to an email from Hugh.
https://lore.kernel.org/lkml/7a81339c-f9e5-a718-fa7f-6e3fb134dca5@google.com/
As I noted in my reply to him, although he was not able to reproduce
an issue due to this, we always need to revalidate at the PMD order to
verify that the PMD range is not shared by another VMA.
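In other words, roughly (a kernel-context sketch mirroring the hunks
quoted below, not complete code):

/*
 * Suitability stays at PMD_ORDER so a PMD range partially covered by
 * another VMA is rejected, while the sysfs policy check uses the
 * order actually being collapsed.
 */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
	return SCAN_ADDRESS_RANGE;	/* range/alignment check: PMD_ORDER */
if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
	return SCAN_VMA_CHECK;		/* policy check: collapse order */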
-- Nico
>
> > return SCAN_ADDRESS_RANGE;
> > - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> > + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
> > return SCAN_VMA_CHECK;
> > /*
> > * Anon VMA expected, the address may be unmapped then
> > @@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > @@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > @@ -2782,7 +2782,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > mmap_read_lock(mm);
> > mmap_locked = true;
> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > - cc);
> > + cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED) {
> > last_fail = result;
> > goto out_nolock;
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-28 14:04 ` Baolin Wang
@ 2025-05-29 4:02 ` Nico Pache
2025-05-29 8:27 ` Baolin Wang
0 siblings, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-05-29 4:02 UTC (permalink / raw)
To: Baolin Wang
Cc: David Hildenbrand, David Rientjes, zokeefe, linux-mm, linux-doc,
linux-kernel, linux-trace-kernel, ziy, lorenzo.stoakes,
Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, hannes, mhocko, rdunlap
On Wed, May 28, 2025 at 8:04 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/5/28 17:26, David Hildenbrand wrote:
> > On 22.05.25 11:39, Baolin Wang wrote:
> >>
> >>
> >> On 2025/5/21 18:23, Nico Pache wrote:
> >>> On Tue, May 20, 2025 at 4:09 AM Baolin Wang
> >>> <baolin.wang@linux.alibaba.com> wrote:
> >>>>
> >>>> Sorry for late reply.
> >>>>
> >>>> On 2025/5/17 14:47, Nico Pache wrote:
> >>>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
> >>>>> <baolin.wang@linux.alibaba.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2025/5/15 11:22, Nico Pache wrote:
> >>>>>>> khugepaged scans anons PMD ranges for potential collapse to a
> >>>>>>> hugepage.
> >>>>>>> To add mTHP support we use this scan to instead record chunks of
> >>>>>>> utilized
> >>>>>>> sections of the PMD.
> >>>>>>>
> >>>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a
> >>>>>>> bitmap
> >>>>>>> that represents chunks of utilized regions. We can then determine
> >>>>>>> what
> >>>>>>> mTHP size fits best and in the following patch, we set this
> >>>>>>> bitmap while
> >>>>>>> scanning the anon PMD. A minimum collapse order of 2 is used as
> >>>>>>> this is
> >>>>>>> the lowest order supported by anon memory.
> >>>>>>>
> >>>>>>> max_ptes_none is used as a scale to determine how "full" an order
> >>>>>>> must
> >>>>>>> be before being considered for collapse.
> >>>>>>>
> >>>>>>> When attempting to collapse an order that has its order set to
> >>>>>>> "always"
> >>>>>>> lets always collapse to that order in a greedy manner without
> >>>>>>> considering the number of bits set.
> >>>>>>>
> >>>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>>>>>
> >>>>>> Sigh. You still haven't addressed or explained the issues I
> >>>>>> previously
> >>>>>> raised [1], so I don't know how to review this patch again...
> >>>>> Can you still reproduce this issue?
> >>>>
> >>>> Yes, I can still reproduce this issue with today's (5/20) mm-new
> >>>> branch.
> >>>>
> >>>> I've disabled PMD-sized THP in my system:
> >>>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
> >>>> always madvise [never]
> >>>> [root]# cat
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>> always inherit madvise [never]
> >>>>
> >>>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
> >>>> and I can still see it collapsing to a PMD-sized THP.
> >>> Hi Baolin ! Thank you for your reply and willingness to test again :)
> >>>
> >>> I didn't realize we were talking about madvise collapse-- this makes
> >>> sense now. I also figured out why I could "reproduce" it before. My
> >>> script was always enabling the THP settings in two places, and I only
> >>> commented out one to test this. But this time I was doing more manual
> >>> testing.
> >>>
> >>> The original design of madvise_collapse ignores the sysfs and
> >>> collapses even if you have an order disabled. I believe this behavior
> >>> is wrong, but by design. I spent some time playing around with madvise
> >>> collapses with and w/o my changes. This is not a new thing, I
> >>> reproduced the issue in 6.11 (Fedora 41), and I think its been
> >>> possible since the inception of madvise collapse 3 years ago. I
> >>> noticed a similar behavior on one of my RFC since it was "breaking"
> >>> selftests, and the fix was to reincorporate this broken sysfs
> >>> behavior.
> >>
> >> OK. Thanks for the explanation.
> >>
> >>> 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage
> >>> collapse")
> >>> "This call is independent of the system-wide THP sysfs settings, but
> >>> will fail for memory marked VM_NOHUGEPAGE."
> >>>
> >>> The second condition holds true (and fails for VM_NOHUGEPAGE), but I
> >>> dont know if we actually want madvise_collapse to be independent of
> >>> the system-wide.
> >>
> >> This design principle surprised me a bit, and I failed to find the
> >> reason in the commit log. I agree that "never should mean never," and we
> >> should respect the THP/mTHP sysfs setting. Additionally, for the
> >> 'shmem_enabled' sysfs interface controlled for shmem/tmpfs, THP collapse
> >> can still be prohibited through the 'deny' configuration. The rules here
> >> are somewhat confusing.
> >
> > I recall that we decided to overwrite "VM_NOHUGEPAGE", because the
> > assumption is that the same app that triggered MADV_NOHUGEPAGE triggers
> > the collapse. So the app decides on its own behavior.
> >
> > Similarly, allowing for collapsing in a VM without VM_HUGEPAGE in the
> > "madvise" mode would be fine.
> >
> > But in the "never" case, we should just "never" collapse.
>
> OK. Let's fix the "never" case first. Thanks.
Great, I will update that in the next version!
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
2025-05-29 4:02 ` Nico Pache
@ 2025-05-29 8:27 ` Baolin Wang
0 siblings, 0 replies; 57+ messages in thread
From: Baolin Wang @ 2025-05-29 8:27 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand, David Rientjes, zokeefe, linux-mm, linux-doc,
linux-kernel, linux-trace-kernel, ziy, lorenzo.stoakes,
Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, hannes, mhocko, rdunlap
On 2025/5/29 12:02, Nico Pache wrote:
> On Wed, May 28, 2025 at 8:04 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/5/28 17:26, David Hildenbrand wrote:
>>> On 22.05.25 11:39, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/5/21 18:23, Nico Pache wrote:
>>>>> On Tue, May 20, 2025 at 4:09 AM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>> Sorry for late reply.
>>>>>>
>>>>>> On 2025/5/17 14:47, Nico Pache wrote:
>>>>>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang
>>>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>>>>>>> khugepaged scans anons PMD ranges for potential collapse to a
>>>>>>>>> hugepage.
>>>>>>>>> To add mTHP support we use this scan to instead record chunks of
>>>>>>>>> utilized
>>>>>>>>> sections of the PMD.
>>>>>>>>>
>>>>>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a
>>>>>>>>> bitmap
>>>>>>>>> that represents chunks of utilized regions. We can then determine
>>>>>>>>> what
>>>>>>>>> mTHP size fits best and in the following patch, we set this
>>>>>>>>> bitmap while
>>>>>>>>> scanning the anon PMD. A minimum collapse order of 2 is used as
>>>>>>>>> this is
>>>>>>>>> the lowest order supported by anon memory.
>>>>>>>>>
>>>>>>>>> max_ptes_none is used as a scale to determine how "full" an order
>>>>>>>>> must
>>>>>>>>> be before being considered for collapse.
>>>>>>>>>
>>>>>>>>> When attempting to collapse an order that has its order set to
>>>>>>>>> "always"
>>>>>>>>> lets always collapse to that order in a greedy manner without
>>>>>>>>> considering the number of bits set.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>>>>>
>>>>>>>> Sigh. You still haven't addressed or explained the issues I
>>>>>>>> previously
>>>>>>>> raised [1], so I don't know how to review this patch again...
>>>>>>> Can you still reproduce this issue?
>>>>>>
>>>>>> Yes, I can still reproduce this issue with today's (5/20) mm-new
>>>>>> branch.
>>>>>>
>>>>>> I've disabled PMD-sized THP in my system:
>>>>>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
>>>>>> always madvise [never]
>>>>>> [root]# cat
>>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>> always inherit madvise [never]
>>>>>>
>>>>>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
>>>>>> and I can still see it collapsing to a PMD-sized THP.
>>>>> Hi Baolin ! Thank you for your reply and willingness to test again :)
>>>>>
>>>>> I didn't realize we were talking about madvise collapse-- this makes
>>>>> sense now. I also figured out why I could "reproduce" it before. My
>>>>> script was always enabling the THP settings in two places, and I only
>>>>> commented out one to test this. But this time I was doing more manual
>>>>> testing.
>>>>>
>>>>> The original design of madvise_collapse ignores the sysfs settings and
>>>>> collapses even if you have an order disabled. I believe this behavior
>>>>> is wrong, but by design. I spent some time playing around with madvise
>>>>> collapses with and w/o my changes. This is not a new thing; I
>>>>> reproduced the issue in 6.11 (Fedora 41), and I think it's been
>>>>> possible since the inception of madvise collapse 3 years ago. I
>>>>> noticed a similar behavior on one of my RFCs since it was "breaking"
>>>>> selftests, and the fix was to reincorporate this broken sysfs
>>>>> behavior.
>>>>
>>>> OK. Thanks for the explanation.
>>>>
>>>>> 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage
>>>>> collapse")
>>>>> "This call is independent of the system-wide THP sysfs settings, but
>>>>> will fail for memory marked VM_NOHUGEPAGE."
>>>>>
>>>>> The second condition holds true (and fails for VM_NOHUGEPAGE), but I
>>>>> don't know if we actually want madvise_collapse to be independent of
>>>>> the system-wide settings.
>>>>
>>>> This design principle surprised me a bit, and I failed to find the
>>>> reason in the commit log. I agree that "never should mean never," and we
>>>> should respect the THP/mTHP sysfs setting. Additionally, for the
>>>> 'shmem_enabled' sysfs interface that controls shmem/tmpfs, THP collapse
>>>> can still be prohibited through the 'deny' configuration. The rules here
>>>> are somewhat confusing.
>>>
>>> I recall that we decided to overwrite "VM_NOHUGEPAGE", because the
>>> assumption is that the same app that triggered MADV_NOHUGEPAGE triggers
>>> the collapse. So the app decides on its own behavior.
>>>
>>> Similarly, allowing for collapsing in a VMA without VM_HUGEPAGE in the
>>> "madvise" mode would be fine.
>>>
>>> But in the "never" case, we should just "never" collapse.
>>
>> OK. Let's fix the "never" case first. Thanks.
> Great, I will update that in the next version!
I've sent a patchset to fix the MADV_COLLAPSE issue for anonymous memory
and shmem [1]. Please have a look.
[1]
https://lore.kernel.org/all/cover.1748506520.git.baolin.wang@linux.alibaba.com/
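For context, the reproducer discussed earlier in this thread boils down
to something like the following (a minimal sketch; error handling is
omitted, MADV_COLLAPSE needs recent kernel headers, and the mapping is
assumed to be PMD-aligned, which a plain mmap() does not guarantee):

    #include <string.h>
    #include <sys/mman.h>

    #define SZ (2UL << 20) /* one PMD-sized (2M) region */

    int main(void)
    {
            /* Assumes all THP sysfs knobs are set to "never", as shown above. */
            char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            memset(p, 1, SZ); /* populate the PTEs */

            /* Before the fix, this still collapsed to a PMD-sized THP. */
            return madvise(p, SZ, MADV_COLLAPSE);
    }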
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-05-29 4:00 ` Nico Pache
@ 2025-05-30 3:02 ` Baolin Wang
0 siblings, 0 replies; 57+ messages in thread
From: Baolin Wang @ 2025-05-30 3:02 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 2025/5/29 12:00, Nico Pache wrote:
> On Fri, May 23, 2025 at 12:55 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/5/15 11:22, Nico Pache wrote:
>>> For khugepaged to support different mTHP orders, we must generalize this
>>> to check if the PMD is not shared by another VMA and the order is
>>> enabled.
>>>
>>> No functional change in this patch.
>>>
>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 10 +++++-----
>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 5457571d505a..0c4d6a02d59c 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
>>> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>>> bool expect_anon,
>>> struct vm_area_struct **vmap,
>>> - struct collapse_control *cc)
>>> + struct collapse_control *cc, int order)
>>> {
>>> struct vm_area_struct *vma;
>>> unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>>> @@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>>>
>>> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>>
>> Sorry, I missed this before. Should we also change 'PMD_ORDER' to
>> 'order' for the thp_vma_suitable_order()?
> This was changed since the last version (v5) due to an email from Hugh.
> https://lore.kernel.org/lkml/7a81339c-f9e5-a718-fa7f-6e3fb134dca5@google.com/
>
> As I noted in my reply to him, although he was not able to reproduce
> an issue due to this, we always need to revalidate the PMD order to
> verify the PMD range is not shared by another VMA.
OK, I see. Better to add some comments, like Hugh did, to make it clear.
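Something along these lines, perhaps (just a sketch of the suggested
comment; the check itself is the one from the hunk quoted above):

    /*
     * Always revalidate at PMD_ORDER, even when collapsing to a smaller
     * mTHP order: the whole PMD range must still belong to this single
     * VMA, otherwise another VMA could span part of the range we are
     * about to collapse.
     */
    if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
            return SCAN_ADDRESS_RANGE;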
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 07/12] khugepaged: add mTHP support
2025-05-15 3:22 ` [PATCH v7 07/12] khugepaged: add " Nico Pache
@ 2025-06-07 6:23 ` Dev Jain
2025-06-07 12:55 ` Nico Pache
2025-06-07 13:03 ` Nico Pache
0 siblings, 2 replies; 57+ messages in thread
From: Dev Jain @ 2025-06-07 6:23 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap
On 15/05/25 8:52 am, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning PMD ranges for potential collapse candidates, keep track
> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> mTHPs are enabled we remove the restriction of max_ptes_none during the
> scan phase so we don't bail out early and miss potential mTHP candidates.
>
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
>
> If a mTHP collapse is attempted, but contains swapped-out or shared
> pages, we don't perform the collapse.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP
Why? I know that Hugh pointed out locking errors; I am yet to catch up
on that thread, but you need to explain in the description why you do
what you do.
[--snip---]
>
> -
> - spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> - folio_add_lru_vma(folio, vma);
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - set_pmd_at(mm, address, pmd, _pmd);
> - update_mmu_cache_pmd(vma, address, pmd);
> - deferred_split_folio(folio, false);
> - spin_unlock(pmd_ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + pgtable = pmd_pgtable(_pmd);
> + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> + spin_lock(pmd_ptl);
> + BUG_ON(!pmd_none(*pmd));
> + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + set_pmd_at(mm, address, pmd, _pmd);
> + update_mmu_cache_pmd(vma, address, pmd);
> + deferred_split_folio(folio, false);
> + spin_unlock(pmd_ptl);
> + } else { /* mTHP collapse */
> + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> + spin_lock(pmd_ptl);
Nico,
I've noticed a few occasions where my review comments have not been acknowledged -
for example, [1]. It makes it difficult to follow up and contributes to some
frustration on my end. I'd appreciate if you could make sure to respond to
feedback, even if you are disagreeing with my comments. Thanks!
[1] https://lore.kernel.org/all/08d13445-5ed1-42ea-8aee-c1dbde24407e@arm.com/
[---snip---]
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 07/12] khugepaged: add mTHP support
2025-06-07 6:23 ` Dev Jain
@ 2025-06-07 12:55 ` Nico Pache
2025-06-07 13:03 ` Nico Pache
1 sibling, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-06-07 12:55 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Sat, Jun 7, 2025 at 12:24 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 15/05/25 8:52 am, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we don't bail out early and miss potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped-out or shared
> > pages, we don't perform the collapse.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP
>
> Why? I know that Hugh pointed out locking errors; I am yet to catch up
> on that thread, but you need to explain in the description why you do
> what you do.
I will add a better description in the next version. The reasoning is
that in the PMD case all the pages are isolated, but in the non-PMD
case this is not true, so we must keep the anon_vma write lock held
until the collapse is finished; otherwise the still-mapped pages could
be changed underneath us.
Another potential solution is to isolate all the pages in the PMD,
then undo the isolation after we collapse the mTHP.
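Roughly, the ordering difference looks like this (a simplified sketch
of the two flows, not the literal patch code):

    /*
     * PMD collapse: every page in the range has been isolated, so it is
     * safe to drop the anon_vma write lock before installing the PMD.
     */
    anon_vma_unlock_write(vma->anon_vma);
    /* ... install the huge PMD ... */

    /*
     * mTHP collapse: pages outside the collapsed range stay mapped, so
     * the anon_vma write lock is held until the new PTEs are in place.
     */
    /* ... install the mTHP PTEs ... */
    anon_vma_unlock_write(vma->anon_vma);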
-- Nico
>
> [--snip---]
>
> >
> > -
> > - spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > - folio_add_lru_vma(folio, vma);
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > - set_pmd_at(mm, address, pmd, _pmd);
> > - update_mmu_cache_pmd(vma, address, pmd);
> > - deferred_split_folio(folio, false);
> > - spin_unlock(pmd_ptl);
> > + if (order == HPAGE_PMD_ORDER) {
> > + pgtable = pmd_pgtable(_pmd);
> > + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > + spin_lock(pmd_ptl);
> > + BUG_ON(!pmd_none(*pmd));
> > + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + set_pmd_at(mm, address, pmd, _pmd);
> > + update_mmu_cache_pmd(vma, address, pmd);
> > + deferred_split_folio(folio, false);
> > + spin_unlock(pmd_ptl);
> > + } else { /* mTHP collapse */
> > + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > + spin_lock(pmd_ptl);
>
> Nico,
>
> I've noticed a few occasions where my review comments have not been acknowledged -
> for example, [1]. It makes it difficult to follow up and contributes to some
> frustration on my end. I'd appreciate if you could make sure to respond to
> feedback, even if you are disagreeing with my comments. Thanks!
>
>
> [1] https://lore.kernel.org/all/08d13445-5ed1-42ea-8aee-c1dbde24407e@arm.com/
>
>
> [---snip---]
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
[not found] ` <bc8f72f3-01d9-43db-a632-1f4b9a1d5276@arm.com>
@ 2025-06-07 12:57 ` Nico Pache
2025-06-07 14:34 ` Dev Jain
0 siblings, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-06-07 12:57 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 15/05/25 8:52 am, Nico Pache wrote:
>
> Now that we can collapse to mTHPs let's update the admin guide to
> reflect these changes and provide proper guidance on how to utilize it.
>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index dff8d5985f0f..5c63fe51b3ad 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
>
>
> We need to modify/remove the following paragraph:
>
> khugepaged currently only searches for opportunities to collapse to
> PMD-sized THP and no attempt is made to collapse to other THP
> sizes.
On this version this is currently still true, but once I add Baolin's
patch it will not be true. Thanks for the reminder :)
-- Nico
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 07/12] khugepaged: add mTHP support
2025-06-07 6:23 ` Dev Jain
2025-06-07 12:55 ` Nico Pache
@ 2025-06-07 13:03 ` Nico Pache
2025-06-07 14:31 ` Dev Jain
1 sibling, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-06-07 13:03 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Sat, Jun 7, 2025 at 12:24 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 15/05/25 8:52 am, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we don't bail out early and miss potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped-out or shared
> > pages, we don't perform the collapse.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP
>
> Why? I know that Hugh pointed out locking errors; I am yet to catch up
> on that thread, but you need to explain in the description why you do
> what you do.
>
> [--snip---]
>
> >
> > -
> > - spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > - folio_add_lru_vma(folio, vma);
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > - set_pmd_at(mm, address, pmd, _pmd);
> > - update_mmu_cache_pmd(vma, address, pmd);
> > - deferred_split_folio(folio, false);
> > - spin_unlock(pmd_ptl);
> > + if (order == HPAGE_PMD_ORDER) {
> > + pgtable = pmd_pgtable(_pmd);
> > + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > + spin_lock(pmd_ptl);
> > + BUG_ON(!pmd_none(*pmd));
> > + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + set_pmd_at(mm, address, pmd, _pmd);
> > + update_mmu_cache_pmd(vma, address, pmd);
> > + deferred_split_folio(folio, false);
> > + spin_unlock(pmd_ptl);
> > + } else { /* mTHP collapse */
> > + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > + spin_lock(pmd_ptl);
>
> Nico,
>
> I've noticed a few occasions where my review comments have not been acknowledged -
> for example, [1]. It makes it difficult to follow up and contributes to some
> frustration on my end. I'd appreciate if you could make sure to respond to
> feedback, even if you are disagreeing with my comments. Thanks!
I'm sorry you feel that way; are there any others? I feel like I've
been pretty good at responding to all comments. I've also been out of
the office for the last month, so keeping up with upstream has been
more difficult, but I'm back now.
Sorry I never got back to you on that one! I will add the BUG_ON, but
I believe it's unnecessary. Your changeset was focused on different
functionality, and it seems that you had a bug in it if you were
hitting it that often.
Cheers,
-- Nico
>
>
> [1] https://lore.kernel.org/all/08d13445-5ed1-42ea-8aee-c1dbde24407e@arm.com/
>
>
> [---snip---]
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 07/12] khugepaged: add mTHP support
2025-06-07 13:03 ` Nico Pache
@ 2025-06-07 14:31 ` Dev Jain
2025-06-07 14:42 ` Dev Jain
0 siblings, 1 reply; 57+ messages in thread
From: Dev Jain @ 2025-06-07 14:31 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 07/06/25 6:33 pm, Nico Pache wrote:
> On Sat, Jun 7, 2025 at 12:24 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>> On 15/05/25 8:52 am, Nico Pache wrote:
>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>> While scanning PMD ranges for potential collapse candidates, keep track
>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>> scan phase so we don't bail out early and miss potential mTHP candidates.
>>>
>>> After the scan is complete we will perform binary recursion on the
>>> bitmap to determine which mTHP size would be most efficient to collapse
>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>> determine how full a THP must be to be eligible.
>>>
>>> If a mTHP collapse is attempted, but contains swapped-out or shared
>>> pages, we don't perform the collapse.
>>>
>>> For non-PMD collapse we must leave the anon VMA write locked until after
>>> we collapse the mTHP
>> Why? I know that Hugh pointed out locking errors; I am yet to catch up
>> on that thread, but you need to explain in the description why you do
>> what you do.
>>
>> [--snip---]
>>
>>> -
>>> - spin_lock(pmd_ptl);
>>> - BUG_ON(!pmd_none(*pmd));
>>> - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>> - folio_add_lru_vma(folio, vma);
>>> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> - set_pmd_at(mm, address, pmd, _pmd);
>>> - update_mmu_cache_pmd(vma, address, pmd);
>>> - deferred_split_folio(folio, false);
>>> - spin_unlock(pmd_ptl);
>>> + if (order == HPAGE_PMD_ORDER) {
>>> + pgtable = pmd_pgtable(_pmd);
>>> + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
>>> + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>> +
>>> + spin_lock(pmd_ptl);
>>> + BUG_ON(!pmd_none(*pmd));
>>> + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
>>> + folio_add_lru_vma(folio, vma);
>>> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> + set_pmd_at(mm, address, pmd, _pmd);
>>> + update_mmu_cache_pmd(vma, address, pmd);
>>> + deferred_split_folio(folio, false);
>>> + spin_unlock(pmd_ptl);
>>> + } else { /* mTHP collapse */
>>> + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>>> + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>>> +
>>> + spin_lock(pmd_ptl);
>> Nico,
>>
>> I've noticed a few occasions where my review comments have not been acknowledged -
>> for example, [1]. It makes it difficult to follow up and contributes to some
>> frustration on my end. I'd appreciate if you could make sure to respond to
>> feedback, even if you are disagreeing with my comments. Thanks!
> I'm sorry you feel that way; are there any others? I feel like I've
> been pretty good at responding to all comments. I've also been out of
> the office for the last month, so keeping up with upstream has been
> more difficult, but I'm back now.
No issues; there were others, but I don't want to waste our time digging
them up when we are on the same page!
>
> Sorry I never got back to you on that one! I will add the BUG_ON, but
> I believe it's unnecessary. Your changeset was focused on different
> functionality, and it seems that you had a bug in it if you were
> hitting it that often.
In my original reply, when I said "I hit the BUG_ON a lot of times",
I meant during testing. It was quite difficult to extend for
non-PMD-sized VMAs, and the BUG_ON was getting hit due to rmap reaching
the non-isolated folios and somehow installing the PMD again. That is
why I say the BUG_ON is important: it will help us catch bugs early.
And we have it for the PMD case anyway, so why not for mTHP...
>
> Cheers,
> -- Nico
>>
>> [1] https://lore.kernel.org/all/08d13445-5ed1-42ea-8aee-c1dbde24407e@arm.com/
>>
>>
>> [---snip---]
>>
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-06-07 12:57 ` Nico Pache
@ 2025-06-07 14:34 ` Dev Jain
2025-06-08 19:50 ` Nico Pache
0 siblings, 1 reply; 57+ messages in thread
From: Dev Jain @ 2025-06-07 14:34 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On 07/06/25 6:27 pm, Nico Pache wrote:
> On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>> On 15/05/25 8:52 am, Nico Pache wrote:
>>
>> Now that we can collapse to mTHPs let's update the admin guide to
>> reflect these changes and provide proper guidance on how to utilize it.
>>
>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index dff8d5985f0f..5c63fe51b3ad 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>
>>
>> We need to modify/remove the following paragraph:
>>
>> khugepaged currently only searches for opportunities to collapse to
>> PMD-sized THP and no attempt is made to collapse to other THP
>> sizes.
> On this version this is currently still true, but once I add Baolin's
> patch it will not be true. Thanks for the reminder :)
You referenced Baolin's patch in the other email too, can you send the link,
or the patch?
>
> -- Nico
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 07/12] khugepaged: add mTHP support
2025-06-07 14:31 ` Dev Jain
@ 2025-06-07 14:42 ` Dev Jain
0 siblings, 0 replies; 57+ messages in thread
From: Dev Jain @ 2025-06-07 14:42 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 07/06/25 8:01 pm, Dev Jain wrote:
>
> On 07/06/25 6:33 pm, Nico Pache wrote:
>> On Sat, Jun 7, 2025 at 12:24 AM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>> On 15/05/25 8:52 am, Nico Pache wrote:
>>>> Introduce the ability for khugepaged to collapse to different mTHP
>>>> sizes.
>>>> While scanning PMD ranges for potential collapse candidates, keep
>>>> track
>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER
>>>> ptes. If
>>>> mTHPs are enabled we remove the restriction of max_ptes_none during
>>>> the
>>>> scan phase so we don't bail out early and miss potential mTHP
>>>> candidates.
>>>>
>>>> After the scan is complete we will perform binary recursion on the
>>>> bitmap to determine which mTHP size would be most efficient to
>>>> collapse
>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>> determine how full a THP must be to be eligible.
>>>>
>>>> If a mTHP collapse is attempted, but contains swapped-out or shared
>>>> pages, we don't perform the collapse.
>>>>
>>>> For non-PMD collapse we must leave the anon VMA write locked until
>>>> after
>>>> we collapse the mTHP
>>> Why? I know that Hugh pointed out locking errors; I am yet to catch up
>>> on that thread, but you need to explain in the description why you do
>>> what you do.
>>>
>>> [--snip---]
>>>
>>>> -
>>>> - spin_lock(pmd_ptl);
>>>> - BUG_ON(!pmd_none(*pmd));
>>>> - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>>> - folio_add_lru_vma(folio, vma);
>>>> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>>> - set_pmd_at(mm, address, pmd, _pmd);
>>>> - update_mmu_cache_pmd(vma, address, pmd);
>>>> - deferred_split_folio(folio, false);
>>>> - spin_unlock(pmd_ptl);
>>>> + if (order == HPAGE_PMD_ORDER) {
>>>> + pgtable = pmd_pgtable(_pmd);
>>>> + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
>>>> + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>>>> +
>>>> + spin_lock(pmd_ptl);
>>>> + BUG_ON(!pmd_none(*pmd));
>>>> + folio_add_new_anon_rmap(folio, vma, _address,
>>>> RMAP_EXCLUSIVE);
>>>> + folio_add_lru_vma(folio, vma);
>>>> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>>> + set_pmd_at(mm, address, pmd, _pmd);
>>>> + update_mmu_cache_pmd(vma, address, pmd);
>>>> + deferred_split_folio(folio, false);
>>>> + spin_unlock(pmd_ptl);
>>>> + } else { /* mTHP collapse */
>>>> + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>>>> + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>>>> +
>>>> + spin_lock(pmd_ptl);
>>> Nico,
>>>
>>> I've noticed a few occasions where my review comments have not been
>>> acknowledged -
>>> for example, [1]. It makes it difficult to follow up and contributes
>>> to some
>>> frustration on my end. I'd appreciate if you could make sure to
>>> respond to
>>> feedback, even if you are disagreeing with my comments. Thanks!
>> I'm sorry you feel that way; are there any others? I feel like I've
>> been pretty good at responding to all comments. I've also been out of
>> the office for the last month, so keeping up with upstream has been
>> more difficult, but I'm back now.
>
> No issues; there were others, but I don't want to waste our time digging
> them up when we are on the same page!
To be clear, those others were from when we were debating your method
versus mine; that is why I said it is a waste of time revisiting them :)
>
>>
>> Sorry I never got back to you on that one! I will add the BUG_ON, but
>> I believe it's unnecessary. Your changeset was focused on different
>> functionality, and it seems that you had a bug in it if you were
>> hitting it that often.
>
> In my original reply, when I said "I hit the BUG_ON a lot of times",
> I meant during testing. It was quite difficult to extend for
> non-PMD-sized VMAs, and the BUG_ON was getting hit due to rmap reaching
> the non-isolated folios and somehow installing the PMD again. That is
> why I say the BUG_ON is important: it will help us catch bugs early.
> And we have it for the PMD case anyway, so why not for mTHP...
>
>>
>> Cheers,
>> -- Nico
>>>
>>> [1]
>>> https://lore.kernel.org/all/08d13445-5ed1-42ea-8aee-c1dbde24407e@arm.com/
>>>
>>>
>>> [---snip---]
>>>
>>
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-06-07 14:34 ` Dev Jain
@ 2025-06-08 19:50 ` Nico Pache
2025-06-09 3:06 ` Baolin Wang
0 siblings, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-06-08 19:50 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On Sat, Jun 7, 2025 at 8:35 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 07/06/25 6:27 pm, Nico Pache wrote:
> > On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >> On 15/05/25 8:52 am, Nico Pache wrote:
> >>
> >> Now that we can collapse to mTHPs let's update the admin guide to
> >> reflect these changes and provide proper guidance on how to utilize it.
> >>
> >> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >> ---
> >> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
> >> 1 file changed, 13 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >> index dff8d5985f0f..5c63fe51b3ad 100644
> >> --- a/Documentation/admin-guide/mm/transhuge.rst
> >> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >>
> >>
> >> We need to modify/remove the following paragraph:
> >>
> >> khugepaged currently only searches for opportunities to collapse to
> >> PMD-sized THP and no attempt is made to collapse to other THP
> >> sizes.
> > On this version this is currently still true, but once I add Baolin's
> > patch it will not be true. Thanks for the reminder :)
>
> You referenced Baolin's patch in the other email too, can you send the link,
> or the patch?
He didn't send them to the mailing list, but rather off-list to all the
recipients of this series. You should have them in your email; look for
Subject: "mm: khugepaged: allow khugepaged to check all anonymous mTHP
orders" and "mm: khugepaged: kick khugepaged for enabling
none-PMD-sized mTHPs"
-- Nico
>
> >
> > -- Nico
> >
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-06-08 19:50 ` Nico Pache
@ 2025-06-09 3:06 ` Baolin Wang
2025-06-09 5:26 ` Dev Jain
2025-06-09 5:56 ` Nico Pache
0 siblings, 2 replies; 57+ messages in thread
From: Baolin Wang @ 2025-06-09 3:06 UTC (permalink / raw)
To: Nico Pache, Dev Jain
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On 2025/6/9 03:50, Nico Pache wrote:
> On Sat, Jun 7, 2025 at 8:35 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> On 07/06/25 6:27 pm, Nico Pache wrote:
>>> On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>> On 15/05/25 8:52 am, Nico Pache wrote:
>>>>
>>>> Now that we can collapse to mTHPs let's update the admin guide to
>>>> reflect these changes and provide proper guidance on how to utilize it.
>>>>
>>>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>> ---
>>>> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
>>>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>> index dff8d5985f0f..5c63fe51b3ad 100644
>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>>
>>>>
>>>> We need to modify/remove the following paragraph:
>>>>
>>>> khugepaged currently only searches for opportunities to collapse to
>>>> PMD-sized THP and no attempt is made to collapse to other THP
>>>> sizes.
>>> On this version this is currently still true, but once I add Baolin's
>>> patch it will not be true. Thanks for the reminder :)
>>
>> You referenced Baolin's patch in the other email too, can you send the link,
>> or the patch?
>
> He didn't send them to the mailing list, but rather off-list to all the
> recipients of this series. You should have them in your email; look for
>
> Subject: "mm: khugepaged: allow khugepaged to check all anonymous mTHP
> orders" and "mm: khugepaged: kick khugepaged for enabling
> none-PMD-sized mTHPs"
You can find them at the following link:
https://lore.kernel.org/all/ac9ed6d71b439611f9c94b3506a8ce975d4636e9.1748435162.git.baolin.wang@linux.alibaba.com/
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-06-09 3:06 ` Baolin Wang
@ 2025-06-09 5:26 ` Dev Jain
2025-06-09 6:39 ` Baolin Wang
2025-06-09 5:56 ` Nico Pache
1 sibling, 1 reply; 57+ messages in thread
From: Dev Jain @ 2025-06-09 5:26 UTC (permalink / raw)
To: Baolin Wang, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On 09/06/25 8:36 am, Baolin Wang wrote:
>
>
> On 2025/6/9 03:50, Nico Pache wrote:
>> On Sat, Jun 7, 2025 at 8:35 AM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>> On 07/06/25 6:27 pm, Nico Pache wrote:
>>>> On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>
>>>>> On 15/05/25 8:52 am, Nico Pache wrote:
>>>>>
>>>>> Now that we can collapse to mTHPs let's update the admin guide to
>>>>> reflect these changes and provide proper guidance on how to
>>>>> utilize it.
>>>>>
>>>>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>> ---
>>>>> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
>>>>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst
>>>>> b/Documentation/admin-guide/mm/transhuge.rst
>>>>> index dff8d5985f0f..5c63fe51b3ad 100644
>>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>>>
>>>>>
>>>>> We need to modify/remove the following paragraph:
>>>>>
>>>>> khugepaged currently only searches for opportunities to collapse to
>>>>> PMD-sized THP and no attempt is made to collapse to other THP
>>>>> sizes.
>>>> On this version this is currently still true, but once I add Baolin's
>>>> patch it will not be true. Thanks for the reminder :)
>>>
>>> You referenced Baolin's patch in the other email too, can you send
>>> the link,
>>> or the patch?
>>
>> He didn't send them to the mailing list, but rather off-list to all the
>> recipients of this series. You should have them in your email; look for
>>
>> Subject: "mm: khugepaged: allow khugepaged to check all anonymous mTHP
>> orders" and "mm: khugepaged: kick khugepaged for enabling
>> none-PMD-sized mTHPs"
>
> You can find them at the following link:
> https://lore.kernel.org/all/ac9ed6d71b439611f9c94b3506a8ce975d4636e9.1748435162.git.baolin.wang@linux.alibaba.com/
>
Thanks! Looks quite similar to my approach.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-06-09 3:06 ` Baolin Wang
2025-06-09 5:26 ` Dev Jain
@ 2025-06-09 5:56 ` Nico Pache
1 sibling, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-06-09 5:56 UTC (permalink / raw)
To: Baolin Wang
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On Sun, Jun 8, 2025 at 9:06 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/6/9 03:50, Nico Pache wrote:
> > On Sat, Jun 7, 2025 at 8:35 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >> On 07/06/25 6:27 pm, Nico Pache wrote:
> >>> On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>
> >>>> On 15/05/25 8:52 am, Nico Pache wrote:
> >>>>
> >>>> Now that we can collapse to mTHPs let's update the admin guide to
> >>>> reflect these changes and provide proper guidance on how to utilize it.
> >>>>
> >>>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> >>>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>>> ---
> >>>> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
> >>>> 1 file changed, 13 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >>>> index dff8d5985f0f..5c63fe51b3ad 100644
> >>>> --- a/Documentation/admin-guide/mm/transhuge.rst
> >>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >>>>
> >>>>
> >>>> We need to modify/remove the following paragraph:
> >>>>
> >>>> khugepaged currently only searches for opportunities to collapse to
> >>>> PMD-sized THP and no attempt is made to collapse to other THP
> >>>> sizes.
> >>> On this version this is currently still true, but once I add Baolin's
> >>> patch it will not be true. Thanks for the reminder :)
> >>
> >> You referenced Baolin's patch in the other email too, can you send the link,
> >> or the patch?
> >
> > He didn't send them to the mailing list, but rather off-list to all the
> > recipients of this series. You should have them in your email; look for
> >
> > Subject: "mm: khugepaged: allow khugepaged to check all anonymous mTHP
> > orders" and "mm: khugepaged: kick khugepaged for enabling
> > none-PMD-sized mTHPs"
>
> You can find them at the following link:
> https://lore.kernel.org/all/ac9ed6d71b439611f9c94b3506a8ce975d4636e9.1748435162.git.baolin.wang@linux.alibaba.com/
Ah, whoops, you did send them on-list. I'm so used to seeing the
mailing lists as the first recipients on the Cc list xD
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse
2025-06-09 5:26 ` Dev Jain
@ 2025-06-09 6:39 ` Baolin Wang
0 siblings, 0 replies; 57+ messages in thread
From: Baolin Wang @ 2025-06-09 6:39 UTC (permalink / raw)
To: Dev Jain, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, Bagas Sanjaya
On 2025/6/9 13:26, Dev Jain wrote:
>
> On 09/06/25 8:36 am, Baolin Wang wrote:
>>
>>
>> On 2025/6/9 03:50, Nico Pache wrote:
>>> On Sat, Jun 7, 2025 at 8:35 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>>
>>>> On 07/06/25 6:27 pm, Nico Pache wrote:
>>>>> On Sat, Jun 7, 2025 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>>
>>>>>> On 15/05/25 8:52 am, Nico Pache wrote:
>>>>>>
>>>>>> Now that we can collapse to mTHPs let's update the admin guide to
>>>>>> reflect these changes and provide proper guidance on how to
>>>>>> utilize it.
>>>>>>
>>>>>> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>>>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>>>> ---
>>>>>> Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
>>>>>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst
>>>>>> b/Documentation/admin-guide/mm/transhuge.rst
>>>>>> index dff8d5985f0f..5c63fe51b3ad 100644
>>>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>>>>
>>>>>>
>>>>>> We need to modify/remove the following paragraph:
>>>>>>
>>>>>> khugepaged currently only searches for opportunities to collapse to
>>>>>> PMD-sized THP and no attempt is made to collapse to other THP
>>>>>> sizes.
>>>>> On this version this is currently still true, but once I add Baolin's
>>>>> patch it will not be true. Thanks for the reminder :)
>>>>
>>>> You referenced Baolin's patch in the other email too, can you send
>>>> the link,
>>>> or the patch?
>>>
>>> He didn't send them to the mailing list, but rather off-list to all the
>>> recipients of this series. You should have them in your email; look for
>>>
>>> Subject: "mm: khugepaged: allow khugepaged to check all anonymous mTHP
>>> orders" and "mm: khugepaged: kick khugepaged for enabling
>>> none-PMD-sized mTHPs"
>>
>> You can find them at the following link:
>> https://lore.kernel.org/all/ac9ed6d71b439611f9c94b3506a8ce975d4636e9.1748435162.git.baolin.wang@linux.alibaba.com/
>
>
> Thanks! Looks quite similar to my approach.
Sorry, I didn't see your previous patches (if I had, it would have saved
me a lot of testing time :)).
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 00/12] khugepaged: mTHP support
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
` (13 preceding siblings ...)
2025-05-28 12:39 ` [PATCH v7 00/12] khugepaged: mTHP support Baolin Wang
@ 2025-06-16 3:51 ` Dev Jain
2025-06-16 15:51 ` Nico Pache
14 siblings, 1 reply; 57+ messages in thread
From: Dev Jain @ 2025-06-16 3:51 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap
On 15/05/25 8:52 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse anonymous memory regions to mTHPs.
Hi Nico,
Can you tell the expected date of posting v8 of this patchset?
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 00/12] khugepaged: mTHP support
2025-06-16 3:51 ` Dev Jain
@ 2025-06-16 15:51 ` Nico Pache
2025-06-16 16:35 ` Dev Jain
0 siblings, 1 reply; 57+ messages in thread
From: Nico Pache @ 2025-06-16 15:51 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On Sun, Jun 15, 2025 at 9:52 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 15/05/25 8:52 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse anonymous memory regions to mTHPs.
>
> Hi Nico,
Hey Dev!
>
> Can you tell the expected date of posting v8 of this patchset?
Hopefully by next week, although it may be longer (as I try to catch
up on everything after PTO). We were originally targeting 6.16, but we
missed that window, so I need to repost for 6.17, which we have plenty
of time for. I've also been releasing new versions more slowly, as
previously I was not giving reviewers enough time to actually review
between my different versions (and this creates a lot of noise in
people's inboxes).
I'm also going through some of the testing again, this time with
redis-memtier (as David suggested).
Cheers,
-- Nico
>
> >
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 00/12] khugepaged: mTHP support
2025-06-16 15:51 ` Nico Pache
@ 2025-06-16 16:35 ` Dev Jain
0 siblings, 0 replies; 57+ messages in thread
From: Dev Jain @ 2025-06-16 16:35 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap
On 16/06/25 9:21 pm, Nico Pache wrote:
> On Sun, Jun 15, 2025 at 9:52 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> On 15/05/25 8:52 am, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse anonymous memory regions to mTHPs.
>> Hi Nico,
> Hey Dev!
>
>> Can you tell the expected date of posting v8 of this patchset?
> Hopefully by next week, although it may be longer (as I try to catch
> up on everything after PTO). We were originally targeting 6.16, but we
> missed that window, so I need to repost for 6.17, which we have plenty
> of time for. I've also been releasing new versions more slowly, as
> previously I was not giving reviewers enough time to actually review
> between my different versions (and this creates a lot of noise in
> people's inboxes).
>
> I'm also going through some of the testing again, this time with
> redis-memtier (as David suggested).
Sure!
>
> Cheers,
> -- Nico
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 01/12] khugepaged: rename hpage_collapse_* to khugepaged_*
2025-05-16 17:30 ` Liam R. Howlett
@ 2025-06-29 6:48 ` Nico Pache
0 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-06-29 6:48 UTC (permalink / raw)
To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, ziy, baolin.wang, lorenzo.stoakes,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
On Fri, May 16, 2025 at 11:36 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Nico Pache <npache@redhat.com> [250514 23:23]:
> > functions in khugepaged.c use a mix of hpage_collapse and khugepaged
> > as the function prefix.
> >
> > rename all of them to khugepaged to keep things consistent and slightly
> > shorten the function names.
>
> I don't like what was done here; we've lost the context of what these
> functions are used for (collapse). Are they used for other things
> besides collapse?
Hi Liam,
Most of the renamed functions are used by the daemon to determine the
state of the khugepaged operations. You could argue that
madvise_collapse is not part of the daemon, and a couple of them are
used by both khugepaged and madvise_collapse, but what I do in the
subsequent patch is *mostly* unify madvise_collapse and khugepaged.
I personally believe this rename makes sense (and it was recommended
by David H.).
-- Nico
>
> I'd rather drop the prefix entirely than drop collapse from them all.
> They are all static, so do we really need khugepaged_ at the start of
> every static function in khugepaged.c?
>
>
> >
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 42 +++++++++++++++++++++---------------------
> > 1 file changed, 21 insertions(+), 21 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index cdf5a581368b..806bcd8c5185 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
> > kmem_cache_destroy(mm_slot_cache);
> > }
> >
> > -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> > +static inline int khugepaged_test_exit(struct mm_struct *mm)
> > {
> > return atomic_read(&mm->mm_users) == 0;
> > }
> >
> > -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> > +static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
> > {
> > - return hpage_collapse_test_exit(mm) ||
> > + return khugepaged_test_exit(mm) ||
> > test_bit(MMF_DISABLE_THP, &mm->flags);
> > }
> >
> > @@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
> > int wakeup;
> >
> > /* __khugepaged_exit() must not run from under us */
> > - VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> > + VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
> > if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
> > return;
> >
> > @@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
> > } else if (mm_slot) {
> > /*
> > * This is required to serialize against
> > - * hpage_collapse_test_exit() (which is guaranteed to run
> > + * khugepaged_test_exit() (which is guaranteed to run
> > * under mmap sem read mode). Stop here (after we return all
> > * pagetables will be destroyed) until khugepaged has finished
> > * working on the pagetables under the mmap_lock.
> > @@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
> > .is_khugepaged = true,
> > };
> >
> > -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> > +static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> > {
> > int i;
> >
> > @@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> > }
> >
> > #ifdef CONFIG_NUMA
> > -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> > +static int khugepaged_find_target_node(struct collapse_control *cc)
> > {
> > int nid, target_node = 0, max_value = 0;
> >
> > @@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> > return target_node;
> > }
> > #else
> > -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> > +static int khugepaged_find_target_node(struct collapse_control *cc)
> > {
> > return 0;
> > }
> > @@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > struct vm_area_struct *vma;
> > unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >
> > - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > + if (unlikely(khugepaged_test_exit_or_disable(mm)))
> > return SCAN_ANY_PROCESS;
> >
> > *vmap = vma = find_vma(mm, address);
> > @@ -992,7 +992,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> >
> > /*
> > * Bring missing pages in from swap, to complete THP collapse.
> > - * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> > + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > *
> > * Called and returns without pte mapped or spinlocks held.
> > * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> > @@ -1078,7 +1078,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> > {
> > gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > GFP_TRANSHUGE);
> > - int node = hpage_collapse_find_target_node(cc);
> > + int node = khugepaged_find_target_node(cc);
> > struct folio *folio;
> >
> > folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> > @@ -1264,7 +1264,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > return result;
> > }
> >
> > -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > +static int khugepaged_scan_pmd(struct mm_struct *mm,
> > struct vm_area_struct *vma,
> > unsigned long address, bool *mmap_locked,
> > struct collapse_control *cc)
> > @@ -1378,7 +1378,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > * hit record.
> > */
> > node = folio_nid(folio);
> > - if (hpage_collapse_scan_abort(node, cc)) {
> > + if (khugepaged_scan_abort(node, cc)) {
> > result = SCAN_SCAN_ABORT;
> > goto out_unmap;
> > }
> > @@ -1447,7 +1447,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >
> > lockdep_assert_held(&khugepaged_mm_lock);
> >
> > - if (hpage_collapse_test_exit(mm)) {
> > + if (khugepaged_test_exit(mm)) {
> > /* free mm_slot */
> > hash_del(&slot->hash);
> > list_del(&slot->mm_node);
> > @@ -1740,7 +1740,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > continue;
> >
> > - if (hpage_collapse_test_exit(mm))
> > + if (khugepaged_test_exit(mm))
> > continue;
> > /*
> > * When a vma is registered with uffd-wp, we cannot recycle
> > @@ -2262,7 +2262,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > +static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
> > struct file *file, pgoff_t start,
> > struct collapse_control *cc)
> > {
> > @@ -2307,7 +2307,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > }
> >
> > node = folio_nid(folio);
> > - if (hpage_collapse_scan_abort(node, cc)) {
> > + if (khugepaged_scan_abort(node, cc)) {
> > result = SCAN_SCAN_ABORT;
> > break;
> > }
> > @@ -2391,7 +2391,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > goto breakouterloop_mmap_lock;
> >
> > progress++;
> > - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > + if (unlikely(khugepaged_test_exit_or_disable(mm)))
> > goto breakouterloop;
> >
> > vma_iter_init(&vmi, mm, khugepaged_scan.address);
> > @@ -2399,7 +2399,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > unsigned long hstart, hend;
> >
> > cond_resched();
> > - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> > + if (unlikely(khugepaged_test_exit_or_disable(mm))) {
> > progress++;
> > break;
> > }
> > @@ -2421,7 +2421,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > bool mmap_locked = true;
> >
> > cond_resched();
> > - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > + if (unlikely(khugepaged_test_exit_or_disable(mm)))
> > goto breakouterloop;
> >
> > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > @@ -2481,7 +2481,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > * Release the current mm_slot if this mm is about to die, or
> > * if we scanned all vmas of this mm.
> > */
> > - if (hpage_collapse_test_exit(mm) || !vma) {
> > + if (khugepaged_test_exit(mm) || !vma) {
> > /*
> > * Make sure that if mm_users is reaching zero while
> > * khugepaged runs here, khugepaged_exit will find
> > --
> > 2.49.0
> >
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-05-16 17:14 ` Liam R. Howlett
@ 2025-06-29 6:52 ` Nico Pache
0 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-06-29 6:52 UTC (permalink / raw)
To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, ziy, baolin.wang, lorenzo.stoakes,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
On Fri, May 16, 2025 at 11:15 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Nico Pache <npache@redhat.com> [250514 23:23]:
> > For khugepaged to support different mTHP orders, we must generalize
> > hugepage_vma_revalidate() to check that the PMD is not shared by
> > another VMA and that the requested order is enabled.
> >
> > No functional change in this patch.
>
> This patch needs to be with the functional change for git blame and
> reviewing the changes.
I don't think that is the case. I've seen many series that piecemeal
their changes, including separating out nonfunctional changes before
the actual functional change. A lot of small changes were required to
generalize this for mTHP collapse; doing it all in one patch would
have made the mTHP support patch huge and noisy. I tried to make that
patch cleaner (for review purposes) by separating out some of the
noise.
-- Nico
>
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5457571d505a..0c4d6a02d59c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
> > static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > bool expect_anon,
> > struct vm_area_struct **vmap,
> > - struct collapse_control *cc)
> > + struct collapse_control *cc, int order)
> > {
> > struct vm_area_struct *vma;
> > unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> > @@ -934,7 +934,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> > return SCAN_ADDRESS_RANGE;
> > - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> > + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
> > return SCAN_VMA_CHECK;
> > /*
> > * Anon VMA expected, the address may be unmapped then
> > @@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > @@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > @@ -2782,7 +2782,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > mmap_read_lock(mm);
> > mmap_locked = true;
> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > - cc);
> > + cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED) {
> > last_fail = result;
> > goto out_nolock;
> > --
> > 2.49.0
> >
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
2025-05-16 17:12 ` Liam R. Howlett
@ 2025-07-02 0:00 ` Nico Pache
0 siblings, 0 replies; 57+ messages in thread
From: Nico Pache @ 2025-07-02 0:00 UTC (permalink / raw)
To: Liam R. Howlett, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, ziy, baolin.wang, lorenzo.stoakes,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
On Fri, May 16, 2025 at 11:13 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Nico Pache <npache@redhat.com> [250514 23:23]:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create khugepaged_collapse_single_pmd to increase code
> > reuse and create an entry point for future khugepaged changes.
> >
> > Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> > the new khugepaged_collapse_single_pmd function.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
> > 1 file changed, 49 insertions(+), 47 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 806bcd8c5185..5457571d505a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int khugepaged_collapse_single_pmd(unsigned long addr,
> > + struct vm_area_struct *vma, bool *mmap_locked,
> > + struct collapse_control *cc)
> > +{
> > + int result = SCAN_FAIL;
> > + struct mm_struct *mm = vma->vm_mm;
> > +
> > + if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
>
> why IS_ENABLED(CONFIG_SHMEM) here, it seems new?
Fixed in the next version. It was a mishandled rebase conflict.
>
> > + struct file *file = get_file(vma->vm_file);
> > + pgoff_t pgoff = linear_page_index(vma, addr);
> > +
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + result = khugepaged_scan_file(mm, addr, file, pgoff, cc);
> > + fput(file);
> > + if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > + mmap_read_lock(mm);
> > + *mmap_locked = true;
> > + if (khugepaged_test_exit_or_disable(mm)) {
> > + result = SCAN_ANY_PROCESS;
> > + goto end;
> > + }
> > + result = collapse_pte_mapped_thp(mm, addr,
> > + !cc->is_khugepaged);
> > + if (result == SCAN_PMD_MAPPED)
> > + result = SCAN_SUCCEED;
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + }
> > + } else {
> > + result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > + }
> > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > + ++khugepaged_pages_collapsed;
> > +end:
> > + return result;
>
> This function can return with mmap_read_locked or unlocked..
>
> > +}
> > +
> > static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > struct collapse_control *cc)
> > __releases(&khugepaged_mm_lock)
> > @@ -2427,34 +2469,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > hend);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma,
> > - khugepaged_scan.address);
> > -
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > - *result = hpage_collapse_scan_file(mm,
> > - khugepaged_scan.address, file, pgoff, cc);
> > - fput(file);
> > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > - mmap_read_lock(mm);
> > - if (hpage_collapse_test_exit_or_disable(mm))
> > - goto breakouterloop;
> > - *result = collapse_pte_mapped_thp(mm,
> > - khugepaged_scan.address, false);
> > - if (*result == SCAN_PMD_MAPPED)
> > - *result = SCAN_SUCCEED;
> > - mmap_read_unlock(mm);
> > - }
> > - } else {
> > - *result = hpage_collapse_scan_pmd(mm, vma,
> > - khugepaged_scan.address, &mmap_locked, cc);
> > - }
> > -
> > - if (*result == SCAN_SUCCEED)
> > - ++khugepaged_pages_collapsed;
> >
> > + *result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
> > + vma, &mmap_locked, cc);
> > + /* If we return SCAN_ANY_PROCESS we are holding the mmap_lock */
>
> But this comment makes it obvious that you know that..
>
> > + if (*result == SCAN_ANY_PROCESS)
> > + goto breakouterloop;
>
> But later..
>
> breakouterloop:
> mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
> breakouterloop_mmap_lock:
>
>
> So if you return with SCAN_ANY_PROCESS, we are holding the lock and go
> immediately and drop it. This seems unnecessarily complicated and
> involves a lock.
SCAN_ANY_PROCESS indicates that the process we are working on has
either exited or had THP disabled mid-scan, so we have to drop the
lock regardless.
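
For reference, the check behind that result is tiny; roughly (a
sketch of the existing helpers with the v7 renames applied, not
copied verbatim):

	static bool khugepaged_test_exit(struct mm_struct *mm)
	{
		/* mm_users == 0 means the process is on its way out */
		return atomic_read(&mm->mm_users) == 0;
	}

	static bool khugepaged_test_exit_or_disable(struct mm_struct *mm)
	{
		return khugepaged_test_exit(mm) ||
		       test_bit(MMF_DISABLE_THP, &mm->flags);
	}

Once either condition is true there is nothing left worth holding the
lock for.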
>
> That would leave just the khugepaged_scan_pmd() path with the
> unfortunate locking mess - which is a static function and called in one
> location..
>
> Looking at what happens after the return seems to indicate we could
> clean that up as well, sometime later.
I see your point: all other instances handle the unlock within their
own function, and this one should too. Instead of handling the unlock
in the parent function, I should just return with it unlocked and let
the already established if (!mmap_locked) check do the cleanup.
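
Something like this inside khugepaged_collapse_single_pmd() (a
hypothetical sketch of the fix, not the posted code):

	if (khugepaged_test_exit_or_disable(mm)) {
		result = SCAN_ANY_PROCESS;
		mmap_read_unlock(mm);
		*mmap_locked = false;	/* tell the caller the lock is gone */
		goto end;
	}

That way the callers can rely on *mmap_locked alone instead of
special-casing SCAN_ANY_PROCESS.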
>
> > /* move to next address */
> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > progress += HPAGE_PMD_NR;
> > @@ -2773,36 +2793,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > mmap_assert_locked(mm);
> > memset(cc->node_load, 0, sizeof(cc->node_load));
> > nodes_clear(cc->alloc_nmask);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma, addr);
> >
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > - result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > - cc);
> > - fput(file);
> > - } else {
> > - result = hpage_collapse_scan_pmd(mm, vma, addr,
> > - &mmap_locked, cc);
> > - }
> > + result = khugepaged_collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > +
> > if (!mmap_locked)
> > *prev = NULL; /* Tell caller we dropped mmap_lock */
> >
> > -handle_result:
> > switch (result) {
> > case SCAN_SUCCEED:
> > case SCAN_PMD_MAPPED:
> > ++thps;
> > break;
> > case SCAN_PTE_MAPPED_HUGEPAGE:
> > - BUG_ON(mmap_locked);
> > - BUG_ON(*prev);
> > - mmap_read_lock(mm);
> > - result = collapse_pte_mapped_thp(mm, addr, true);
> > - mmap_read_unlock(mm);
> > - goto handle_result;
>
> All of the above should probably be replaced with a BUG_ON(1) since it's
> not expected now? Or at least WARN_ON_ONCE(), but it should be safe to
> continue if that's the case.
I don't think we should warn, as this is the return value indicating
that we are trying to collapse to an mTHP that is smaller than the
already established folio (see __collapse_huge_page_isolate), but
continuing should be OK.
>
> It looks like the mmap_locked boolean is used to ensure that *prev is
> safe, but we are now dropping the lock and re-acquiring it (and
> potentially returning here) with it set to true, so *prev will not be
> set to NULL like it should.
Luckily Lorenzo just cleaned this up with the madvise code changes he
made, but yes you are correct.
>
> I think you can handle this by ensuring that
> khugepaged_collapse_single_pmd() returns with mmap_locked false in the
> SCAN_ANY_PROCESS case.
>
> > - /* Whitelisted set of results where continuing OK */
>
> This seems worth keeping?
I'll add that back, thanks.
>
> > case SCAN_PMD_NULL:
> > case SCAN_PTE_NON_PRESENT:
> > case SCAN_PTE_UFFD_WP:
>
> I guess SCAN_ANY_PROCESS should be handled by the default case
> statement? It should probably be added to the switch?
I believe it should be handled by the default case: since we don't
want to continue, we break out as intended.
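
i.e. the handling ends up shaped like this (abridged from the hunk
above; the whitelist is longer in the real code and the out label is
illustrative):

	switch (result) {
	case SCAN_SUCCEED:
	case SCAN_PMD_MAPPED:
		++thps;
		break;
	/* Whitelisted set of results where continuing OK */
	case SCAN_PMD_NULL:
	case SCAN_PTE_NON_PRESENT:
	case SCAN_PTE_UFFD_WP:
		last_fail = result;
		break;
	default:
		/* SCAN_ANY_PROCESS lands here and stops the scan */
		last_fail = result;
		goto out;
	}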
>
> That is to say, before your change the result would come from either
> hpage_collapse_scan_file(), then lead to collapse_pte_mapped_thp()
> above.
In the khugepaged case we do the khugepaged_test_exit_or_disable()
check before calling collapse_pte_mapped_thp(), but we weren't doing
it in the madvise_collapse case. It seems we had either a lingering
bug or unnecessary code in the original implementation (it's been
that way since day 1). I believe having the same check for both is
wise, although now I have to ask why we aren't using the revalidate
function like all other callers do when they drop the lock. I will
note this small difference in the commit log and will invest some
time in the future into cleaning up this madness. I think unifying
these two callers into one, as I'm trying to do here, will make these
behavioral deviations harder to introduce in the future, and we can
have sanity knowing there is *mostly* one way to call the collapse.
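
For comparison, the pattern every other caller follows after dropping
the lock (sketched from the hugepage_vma_revalidate() hunks in patch
3) looks roughly like:

	mmap_read_lock(mm);
	result = hugepage_vma_revalidate(mm, addr, false, &vma, cc,
					 HPAGE_PMD_ORDER);
	if (result != SCAN_SUCCEED) {
		mmap_read_unlock(mm);
		goto out_nolock;
	}

madvise_collapse could likely do the same before calling
collapse_pte_mapped_thp().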
>
> Now, you can have khugepaged_test_exit_or_disable() happen to return
> SCAN_ANY_PROCESS and it will fall through to the default in this switch
> statement, which seems like new behaviour?
>
> At the very least, this information should be added to the git log on
> what this patch does - if it's expected?
Will do. Thanks for the thought-provoking review; I had to do some
digging to verify this one :)
-- Nico
>
> Thanks,
> Liam
>
^ permalink raw reply [flat|nested] 57+ messages in thread
Thread overview: 57+ messages
2025-05-15 3:22 [PATCH v7 00/12] khugepaged: mTHP support Nico Pache
2025-05-15 3:22 ` [PATCH v7 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-05-16 17:30 ` Liam R. Howlett
2025-06-29 6:48 ` Nico Pache
2025-05-15 3:22 ` [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-05-15 5:50 ` Baolin Wang
2025-05-16 11:59 ` Nico Pache
2025-05-16 17:12 ` Liam R. Howlett
2025-07-02 0:00 ` Nico Pache
2025-05-15 3:22 ` [PATCH v7 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-05-16 17:14 ` Liam R. Howlett
2025-06-29 6:52 ` Nico Pache
2025-05-23 6:55 ` Baolin Wang
2025-05-28 6:57 ` Dev Jain
2025-05-29 4:00 ` Nico Pache
2025-05-30 3:02 ` Baolin Wang
2025-05-15 3:22 ` [PATCH v7 04/12] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-05-15 3:22 ` [PATCH v7 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-05-15 3:22 ` [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-05-16 3:20 ` Baolin Wang
2025-05-17 6:47 ` Nico Pache
2025-05-18 3:04 ` Liam R. Howlett
2025-05-20 10:09 ` Baolin Wang
2025-05-20 10:26 ` David Hildenbrand
2025-05-21 1:03 ` Baolin Wang
2025-05-21 10:23 ` Nico Pache
2025-05-22 9:39 ` Baolin Wang
2025-05-28 9:26 ` David Hildenbrand
2025-05-28 14:04 ` Baolin Wang
2025-05-29 4:02 ` Nico Pache
2025-05-29 8:27 ` Baolin Wang
2025-05-15 3:22 ` [PATCH v7 07/12] khugepaged: add " Nico Pache
2025-06-07 6:23 ` Dev Jain
2025-06-07 12:55 ` Nico Pache
2025-06-07 13:03 ` Nico Pache
2025-06-07 14:31 ` Dev Jain
2025-06-07 14:42 ` Dev Jain
2025-05-15 3:22 ` [PATCH v7 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-05-15 3:22 ` [PATCH v7 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-05-15 3:22 ` [PATCH v7 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-05-15 3:22 ` [PATCH v7 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
2025-05-15 3:22 ` [PATCH v7 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-05-15 4:40 ` Randy Dunlap
[not found] ` <bc8f72f3-01d9-43db-a632-1f4b9a1d5276@arm.com>
2025-06-07 12:57 ` Nico Pache
2025-06-07 14:34 ` Dev Jain
2025-06-08 19:50 ` Nico Pache
2025-06-09 3:06 ` Baolin Wang
2025-06-09 5:26 ` Dev Jain
2025-06-09 6:39 ` Baolin Wang
2025-06-09 5:56 ` Nico Pache
2025-05-28 12:31 ` [PATCH 1/2] mm: khugepaged: allow khugepaged to check all anonymous mTHP orders Baolin Wang
2025-05-28 12:31 ` [PATCH 2/2] mm: khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Baolin Wang
2025-05-28 12:39 ` [PATCH v7 00/12] khugepaged: mTHP support Baolin Wang
2025-05-29 3:52 ` Nico Pache
2025-06-16 3:51 ` Dev Jain
2025-06-16 15:51 ` Nico Pache
2025-06-16 16:35 ` Dev Jain