* [PATCH v5 00/12] khugepaged: mTHP support
@ 2025-04-28 18:12 Nico Pache
  2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

The following series provides khugepaged and madvise_collapse with the
capability to collapse anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
(defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. This info is
tracked using a bitmap. After the PMD scan is done, we do binary recursion
on the bitmap to find the optimal mTHP sizes for the PMD range. The
restriction on max_ptes_none is removed during the scan, to make sure we
account for the whole PMD range. When no mTHP size is enabled, the legacy
behavior of khugepaged is maintained. max_ptes_none will be scaled by the
attempted collapse order to determine how full a THP must be to be
eligible. If an mTHP collapse is attempted but the range contains
swapped-out or shared pages, we don't perform the collapse.
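
For reference, the chunk granularity and bitmap size below come from the
constants introduced in patch 6; the numbers assume a 4K base page
(HPAGE_PMD_ORDER = 9):

  #define KHUGEPAGED_MIN_MTHP_ORDER  2   /* one bit covers 2^2 = 4 PTEs */
  #define KHUGEPAGED_MIN_MTHP_NR     (1 << KHUGEPAGED_MIN_MTHP_ORDER)
  #define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
  /* => 128 bits describe the 512 PTEs of one PMD range */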

With the default max_ptes_none=511, the code should keep most of its
original behavior. To exercise mTHP collapse, max_ptes_none must be set
to 255 or lower. With max_ptes_none > HPAGE_PMD_NR/2 you will experience
collapse "creep" and constantly promote mTHPs to the next available
size: once more than half of the PTEs are allowed to be none, a freshly
collapsed region by itself already satisfies the scaled threshold for
the next larger order.
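
A minimal sketch of the scaling (the shift is the one used throughout
the series; the helper name here is only for illustration):

  /*
   * Illustrative helper, not part of the series: scale the PMD-order
   * max_ptes_none threshold down to a smaller collapse order.  With the
   * default max_ptes_none = 511 and a 4K base page, order 4 (64K) gives
   * 511 >> (9 - 4) = 15, i.e. at most 15 of the 16 PTEs may be
   * none/zero for the collapse to proceed.
   */
  static inline int scaled_max_ptes_none(unsigned int order)
  {
          return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
  }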

Patch 1:     Refactor/rename hpage_collapse
Patch 2:     Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5:   Generalize khugepaged functions for arbitrary orders
Patch 6-9:   The mTHP patches
Patch 10-11: Tracing/stats
Patch 12:    Documentation

---------
 Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
   while monitoring a number of stats and tracepoints. The code is
   available here[1] (run in legacy mode for these changes and set mTHP
   sizes to inherit).
   The summary of my testing was that there was no significant
   regression observed with this test. In some cases my changes had
   better collapse latencies and were able to scan more pages in the
   same amount of time/work, but for the most part the results were
   consistent.
- redis testing. I tested these changes along with my defer changes
  (see followup post for more details).
- some basic testing on 64k page size.
- lots of general use.

Changes since V4 [2]:
- switched the order of patches 1 and 2
- fixed some edge cases on the unified madvise_collapse and khugepaged
- Explained the "creep" some more in the docs
- fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
- fix potential highmem issue caused by an early unmap of the PTE

Changes since V3:
- Rebased onto mm-unstable
   commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
- small changes to Documentation

Changes since V2:
- corrected legacy behavior for khugepaged and madvise_collapse
- added proper mTHP stat tracking
- Minor changes to prevent a nested lock on non-split-lock arches
- Took Dev's version of alloc_charge_folio as it has the proper stats
- Skip cases where trying to collapse to a lower order would still fail
- Fixed cases where the bitmap was not being updated properly
- Moved Documentation update to this series instead of the defer set
- Minor bugs discovered during testing and review
- Minor "nit" cleanup


Changes since V1:
- Minor bug fixes discovered during review and testing
- removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
- Updated trace events to include collapsing order info.
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
- No longer require a chunk to be fully utilized before setting the bit.
   Use the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps
   prevent some of the "creep" that was discovered in v1.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250417000238.74567-1-npache@redhat.com/

Dev Jain (1):
  khugepaged: generalize alloc_charge_folio()

Nico Pache (11):
  khugepaged: rename hpage_collapse_* to khugepaged_*
  introduce khugepaged_collapse_single_pmd to unify khugepaged and
    madvise_collapse
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: skip collapsing mTHP to smaller orders
  khugepaged: avoid unnecessary mTHP collapse attempts
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: add per-order mTHP khugepaged stats
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  14 +-
 include/linux/huge_mm.h                    |   5 +
 include/linux/khugepaged.h                 |   4 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 464 ++++++++++++++-------
 6 files changed, 376 insertions(+), 156 deletions(-)

-- 
2.48.1



* [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_*
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-29 13:41   ` Zi Yan
  2025-04-30  7:47   ` Baolin Wang
  2025-04-28 18:12 ` [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

Functions in khugepaged.c use a mix of hpage_collapse and khugepaged
as the function prefix.

Rename all of them to khugepaged to keep things consistent and slightly
shorten the function names.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cf204ab6af0..8b96274dce07 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
 	kmem_cache_destroy(mm_slot_cache);
 }
 
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
 {
-	return hpage_collapse_test_exit(mm) ||
+	return khugepaged_test_exit(mm) ||
 	       test_bit(MMF_DISABLE_THP, &mm->flags);
 }
 
@@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 	int wakeup;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
 		return;
 
@@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * khugepaged_test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we return all
 		 * pagetables will be destroyed) until khugepaged has finished
 		 * working on the pagetables under the mmap_lock.
@@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 };
 
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -992,7 +992,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1078,7 +1078,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
-	int node = hpage_collapse_find_target_node(cc);
+	int node = khugepaged_find_target_node(cc);
 	struct folio *folio;
 
 	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1264,7 +1264,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+static int khugepaged_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
 				   struct collapse_control *cc)
@@ -1378,7 +1378,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1447,7 +1447,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (khugepaged_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
@@ -1742,7 +1742,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 
-		if (hpage_collapse_test_exit(mm))
+		if (khugepaged_test_exit(mm))
 			continue;
 		/*
 		 * When a vma is registered with uffd-wp, we cannot recycle
@@ -2264,7 +2264,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2309,7 +2309,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2355,7 +2355,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 #else
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2401,7 +2401,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2409,7 +2409,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+		if (unlikely(khugepaged_test_exit_or_disable(mm))) {
 			progress++;
 			break;
 		}
@@ -2431,7 +2431,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(khugepaged_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2491,7 +2491,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (khugepaged_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.48.1



* [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
  2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-30  7:56   ` Baolin Wang
  2025-04-28 18:12 ` [PATCH v5 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.

Create khugepaged_collapse_single_pmd to increase code
reuse and create an entry point for future khugepaged changes.

Refactor madvise_collapse and khugepaged_scan_mm_slot to use
the new khugepaged_collapse_single_pmd function.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8b96274dce07..3690a6ec5883 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2363,6 +2363,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int khugepaged_collapse_single_pmd(unsigned long addr,
+				   struct vm_area_struct *vma, bool *mmap_locked,
+				   struct collapse_control *cc)
+{
+	int result = SCAN_FAIL;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
+		struct file *file = get_file(vma->vm_file);
+		pgoff_t pgoff = linear_page_index(vma, addr);
+
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+		result = khugepaged_scan_file(mm, addr, file, pgoff, cc);
+		fput(file);
+		if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+			mmap_read_lock(mm);
+			*mmap_locked = true;
+			if (khugepaged_test_exit_or_disable(mm)) {
+				result = SCAN_ANY_PROCESS;
+				goto end;
+			}
+			result = collapse_pte_mapped_thp(mm, addr,
+							 !cc->is_khugepaged);
+			if (result == SCAN_PMD_MAPPED)
+				result = SCAN_SUCCEED;
+			mmap_read_unlock(mm);
+			*mmap_locked = false;
+		}
+	} else {
+		result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc);
+	}
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+end:
+	return result;
+}
+
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2437,34 +2479,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
+						vma, &mmap_locked, cc);
+			/* If we return SCAN_ANY_PROCESS we are holding the mmap_lock */
+			if (*result == SCAN_ANY_PROCESS)
+				goto breakouterloop;
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
@@ -2783,36 +2803,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
 		nodes_clear(cc->alloc_nmask);
-		if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
+		result = khugepaged_collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
 		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			BUG_ON(*prev);
-			mmap_read_lock(mm);
-			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
-		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
-- 
2.48.1



* [PATCH v5 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
  2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
  2025-04-28 18:12 ` [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-28 18:12 ` [PATCH v5 04/12] khugepaged: generalize alloc_charge_folio() Nico Pache
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

For khugepaged to support different mTHP orders, we must generalize this
function for arbitrary orders.

No functional change in this patch.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3690a6ec5883..ddb805356e05 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   bool expect_anon,
 				   struct vm_area_struct **vmap,
-				   struct collapse_control *cc)
+				   struct collapse_control *cc, int order)
 {
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
@@ -932,9 +932,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
+	if (!thp_vma_suitable_order(vma, address, order))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1130,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1164,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2792,7 +2792,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
+							 cc, HPAGE_PMD_ORDER);
 			if (result  != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
-- 
2.48.1



* [PATCH v5 04/12] khugepaged: generalize alloc_charge_folio()
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (2 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-28 18:12 ` [PATCH v5 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

From: Dev Jain <dev.jain@arm.com>

Pass order to alloc_charge_folio() and update mTHP statistics.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/huge_mm.h |  2 ++
 mm/huge_memory.c        |  4 ++++
 mm/khugepaged.c         | 17 +++++++++++------
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..0bb65bd4e6dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -123,6 +123,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..670b8b7a6738 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -680,6 +682,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ddb805356e05..16b5b3761f95 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,21 +1074,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, u8 order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = khugepaged_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (order == HPAGE_PMD_ORDER)
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}
 
-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (order == HPAGE_PMD_ORDER)
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1125,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1849,7 +1854,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.48.1



* [PATCH v5 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (3 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 04/12] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-28 18:12 ` [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

Generalize the order of the __collapse_huge_page_* functions
to support future mTHP collapse.

mTHP collapse can suffer from inconsistent behavior and memory-waste
"creep", so disable swapin and shared-page support for mTHP collapse.

No functional changes in this patch.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 16b5b3761f95..e21998a06253 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
 					struct collapse_control *cc,
-					struct list_head *compound_pagelist)
+					struct list_head *compound_pagelist,
+					u8 order)
 {
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	bool writable = false;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1 << order);
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none(pteval) || (pte_present(pteval) &&
@@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= scaled_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
+			    shared > khugepaged_max_ptes_shared)) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -711,13 +713,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist,
+						u8 order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1 << order);
 	     _pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
@@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 					     pmd_t *pmd,
 					     pmd_t orig_pmd,
 					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+					     struct list_head *compound_pagelist,
+					     u8 order)
 {
 	spinlock_t *pmd_ptl;
 
@@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
 }
 
 /*
@@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, u8 order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
@@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 compound_pagelist, order);
 
 	return result;
 }
@@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
 				       struct vm_area_struct *vma,
 				       unsigned long haddr, pmd_t *pmd,
-				       int referenced)
+				       int referenced, u8 order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long address, end = haddr + (PAGE_SIZE << order);
 	int result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1035,6 +1039,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
 
+		/* Don't swap in for mTHP collapse */
+		if (order != HPAGE_PMD_ORDER) {
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1154,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+				referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1201,7 +1211,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      &compound_pagelist);
+					&compound_pagelist, HPAGE_PMD_ORDER);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1231,7 +1241,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
-- 
2.48.1



* [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (4 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-30 10:08   ` Baolin Wang
  2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

khugepaged scans anon PMD ranges for potential collapse to a hugepage.
To add mTHP support, we use this scan to instead record which chunks of
the PMD range are utilized.

khugepaged_scan_bitmap uses a stack of scan_bit_state structs to
recursively scan a bitmap that represents chunks of utilized regions.
We can then determine what mTHP size fits best, and in the following
patch we set this bitmap while scanning the anon PMD.

max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.

When attempting to collapse an order that is set to "always", always
collapse to that order in a greedy manner without considering the
number of bits set.
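
For illustration, the threshold computation below is the one added by
this patch; the numbers are only an example (max_ptes_none = 255, 4K
base page, attempting an order-4 / 64K collapse):

  threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
  			>> (HPAGE_PMD_ORDER - state.order);

  /* state.order = 4 - KHUGEPAGED_MIN_MTHP_ORDER = 2, num_chunks = 4,
   * threshold_bits = (512 - 255 - 1) >> (9 - 2) = 2, so at least 3 of
   * the 4 bitmap chunks must be utilized before the collapse is
   * attempted (unless the order is set to "always").
   */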

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/khugepaged.h |  4 ++
 mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 9 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 1f46046080f5..18fe6eb5051d 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,6 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
+#define KHUGEPAGED_MIN_MTHP_ORDER	2
+#define KHUGEPAGED_MIN_MTHP_NR	(1<<KHUGEPAGED_MIN_MTHP_ORDER)
+#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
 
 extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e21998a06253..6e67db86409a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+struct scan_bit_state {
+	u8 order;
+	u16 offset;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -102,6 +107,18 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/*
+	 * bitmap used to collapse mTHP sizes.
+	 * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
+	 */
+	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
+	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
+};
+
+struct collapse_control khugepaged_collapse_control = {
+	.is_khugepaged = true,
 };
 
 /**
@@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-struct collapse_control khugepaged_collapse_control = {
-	.is_khugepaged = true,
-};
-
 static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
@@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, bool *mmap_locked,
+				  u8 order, u16 offset)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * The allocation can take potentially a long time if it involves
 	 * sync compaction, and we do not need to hold the mmap_lock during
 	 * that. We will recheck the vma after taking it again in write mode.
+	 * If collapsing mTHPs we may have already released the read_lock.
 	 */
-	mmap_read_unlock(mm);
+	if (*mmap_locked) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+	}
 
 	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
@@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
+	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
 	return result;
 }
 
+// Recursively consume the bitmap using an explicit stack
+static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
+			int referenced, int unmapped, struct collapse_control *cc,
+			bool *mmap_locked, unsigned long enabled_orders)
+{
+	u8 order, next_order;
+	u16 offset, mid_offset;
+	int num_chunks;
+	int bits_set, threshold_bits;
+	int top = -1;
+	int collapsed = 0;
+	int ret;
+	struct scan_bit_state state;
+	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
+	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
+
+	while (top >= 0) {
+		state = cc->mthp_bitmap_stack[top--];
+		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
+		offset = state.offset;
+		num_chunks = 1 << (state.order);
+		// Skip mTHP orders that are not enabled
+		if (!test_bit(order, &enabled_orders))
+			goto next;
+
+		// copy the relevant section to a new bitmap
+		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
+				  MTHP_BITMAP_SIZE);
+
+		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
+		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
+				>> (HPAGE_PMD_ORDER - state.order);
+
+		// Check if the region is "almost full" based on the threshold
+		if (bits_set > threshold_bits || is_pmd_only
+			|| test_bit(order, &huge_anon_orders_always)) {
+			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
+					mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += (1 << order);
+				continue;
+			}
+		}
+
+next:
+		if (state.order > 0) {
+			next_order = state.order - 1;
+			mid_offset = offset + (num_chunks / 2);
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, mid_offset };
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, offset };
+		}
+	}
+	return collapsed;
+}
+
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
@@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
-- 
2.48.1



* [PATCH v5 07/12] khugepaged: add mTHP support
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (5 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-30 20:51   ` Jann Horn
  2025-05-01 12:58   ` Hugh Dickins
  2025-04-28 18:12 ` [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
mTHPs are enabled, we remove the restriction of max_ptes_none during the
scan phase so we don't bail out early and miss potential mTHP candidates.

After the scan is complete we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.

If an mTHP collapse is attempted but the range contains swapped-out or
shared pages, we don't perform the collapse.
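
For illustration, the per-chunk scaling used by the scan in this patch,
with example numbers (max_ptes_none = 255, 4K base page):

  scaled_none = khugepaged_max_ptes_none >>
  		(HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);

  /* scaled_none = 255 >> 7 = 1, so a 4-PTE chunk sets its bit in the
   * bitmap only if at most one of its PTEs is none/zero; with the
   * default max_ptes_none = 511, scaled_none = 3 and a chunk with even
   * a single present page sets its bit.
   */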

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 88 insertions(+), 37 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6e67db86409a..3a846cd70c66 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	unsigned long _address = address + offset * PAGE_SIZE;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		*mmap_locked = false;
 	}
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-				referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+				referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+				_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address to
@@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-					&compound_pagelist, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+					&compound_pagelist, order);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   &compound_pagelist, HPAGE_PMD_ORDER);
-	pte_unmap(pte);
+					   vma, _address, pte_ptl,
+					   &compound_pagelist, order);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
 
@@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
-	_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
-	spin_unlock(pmd_ptl);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
+		_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, address, pmd, _pmd);
+		update_mmu_cache_pmd(vma, address, pmd);
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	} else { /* mTHP collapse */
+		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		folio_ref_add(folio, (1 << order) - 1);
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
 out_up_write:
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
 out_nolock:
 	*mmap_locked = false;
@@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long _address;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
+	bool is_pmd_only;
 	bool writable = false;
-
+	int chunk_none_count = 0;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+		tva_flags, THP_ORDERS_ALL_ANON);
+
+	is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte) {
 		result = SCAN_PMD_NULL;
 		goto out;
 	}
 
-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, _address += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		/*
+		 * We are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. If
+		 * there are pages in this chunk, keep track of it in the
+		 * bitmap for mTHP collapsing.
+		 */
+		if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
+			if (chunk_none_count <= scaled_none)
+				bitmap_set(cc->mthp_bitmap,
+					   i / KHUGEPAGED_MIN_MTHP_NR, 1);
+
+			chunk_none_count = 0;
+		}
+
+		_pte = pte + i;
+		_address = address + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
@@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			}
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++chunk_none_count;
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			    (!cc->is_khugepaged || !is_pmd_only ||
+				none_or_zero <= khugepaged_max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 								     address)))
 			referenced++;
 	}
+
 	if (!writable) {
 		result = SCAN_PAGE_RO;
 	} else if (cc->is_khugepaged &&
@@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
+		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
+			       mmap_locked, enabled_orders);
+		if (result > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
-- 
2.48.1



* [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (6 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-30 10:09   ` Baolin Wang
  2025-04-28 18:12 ` [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

khugepaged may try to collapse an mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
whether it is OK to collapse to a smaller mTHP size (like in the case of
a partially mapped folio).

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3a846cd70c66..86d1153ce9e8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -625,7 +625,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
-		/* See hpage_collapse_scan_pmd(). */
+		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
+
+		/* See khugepaged_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
 			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (7 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-30 10:12   ` Baolin Wang
  2025-04-28 18:12 ` [PATCH v5 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 86d1153ce9e8..5e6732cccb86 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1365,6 +1365,23 @@ static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
 				collapsed += (1 << order);
 				continue;
 			}
+			/*
+			 * Some ret values indicate that all lower orders will also
+			 * fail; don't try to collapse smaller orders.
+			 */
+			if (ret == SCAN_EXCEED_NONE_PTE ||
+				ret == SCAN_EXCEED_SWAP_PTE ||
+				ret == SCAN_EXCEED_SHARED_PTE ||
+				ret == SCAN_PTE_NON_PRESENT ||
+				ret == SCAN_PTE_UFFD_WP ||
+				ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
+				ret == SCAN_CGROUP_CHARGE_FAIL ||
+				ret == SCAN_COPY_MC ||
+				ret == SCAN_PAGE_LOCK ||
+				ret == SCAN_PAGE_COUNT)
+				goto next;
+			else
+				break;
 		}
 
 next:
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 10/12] khugepaged: improve tracepoints for mTHP orders
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (8 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-28 18:12 ` [PATCH v5 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
  2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  11 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

Add the order to the tracepoints to give better insight into which order
khugepaged is operating on.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    | 10 +++++----
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 9d5c00b0285c..ea2fe20a39f5 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -92,34 +92,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s order=%d",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct page *page, int none_or_zero,
-		 int referenced, bool  writable, int status),
+		 int referenced, bool  writable, int status, int order),
 
-	TP_ARGS(page, none_or_zero, referenced, writable, status),
+	TP_ARGS(page, none_or_zero, referenced, writable, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
@@ -127,6 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__field(int, referenced)
 		__field(bool, writable)
 		__field(int, status)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
@@ -135,27 +139,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->referenced = referenced;
 		__entry->writable = writable;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s order=%d",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
 		__entry->writable,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+			int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(int, order)
 	),
 
 	TP_fast_assign(
@@ -163,13 +171,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%d",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5e6732cccb86..1d99ef184c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -721,13 +721,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
-						    referenced, writable, result);
+						    referenced, writable, result,
+						    order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
-					    referenced, writable, result);
+					    referenced, writable, result, order);
 	return result;
 }
 
@@ -1097,7 +1098,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+						order);
 	return result;
 }
 
@@ -1317,7 +1319,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 11/12] khugepaged: add per-order mTHP khugepaged stats
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (9 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  11 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

With mTHP support in place, let's add the per-order mTHP stats for
exceeding NONE, SWAP, and SHARED.
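
These counters should show up alongside the existing per-size stats, e.g.
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_none_pte
on a kernel with a 4K base page size.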

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/huge_mm.h |  3 +++
 mm/huge_memory.c        |  7 +++++++
 mm/khugepaged.c         | 16 +++++++++++++---
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0bb65bd4e6dd..e3d15c737008 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -139,6 +139,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
 	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+	MTHP_STAT_COLLAPSE_EXCEED_NONE,
+	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 670b8b7a6738..8af5caa0d9bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,6 +633,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -649,6 +653,9 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
 	&nr_anon_partially_mapped_attr.attr,
+	&collapse_exceed_swap_pte_attr.attr,
+	&collapse_exceed_none_pte_attr.attr,
+	&collapse_exceed_shared_pte_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1d99ef184c3f..812181354c46 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -604,7 +604,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				if (order == HPAGE_PMD_ORDER)
+					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				else
+					count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out;
 			}
 		}
@@ -633,8 +636,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		/* See khugepaged_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
-			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared)) {
+			if (order != HPAGE_PMD_ORDER) {
+				result = SCAN_EXCEED_SHARED_PTE;
+				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+				goto out;
+			}
+
+			if (cc->is_khugepaged &&
+				shared > khugepaged_max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -1060,6 +1069,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 
 		/* Dont swapin for mTHP collapse */
 		if (order != HPAGE_PMD_ORDER) {
+			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 			result = SCAN_EXCEED_SWAP_PTE;
 			goto out;
 		}
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse
  2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
                   ` (10 preceding siblings ...)
  2025-04-28 18:12 ` [PATCH v5 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
@ 2025-04-28 18:12 ` Nico Pache
  2025-04-29  5:10   ` Bagas Sanjaya
  2025-04-29 16:38   ` Christoph Lameter (Ampere)
  11 siblings, 2 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-28 18:12 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to utilize it.
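
A rough worked example of the scaling described below (assuming a 4K base
page size, so HPAGE_PMD_NR = 512): a 64K collapse covers 16 PTEs, so the
max_ptes_none limit is scaled down proportionally; with the default
max_ptes_none=511 roughly 15 of the 16 PTEs may be none, and a single
present page is enough to attempt the collapse. That collapse fills all 16
PTEs with non-zero pages, so on the next scan the surrounding 128K range is
half populated and passes its scaled check too, and so on up to the PMD
size. Keeping max_ptes_none under HPAGE_PMD_NR / 2 (255) breaks this
feedback loop, which is what the text below recommends.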

Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..5c63fe51b3ad 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -144,6 +144,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
 sizes, the kernel will select the most appropriate enabled size for a
 given allocation.
 
+khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size
+to determine collapses. When using mTHPs it's recommended to set
+max_ptes_none low, ideally less than HPAGE_PMD_NR / 2 (255 on 4k page
+size). This will prevent undesired "creep" behavior that leads to
+continuously collapsing to a larger mTHP size; when we collapse, we are
+bringing in new non-zero pages that will, on a subsequent scan, cause the
+max_ptes_none check of the +1 order to always be satisfied. By limiting
+this to less than half the current order, we make sure we don't cause this
+feedback loop. max_ptes_shared and max_ptes_swap have no effect when
+collapsing to a mTHP, and mTHP collapse will fail on shared or swapped out
+pages.
+
 It's also possible to limit defrag efforts in the VM to generate
 anonymous hugepages in case they're not immediately free to madvise
 regions or to never try to defrag memory and simply fallback to regular
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse
  2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-04-29  5:10   ` Bagas Sanjaya
  2025-04-29 16:38   ` Christoph Lameter (Ampere)
  1 sibling, 0 replies; 41+ messages in thread
From: Bagas Sanjaya @ 2025-04-29  5:10 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

On Mon, Apr 28, 2025 at 12:12:18PM -0600, Nico Pache wrote:
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index dff8d5985f0f..5c63fe51b3ad 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,7 @@ often.
>  THP can be enabled system wide or restricted to certain tasks or even
>  memory ranges inside task's address space. Unless THP is completely
>  disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages.
>  
>  The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
>  interface and using madvise(2) and prctl(2) system calls.
> @@ -144,6 +144,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
>  sizes, the kernel will select the most appropriate enabled size for a
>  given allocation.
>  
> +khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size
> +to determine collapses. When using mTHPs it's recommended to set
> +max_ptes_none low, ideally less than HPAGE_PMD_NR / 2 (255 on 4k page
> +size). This will prevent undesired "creep" behavior that leads to
> +continuously collapsing to a larger mTHP size; when we collapse, we are
> +bringing in new non-zero pages that will, on a subsequent scan, cause the
> +max_ptes_none check of the +1 order to always be satisfied. By limiting
> +this to less than half the current order, we make sure we don't cause this
> +feedback loop. max_ptes_shared and max_ptes_swap have no effect when
> +collapsing to a mTHP, and mTHP collapse will fail on shared or swapped out
> +pages.
> +
>  It's also possible to limit defrag efforts in the VM to generate
>  anonymous hugepages in case they're not immediately free to madvise
>  regions or to never try to defrag memory and simply fallback to regular

Looks good, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_*
  2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
@ 2025-04-29 13:41   ` Zi Yan
  2025-04-30  7:47   ` Baolin Wang
  1 sibling, 0 replies; 41+ messages in thread
From: Zi Yan @ 2025-04-29 13:41 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

On 28 Apr 2025, at 14:12, Nico Pache wrote:

> functions in khugepaged.c use a mix of hpage_collapse and khugepaged
> as the function prefix.
>
> rename all of them to khugepaged to keep things consistent and slightly
> shorten the function names.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 44 ++++++++++++++++++++++----------------------
>  1 file changed, 22 insertions(+), 22 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse
  2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
  2025-04-29  5:10   ` Bagas Sanjaya
@ 2025-04-29 16:38   ` Christoph Lameter (Ampere)
  2025-04-29 17:15     ` David Hildenbrand
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-04-29 16:38 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, jglisse, surenb,
	zokeefe, hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes,
	Liam.Howlett

On Mon, 28 Apr 2025, Nico Pache wrote:

>  THP can be enabled system wide or restricted to certain tasks or even
>  memory ranges inside task's address space. Unless THP is completely
>  disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages.

huge pages usually have a fixed size like 2M and are tied to the page
table levels.

Would it not be advisable to use a different term here like "large folio"
or "mTHP sized folio" or something like that?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse
  2025-04-29 16:38   ` Christoph Lameter (Ampere)
@ 2025-04-29 17:15     ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-04-29 17:15 UTC (permalink / raw)
  To: Christoph Lameter (Ampere), Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, baohua, baolin.wang,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On 29.04.25 18:38, Christoph Lameter (Ampere) wrote:
> On Mon, 28 Apr 2025, Nico Pache wrote:
> 
>>   THP can be enabled system wide or restricted to certain tasks or even
>>   memory ranges inside task's address space. Unless THP is completely
>>   disabled, there is ``khugepaged`` daemon that scans memory and
>> -collapses sequences of basic pages into PMD-sized huge pages.
>> +collapses sequences of basic pages into huge pages.
> 
> huge pages usually have a fixed size like 2M and are tied to the page
> table levels.
> 
> Would it not be advisable to use a different term here like "large folio"
> or "mTHP sized folio" or something like that?

"Traditional THPs or mTHPs (both, currently represented as "large 
folios" in the kernel).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_*
  2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
  2025-04-29 13:41   ` Zi Yan
@ 2025-04-30  7:47   ` Baolin Wang
  1 sibling, 0 replies; 41+ messages in thread
From: Baolin Wang @ 2025-04-30  7:47 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett



On 2025/4/29 02:12, Nico Pache wrote:
> functions in khugepaged.c use a mix of hpage_collapse and khugepaged
> as the function prefix.
> 
> rename all of them to khugepaged to keep things consistent and slightly
> shorten the function names.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
  2025-04-28 18:12 ` [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-04-30  7:56   ` Baolin Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Baolin Wang @ 2025-04-30  7:56 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett



On 2025/4/29 02:12, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
> 
> Create khugepaged_collapse_single_pmd to increase code
> reuse and create an entry point for future khugepaged changes.
> 
> Refactor madvise_collapse and khugepaged_scan_mm_slot to use
> the new khugepaged_collapse_single_pmd function.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-04-28 18:12 ` [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-04-30 10:08   ` Baolin Wang
  2025-04-30 18:56     ` Nico Pache
  0 siblings, 1 reply; 41+ messages in thread
From: Baolin Wang @ 2025-04-30 10:08 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett



On 2025/4/29 02:12, Nico Pache wrote:
> khugepaged scans anons PMD ranges for potential collapse to a hugepage.
> To add mTHP support we use this scan to instead record chunks of utilized
> sections of the PMD.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine what
> mTHP size fits best and in the following patch, we set this bitmap while
> scanning the anon PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> When attempting to collapse an order that has its order set to "always"
> lets always collapse to that order in a greedy manner without
> considering the number of bits set.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   include/linux/khugepaged.h |  4 ++
>   mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>   2 files changed, 89 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..18fe6eb5051d 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   #ifndef _LINUX_KHUGEPAGED_H
>   #define _LINUX_KHUGEPAGED_H
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2

Still better to add some comments to explain explicitly why 2 was chosen
as the MIN_MTHP_ORDER.

> +#define KHUGEPAGED_MIN_MTHP_NR	(1<<KHUGEPAGED_MIN_MTHP_ORDER)
> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>   
>   extern unsigned int khugepaged_max_ptes_none __read_mostly;
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e21998a06253..6e67db86409a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>   
>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>   
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>   struct collapse_control {
>   	bool is_khugepaged;
>   
> @@ -102,6 +107,18 @@ struct collapse_control {
>   
>   	/* nodemask for allocation fallback */
>   	nodemask_t alloc_nmask;
> +
> +	/*
> +	 * bitmap used to collapse mTHP sizes.
> +	 * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> +	 */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>   };
>   
>   /**
> @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
>   	remove_wait_queue(&khugepaged_wait, &wait);
>   }
>   
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>   {
>   	int i;
> @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>   
>   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  u8 order, u16 offset)
>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * The allocation can take potentially a long time if it involves
>   	 * sync compaction, and we do not need to hold the mmap_lock during
>   	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>   	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>   
>   	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)
> @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   out_up_write:
>   	mmap_write_unlock(mm);
>   out_nolock:
> +	*mmap_locked = false;
>   	if (folio)
>   		folio_put(folio);
>   	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>   	return result;
>   }
>   
> +// Recursive function to consume the bitmap

Nit: please use '/* Xxxx */' for comments in this patch.

> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled
> +		if (!test_bit(order, &enabled_orders))
> +			goto next;
> +
> +		// copy the relavant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> +				>> (HPAGE_PMD_ORDER - state.order);
> +
> +		//Check if the region is "almost full" based on the threshold
> +		if (bits_set > threshold_bits || is_pmd_only
> +			|| test_bit(order, &huge_anon_orders_always)) {

When testing this patch, I disabled the PMD-sized THP and enabled 
64K-sized mTHP, but it still attempts to collapse into a PMD-sized THP 
(since bits_set > threshold_bits is true). This doesn't seem reasonable?

> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += (1 << order);
> +				continue;
> +			}
> +		}
> +
> +next:
> +		if (state.order > 0) {
> +			next_order = state.order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}
> +
>   static int khugepaged_scan_pmd(struct mm_struct *mm,
>   				   struct vm_area_struct *vma,
>   				   unsigned long address, bool *mmap_locked,
> @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
>   		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>   	}
>   out:
>   	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders
  2025-04-28 18:12 ` [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-04-30 10:09   ` Baolin Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Baolin Wang @ 2025-04-30 10:09 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett



On 2025/4/29 02:12, Nico Pache wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. Skip these cases until we have a way to check
> if it's OK to collapse to a smaller mTHP size (like in the case of a
> partially mapped folio).
> 
> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
> 
> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
> 
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-04-28 18:12 ` [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-04-30 10:12   ` Baolin Wang
  2025-04-30 18:43     ` Nico Pache
  0 siblings, 1 reply; 41+ messages in thread
From: Baolin Wang @ 2025-04-30 10:12 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
  Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett



On 2025/4/29 02:12, Nico Pache wrote:
> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 86d1153ce9e8..5e6732cccb86 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1365,6 +1365,23 @@ static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
>   				collapsed += (1 << order);
>   				continue;
>   			}
> +			/*
> +			 * Some ret values indicate that all lower orders will also
> +			 * fail; don't try to collapse smaller orders.
> +			 */
> +			if (ret == SCAN_EXCEED_NONE_PTE ||
> +				ret == SCAN_EXCEED_SWAP_PTE ||
> +				ret == SCAN_EXCEED_SHARED_PTE ||
> +				ret == SCAN_PTE_NON_PRESENT ||
> +				ret == SCAN_PTE_UFFD_WP ||
> +				ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
> +				ret == SCAN_CGROUP_CHARGE_FAIL ||
> +				ret == SCAN_COPY_MC ||
> +				ret == SCAN_PAGE_LOCK ||
> +				ret == SCAN_PAGE_COUNT)
> +				goto next;
> +			else
> +				break;

Better to merge this patch into patch 6, which would help in
understanding your logic.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts
  2025-04-30 10:12   ` Baolin Wang
@ 2025-04-30 18:43     ` Nico Pache
  0 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-30 18:43 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On Wed, Apr 30, 2025 at 4:12 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/4/29 02:12, Nico Pache wrote:
> > There are cases where, if an attempted collapse fails, all subsequent
> > orders are guaranteed to also fail. Avoid these collapse attempts by
> > bailing out early.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 17 +++++++++++++++++
> >   1 file changed, 17 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 86d1153ce9e8..5e6732cccb86 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1365,6 +1365,23 @@ static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> >                               collapsed += (1 << order);
> >                               continue;
> >                       }
> > +                     /*
> > +                      * Some ret values indicate that all lower orders will also
> > +                      * fail; don't try to collapse smaller orders.
> > +                      */
> > +                     if (ret == SCAN_EXCEED_NONE_PTE ||
> > +                             ret == SCAN_EXCEED_SWAP_PTE ||
> > +                             ret == SCAN_EXCEED_SHARED_PTE ||
> > +                             ret == SCAN_PTE_NON_PRESENT ||
> > +                             ret == SCAN_PTE_UFFD_WP ||
> > +                             ret == SCAN_ALLOC_HUGE_PAGE_FAIL ||
> > +                             ret == SCAN_CGROUP_CHARGE_FAIL ||
> > +                             ret == SCAN_COPY_MC ||
> > +                             ret == SCAN_PAGE_LOCK ||
> > +                             ret == SCAN_PAGE_COUNT)
> > +                             goto next;
> > +                     else
> > +                             break;
>
> Better to merge this patch into patch 6, which would help in
> understanding your logic.
Sounds good, it wasn't part of the original logic/RFCs so I separated
it out to get some review on it.
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-04-30 10:08   ` Baolin Wang
@ 2025-04-30 18:56     ` Nico Pache
  2025-05-01 23:03       ` Nico Pache
  2025-06-06 16:37       ` Dev Jain
  0 siblings, 2 replies; 41+ messages in thread
From: Nico Pache @ 2025-04-30 18:56 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On Wed, Apr 30, 2025 at 4:08 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/4/29 02:12, Nico Pache wrote:
> > khugepaged scans anons PMD ranges for potential collapse to a hugepage.
> > To add mTHP support we use this scan to instead record chunks of utilized
> > sections of the PMD.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the anon PMD.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > When attempting to collapse an order that has its order set to "always"
> > lets always collapse to that order in a greedy manner without
> > considering the number of bits set.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   include/linux/khugepaged.h |  4 ++
> >   mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
> >   2 files changed, 89 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 1f46046080f5..18fe6eb5051d 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> >   /* SPDX-License-Identifier: GPL-2.0 */
> >   #ifndef _LINUX_KHUGEPAGED_H
> >   #define _LINUX_KHUGEPAGED_H
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
>
> Still better to add some comments to explain explicitly why 2 was chosen
> as the MIN_MTHP_ORDER.
OK, I'll add a note that explicitly states that the min order of anon mTHPs is 2.
>
> > +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
> > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
> >
> >   extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e21998a06253..6e67db86409a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >   static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +     u8 order;
> > +     u16 offset;
> > +};
> > +
> >   struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -102,6 +107,18 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /*
> > +      * bitmap used to collapse mTHP sizes.
> > +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> > +      */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +     .is_khugepaged = true,
> >   };
> >
> >   /**
> > @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
> >       remove_wait_queue(&khugepaged_wait, &wait);
> >   }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -     .is_khugepaged = true,
> > -};
> > -
> >   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >   {
> >       int i;
> > @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                             int referenced, int unmapped,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, bool *mmap_locked,
> > +                               u8 order, u16 offset)
> >   {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * The allocation can take potentially a long time if it involves
> >        * sync compaction, and we do not need to hold the mmap_lock during
> >        * that. We will recheck the vma after taking it again in write mode.
> > +      * If collapsing mTHPs we may have already released the read_lock.
> >        */
> > -     mmap_read_unlock(mm);
> > +     if (*mmap_locked) {
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> > +     }
> >
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> > @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >   out_up_write:
> >       mmap_write_unlock(mm);
> >   out_nolock:
> > +     *mmap_locked = false;
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >       return result;
> >   }
> >
> > +// Recursive function to consume the bitmap
>
> Nit: please use '/* Xxxx */' for comments in this patch.
>
> > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +                     int referenced, int unmapped, struct collapse_control *cc,
> > +                     bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +     u8 order, next_order;
> > +     u16 offset, mid_offset;
> > +     int num_chunks;
> > +     int bits_set, threshold_bits;
> > +     int top = -1;
> > +     int collapsed = 0;
> > +     int ret;
> > +     struct scan_bit_state state;
> > +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > +
> > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> > +
> > +     while (top >= 0) {
> > +             state = cc->mthp_bitmap_stack[top--];
> > +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> > +             offset = state.offset;
> > +             num_chunks = 1 << (state.order);
> > +             // Skip mTHP orders that are not enabled
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next;
> > +
> > +             // copy the relavant section to a new bitmap
> > +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > +                               MTHP_BITMAP_SIZE);
> > +
> > +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> > +                             >> (HPAGE_PMD_ORDER - state.order);
> > +
> > +             //Check if the region is "almost full" based on the threshold
> > +             if (bits_set > threshold_bits || is_pmd_only
> > +                     || test_bit(order, &huge_anon_orders_always)) {
>
> When testing this patch, I disabled the PMD-sized THP and enabled
> 64K-sized mTHP, but it still attempts to collapse into a PMD-sized THP
> (since bits_set > threshold_bits is true). This doesn't seem reasonable?
We are still required to have PMD enabled for mTHP collapse to work.
It's a limitation of the current khugepaged code (it currently only
adds mm_slots when PMD is enabled).
We've discussed this in the past and are looking for a proper way
forward, but the solution becomes tricky.

However, I'm surprised that it still collapses due to the code below.
I'll test this out later today.
    +             if (!test_bit(order, &enabled_orders))
    +                     goto next;
>
> > +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +                                     mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += (1 << order);
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next:
> > +             if (state.order > 0) {
> > +                     next_order = state.order - 1;
> > +                     mid_offset = offset + (num_chunks / 2);
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, mid_offset };
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, offset };
> > +                     }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >   static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                                  struct vm_area_struct *vma,
> >                                  unsigned long address, bool *mmap_locked,
> > @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >       }
> >   out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
@ 2025-04-30 20:51   ` Jann Horn
  2025-05-01 22:29     ` Nico Pache
  2025-05-01 12:58   ` Hugh Dickins
  1 sibling, 1 reply; 41+ messages in thread
From: Jann Horn @ 2025-04-30 20:51 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning PMD ranges for potential collapse candidates, keep track
> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> mTHPs are enabled we remove the restriction of max_ptes_none during the
> scan phase so we dont bailout early and miss potential mTHP candidates.
>
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
>
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
[...]
> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         vma_start_write(vma);
>         anon_vma_lock_write(vma->anon_vma);
>
> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -                               address + HPAGE_PMD_SIZE);
> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +                               _address + (PAGE_SIZE << order));
>         mmu_notifier_invalidate_range_start(&range);
>
>         pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>         /*
>          * This removes any huge TLB entry from the CPU so we won't allow
>          * huge and small TLB entries for the same virtual address to

It's not visible in this diff, but we're about to do a
pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
entire page table, meaning it tears down 2MiB of address space; and it
assumes that the entire page table exclusively corresponds to the
current VMA.

I think you'll need to ensure that the pmdp_collapse_flush() only
happens for full-size THP, and that mTHP only tears down individual
PTEs in the relevant range. (That code might get a bit messy, since
the existing THP code tears down PTEs in a detached page table, while
mTHP would have to do it in a still-attached page table.)
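
For illustration only (this is not code from the series; the helper name,
locking and TLB handling are simplified assumptions), per-range teardown
for a mTHP collapse could look roughly like:

static void mthp_clear_pte_range(struct mm_struct *mm,
				 struct vm_area_struct *vma, pmd_t *pmd,
				 unsigned long start, unsigned int nr_pages)
{
	unsigned long addr = start;
	spinlock_t *ptl;
	unsigned int i;
	pte_t *pte = pte_offset_map_lock(mm, pmd, start, &ptl);

	if (!pte)
		return;
	/* Clear only the PTEs covered by the candidate range... */
	for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE)
		ptep_get_and_clear(mm, addr, pte + i);
	pte_unmap_unlock(pte, ptl);
	/* ...and flush just that range; the page table stays attached. */
	flush_tlb_range(vma, start, start + nr_pages * PAGE_SIZE);
}

whereas pmdp_collapse_flush() detaches and flushes the whole 2MiB range.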

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
  2025-04-30 20:51   ` Jann Horn
@ 2025-05-01 12:58   ` Hugh Dickins
  2025-05-01 16:15     ` Andrew Morton
  2025-05-01 22:30     ` Nico Pache
  1 sibling, 2 replies; 41+ messages in thread
From: Hugh Dickins @ 2025-05-01 12:58 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett, Jann Horn

On Mon, 28 Apr 2025, Nico Pache wrote:

> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning PMD ranges for potential collapse candidates, keep track
> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> mTHPs are enabled we remove the restriction of max_ptes_none during the
> scan phase so we dont bailout early and miss potential mTHP candidates.
> 
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
> 
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

There are locking errors in this patch.  Let me comment inline below,
then at the end append the fix patch I'm using, to keep mm-new usable
for me. But that's more of an emergency rescue than a recommended fixup:
I don't much like your approach here, and hope it will change in v6.

> ---
>  mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 88 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6e67db86409a..3a846cd70c66 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;

I didn't wait to see the problem, just noticed that in the v4->v5
update, pte gets used at out_up_write, but there are gotos before
pte has been initialized. Declare pte = NULL here.

>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		*mmap_locked = false;
>  	}
>  
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
>  
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>  	/*
>  	 * Prevent all access to pagetables with the exception of
>  	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,

I spent a long time trying to work out why the include/linux/swapops.h:511
BUG is soon hit - the BUG which tells there's a migration entry left behind
after its folio has been unlocked.

In the patch at the end you'll see that I've inserted a check here, to
abort if the VMA following the revalidated "vma" is sharing the page table
(and so also affected by clearing *pmd).  That turned out not to be the
problem (WARN_ONs inserted never fired in my limited testing), but it still
looks to me as if some such check is needed.  Or I may be wrong, and
"revalidate" (a better place for the check) does actually check that, but
it wasn't obvious, and I haven't spent more time looking at it (but it did
appear to rule out the case of a VMA before "vma" sharing the page table).

>  	vma_start_write(vma);
>  	anon_vma_lock_write(vma->anon_vma);
>  
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));

mmu_notifiers tend to be rather a mystery to me, so I've made no change
below, but it's not obvious whether it's correct to clear the *pmd but only
notify of clearing a subset of that range: what's outside the range soon
gets replaced as it was, but is that good enough?  I don't know.
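
For reference, one conservative option (a sketch, not necessarily
required) is to keep notifying the full PMD range whenever *pmd is
actually cleared, as the pre-mTHP code did:

	/* *pmd is cleared below, so notify for the whole PMD range even
	 * when only an mTHP-sized subset is being collapsed (sketch) */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
				address + HPAGE_PMD_SIZE);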

>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>  	/*
>  	 * This removes any huge TLB entry from the CPU so we won't allow
>  	 * huge and small TLB entries for the same virtual address to

The line I want to comment on does not appear in this diff context:
	_pmd = pmdp_collapse_flush(vma, address, pmd);

That is appropriate for a THP occupying the whole range of the page table,
but is a surprising way to handle an "mTHP" of just some of its ptes:
I would expect you to be invalidating and replacing just those.

And that is the cause of the swapops:511 BUGs: "uninvolved" ptes are
being temporarily hidden, so not found when remove_migration_ptes()
goes looking for them.

This reliance on pmdp_collapse_flush() can be rescued, with stricter
locking (comment below); but I don't like it, and notice Jann has
picked up on it too.  I hope v6 switches to handling ptes by ptes.

> @@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>  
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
>  	}
>  
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);
>  		spin_lock(pmd_ptl);
>  		BUG_ON(!pmd_none(*pmd));
>  		/*
> @@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	anon_vma_unlock_write(vma->anon_vma);

Phew, it's just visible there in the context.  The anon_vma lock is what
keeps out racing lookups; so, that anon_vma_unlock_write() (and its
"All pages are isolated and locked" comment) is appropriate in the
HPAGE_PMD_SIZEd THP case, but has to be left until later for mTHP ptes.

But the anon_vma lock may well span a much larger range than the pte
lock, and the pmd lock certainly spans a much larger range than the
pte lock; so we really prefer to release anon_vma lock and pmd lock
as soon as is safe, and use pte lock in preference where possible.

>  
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> -	pte_unmap(pte);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>  
> @@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { /* mTHP collapse */
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);

I haven't changed that, but it is odd: yes, pmd_ptl will be required
when doing the pmd_populate(), but it serves no purpose here when
fiddling around with ptes in a disconnected page table.

> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		spin_unlock(pmd_ptl);
> +	}
>  
>  	folio = NULL;
>  
>  	result = SCAN_SUCCEED;

Somewhere around here it becomes safe for mTHP to anon_vma_unlock_write().

>  out_up_write:
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
>  	*mmap_locked = false;
> @@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> +	int i;
>  	int result = SCAN_FAIL, referenced = 0;
>  	int none_or_zero = 0, shared = 0;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long _address;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
> +	bool is_pmd_only;
>  	bool writable = false;
> -
> +	int chunk_none_count = 0;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
>  
> +	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +		tva_flags, THP_ORDERS_ALL_ANON);
> +
> +	is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> +
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (!pte) {
>  		result = SCAN_PMD_NULL;
>  		goto out;
>  	}
>  
> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, _address += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		/*
> +		 * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> +		 * there are pages in this chunk keep track of it in the bitmap
> +		 * for mTHP collapsing.
> +		 */
> +		if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> +			if (chunk_none_count <= scaled_none)
> +				bitmap_set(cc->mthp_bitmap,
> +					   i / KHUGEPAGED_MIN_MTHP_NR, 1);
> +
> +			chunk_none_count = 0;
> +		}
> +
> +		_pte = pte + i;
> +		_address = address + i * PAGE_SIZE;
>  		pte_t pteval = ptep_get(_pte);
>  		if (is_swap_pte(pteval)) {
>  			++unmapped;
> @@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			++chunk_none_count;
>  			++none_or_zero;
>  			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			    (!cc->is_khugepaged || !is_pmd_only ||
> +				none_or_zero <= khugepaged_max_ptes_none)) {
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  								     address)))
>  			referenced++;
>  	}
> +
>  	if (!writable) {
>  		result = SCAN_PAGE_RO;
>  	} else if (cc->is_khugepaged &&
> @@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> +			       mmap_locked, enabled_orders);
> +		if (result > 0)
> +			result = SCAN_SUCCEED;
> +		else
> +			result = SCAN_FAIL;
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> -- 
> 2.48.1

Fixes to 07/12 "khugepaged: add mTHP support".
But I see now that the first hunk is actually not to this 07/12, but to
05/12 "khugepaged: generalize __collapse_huge_page_* for mTHP support":
the mTHP check added in __collapse_huge_page_swapin() forgets to unmap
and unlock before returning, causing RCU imbalance warnings and lockups.
I won't separate it out here, let me leave that to you.

And I had other fixes to v4, which you've fixed differently in v5,
I haven't looked up which patch: where khugepaged_collapse_single_pmd()
does mmap_read_(un)lock() around collapse_pte_mapped_thp().  I dislike
your special use of result SCAN_ANY_PROCESS there, because mmap_locked
is precisely the tool for that job, so just lock and unlock without
setting *mmap_locked true (but I'd agree that mmap_locked is confusing,
and offhand wouldn't want to assert exactly what it means - does it
mean that mmap lock was *never* dropped, so "vma" is safe without
revalidation? depends on where it's used perhaps).

Hugh

---
 mm/khugepaged.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c1c637dbcb81..2c814c239d65 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1054,6 +1054,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 
 		/* Dont swapin for mTHP collapse */
 		if (order != HPAGE_PMD_ORDER) {
+			pte_unmap(pte);
+			mmap_read_unlock(mm);
 			result = SCAN_EXCEED_SWAP_PTE;
 			goto out;
 		}
@@ -1136,7 +1138,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte, mthp_pte;
+	pte_t *pte = NULL, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
@@ -1208,6 +1210,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
+	if (vma->vm_end < address + HPAGE_PMD_SIZE) {
+		struct vm_area_struct *next_vma = find_vma(mm, vma->vm_end);
+		/*
+		 * We must not clear *pmd if it is used by the following VMA.
+		 * Well, perhaps we could if it, and all following VMAs using
+		 * this same page table, share the same anon_vma, and so are
+		 * locked out together: but keep it simple for now (and this
+		 * code might better belong in hugepage_vma_revalidate()).
+		 */
+		if (next_vma && next_vma->vm_start < address + HPAGE_PMD_SIZE) {
+			result = SCAN_ADDRESS_RANGE;
+			goto out_up_write;
+		}
+	}
+
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
@@ -1255,15 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	/*
 	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
-	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	 * can't run anymore - IF the entire extent has been isolated.
+	 * anon_vma lock may cover a large area: better unlock a.s.a.p.
+	 */	
+	if (order == HPAGE_PMD_ORDER)
+		anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, _address, pte_ptl,
 					   &compound_pagelist, order);
 	if (unlikely(result != SCAN_SUCCEED))
-		goto out_up_write;
+		goto out_unlock_anon_vma;
 
 	/*
 	 * The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1304,6 +1323,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
+out_unlock_anon_vma:
+	if (order != HPAGE_PMD_ORDER)
+		anon_vma_unlock_write(vma->anon_vma);
 out_up_write:
 	if (pte)
 		pte_unmap(pte);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-01 12:58   ` Hugh Dickins
@ 2025-05-01 16:15     ` Andrew Morton
  2025-05-01 22:30     ` Nico Pache
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2025-05-01 16:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett, Jann Horn

On Thu, 1 May 2025 05:58:40 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> There are locking errors in this patch. 

Thanks, Hugh.  I'll remove the v5 series from mm-new.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-04-30 20:51   ` Jann Horn
@ 2025-05-01 22:29     ` Nico Pache
  2025-05-02  6:29       ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-05-01 22:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we don't bail out early and miss potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out or shared
> > pages, we don't perform the collapse.
> [...]
> > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         vma_start_write(vma);
> >         anon_vma_lock_write(vma->anon_vma);
> >
> > -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                               address + HPAGE_PMD_SIZE);
> > +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                               _address + (PAGE_SIZE << order));
> >         mmu_notifier_invalidate_range_start(&range);
> >
> >         pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >         /*
> >          * This removes any huge TLB entry from the CPU so we won't allow
> >          * huge and small TLB entries for the same virtual address to
>
> It's not visible in this diff, but we're about to do a
> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> entire page table, meaning it tears down 2MiB of address space; and it
> assumes that the entire page table exclusively corresponds to the
> current VMA.
>
> I think you'll need to ensure that the pmdp_collapse_flush() only
> happens for full-size THP, and that mTHP only tears down individual
> PTEs in the relevant range. (That code might get a bit messy, since
> the existing THP code tears down PTEs in a detached page table, while
> mTHP would have to do it in a still-attached page table.)
Hi Jann!

I was under the impression that this is needed to prevent GUP-fast
races (and potentially others).
As you state here, conceptually the PMD case is: detach the PMD, do
the collapse, then reinstall the PMD (similar to how the system
recovers from a failed PMD collapse). I tried to keep the current
locking behavior as it seemed the easiest way to get it right (and not
break anything), so I keep the PMD detaching and reinstalling for the
mTHP case too. As Hugh points out, I am releasing the anon_vma lock
too early. I will comment further on his response.

As I familiarize myself with the code more, I do see potential code
improvements/cleanups and locking improvements, but I was going to
leave those to a later series.

Thanks
-- Nico
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-01 12:58   ` Hugh Dickins
  2025-05-01 16:15     ` Andrew Morton
@ 2025-05-01 22:30     ` Nico Pache
  1 sibling, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-05-01 22:30 UTC (permalink / raw)
  To: Hugh Dickins, Jann Horn
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

On Thu, May 1, 2025 at 6:59 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 28 Apr 2025, Nico Pache wrote:
>
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we don't bail out early and miss potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out or shared
> > pages, we don't perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> There are locking errors in this patch.  Let me comment inline below,
> then at the end append the fix patch I'm using, to keep mm-new usable
> for me. But that's more of an emergency rescue than a recommended fixup:
> I don't much like your approach here, and hope it will change in v6.
Hi Hugh,

Thank you for testing, and providing such a detailed report + fix!

>
> > ---
> >  mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 88 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 6e67db86409a..3a846cd70c66 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
>
> I didn't wait to see the problem, just noticed that in the v4->v5
> update, pte gets used at out_up_write, but there are gotos before
> pte has been initialized. Declare pte = NULL here.

Ah thanks I missed that when I moved the PTE unmapping to out_up_write.

>
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> I spent a long time trying to work out why the include/linux/swapops.h:511
> BUG is soon hit - the BUG which tells there's a migration entry left behind
> after its folio has been unlocked.

Can you please share how you reproduce the bug? I haven't been able to
reproduce any BUGs with most of the memory debuggers enabled. This has
also been through a number of mm test suites and some internal
testing. So I'm curious how to hit this issue!
>
> In the patch at the end you'll see that I've inserted a check here, to
> abort if the VMA following the revalidated "vma" is sharing the page table
> (and so also affected by clearing *pmd).  That turned out not to be the
> problem (WARN_ONs inserted never fired in my limited testing), but it still
> looks to me as if some such check is needed.  Or I may be wrong, and
> "revalidate" (a better place for the check) does actually check that, but
> it wasn't obvious, and I haven't spent more time looking at it (but it did
> appear to rule out the case of a VMA before "vma" sharing the page table

Hmm, if I understand this correctly, the check you are adding is
already being handled in the PMD case through
thp_vma_suitable_order(s) (by checking that a PMD can fit in the VMA
at @address); however, I think you are correct: we must check that the
VMA still spans the whole PMD region we are trying to do an mTHP
collapse within... if not, it's possible that a mremap+mmap might have
occurred, allowing a different mapping to belong to the same PMD (this
also makes supporting mTHP collapse in <PMD-sized mappings difficult,
and is the reason we haven't pursued it yet; the locking requirements
explode. The simple solution is, as you noted, to only support
<PMD-sized collapses with a single VMA in the PMD range; this is a
future change).

I actually realized my revalidation is rather weak... mostly because
I'm passing the PMD address, not the mTHP starting address (see
address vs _address in collapse_huge_page). I think your check would
be handled if I pass it _address. If that were the case, the following
check would be performed in thp_vma_suitable_order:

if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
return false;
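
A minimal sketch of that change, assuming hugepage_vma_revalidate()
forwards the address it is given to thp_vma_suitable_order() as in
this series:

	/* revalidate against the mTHP start so the vm_start/vm_end check
	 * covers [_address, _address + (PAGE_SIZE << order)) (sketch) */
	result = hugepage_vma_revalidate(mm, _address, true, &vma, cc, order);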


>
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
>
> mmu_notifiers tend to be rather a mystery to me, so I've made no change
> below, but it's not obvious whether it's correct to clear the *pmd but only
> notify of clearing a subset of that range: what's outside the range soon
> gets replaced as it was, but is that good enough?  I don't know.
Sadly I don't have a good answer for you, as mmu_notifier is mostly a
mystery to me too. I assume the anon_vma_lock_write would prevent any
others from touching the pages we are NOT invalidating here, and that
may be enough. Honestly I think the anon_vma_lock_write allows us to
remove a lot of the locking (but as I stated, I'm trying to keep the
locking the *same* in this series).
>
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
>
> The line I want to comment on does not appear in this diff context:
>         _pmd = pmdp_collapse_flush(vma, address, pmd);
>
> That is appropriate for a THP occupying the whole range of the page table,
> but is a surprising way to handle an "mTHP" of just some of its ptes:
> I would expect you to be invalidating and replacing just those.

The reason for leaving it in the mTHP case was to prevent the
aforementioned GUP-fast race.
* Parallel GUP-fast is fine since GUP-fast will back off when
* it detects PMD is changed.

I tried to keep the same locking principles as the PMD case: remove
the PMD, hold all the same locks, collapse to (m)THP, reinstall the
PMD. This seems the easiest way to get it right and allowed me to
focus on adding the feature. The codebase being scattered with
mentions of potential races, and taking locks that might not be
needed, also made me wary.

I know Dev was looking into optimizations for the locking, and I have
some ideas of ways to improve it, or at least clean up some of the
locking.
>
> And that is the cause of the swapops:511 BUGs: "uninvolved" ptes are
> being temporarily hidden, so not found when remove_migration_ptes()
> goes looking for them.
>
> This reliance on pmdp_collapse_flush() can be rescued, with stricter
> locking (comment below); but I don't like it, and notice Jann has
> picked up on it too.  I hope v6 switches to handling ptes by ptes.
I believe the current approach with your anon_vma_unlock_write fix
would suffice (if the GUP race is a valid reason to keep the pmd
flush), and then we can follow up with locking improvements.
>
> > @@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> >       }
> >
> >       if (unlikely(result != SCAN_SUCCEED)) {
> > -             if (pte)
> > -                     pte_unmap(pte);
> >               spin_lock(pmd_ptl);
> >               BUG_ON(!pmd_none(*pmd));
> >               /*
> > @@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
>
> Phew, it's just visible there in the context.  The anon_vma lock is what
> keeps out racing lookups; so, that anon_vma_unlock_write() (and its
> "All pages are isolated and locked" comment) is appropriate in the
> HPAGE_PMD_SIZEd THP case, but has to be left until later for mTHP ptes.
Makes sense!
>
> But the anon_vma lock may well span a much larger range than the pte
> lock, and the pmd lock certainly spans a much larger range than the
> pte lock; so we really prefer to release anon_vma lock and pmd lock
> as soon as is safe, and use pte lock in preference where possible.
>
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > -     pte_unmap(pte);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> >
> > @@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { /* mTHP collapse */
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
>
> I haven't changed that, but it is odd: yes, pmd_ptl will be required
> when doing the pmd_populate(), but it serves no purpose here when
> fiddling around with ptes in a disconnected page table.
Yes I believe a lot of locking can be improved and some is unnecessary.
>
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             spin_unlock(pmd_ptl);
> > +     }
> >
> >       folio = NULL;
> >
> >       result = SCAN_SUCCEED;
>
> Somewhere around here it becomes safe for mTHP to anon_vma_unlock_write().
>
> >  out_up_write:
> > +     if (pte)
> > +             pte_unmap(pte);
> >       mmap_write_unlock(mm);
> >  out_nolock:
> >       *mmap_locked = false;
> > @@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  {
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> > +     int i;
> >       int result = SCAN_FAIL, referenced = 0;
> >       int none_or_zero = 0, shared = 0;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long _address;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> > +     bool is_pmd_only;
> >       bool writable = false;
> > -
> > +     int chunk_none_count = 0;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> > +
> > +     enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +             tva_flags, THP_ORDERS_ALL_ANON);
> > +
> > +     is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > +
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> >       if (!pte) {
> >               result = SCAN_PMD_NULL;
> >               goto out;
> >       }
> >
> > -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, _address += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             /*
> > +              * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> > +              * there are pages in this chunk keep track of it in the bitmap
> > +              * for mTHP collapsing.
> > +              */
> > +             if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> > +                     if (chunk_none_count <= scaled_none)
> > +                             bitmap_set(cc->mthp_bitmap,
> > +                                        i / KHUGEPAGED_MIN_MTHP_NR, 1);
> > +
> > +                     chunk_none_count = 0;
> > +             }
> > +
> > +             _pte = pte + i;
> > +             _address = address + i * PAGE_SIZE;
> >               pte_t pteval = ptep_get(_pte);
> >               if (is_swap_pte(pteval)) {
> >                       ++unmapped;
> > @@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                     ++chunk_none_count;
> >                       ++none_or_zero;
> >                       if (!userfaultfd_armed(vma) &&
> > -                         (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > +                         (!cc->is_khugepaged || !is_pmd_only ||
> > +                             none_or_zero <= khugepaged_max_ptes_none)) {
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                                                                    address)))
> >                       referenced++;
> >       }
> > +
> >       if (!writable) {
> >               result = SCAN_PAGE_RO;
> >       } else if (cc->is_khugepaged &&
> > @@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > -             result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > +                            mmap_locked, enabled_orders);
> > +             if (result > 0)
> > +                     result = SCAN_SUCCEED;
> > +             else
> > +                     result = SCAN_FAIL;
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > --
> > 2.48.1
>
> Fixes to 07/12 "khugepaged: add mTHP support".
> But I see now that the first hunk is actually not to this 07/12, but to
> 05/12 "khugepaged: generalize __collapse_huge_page_* for mTHP support":
> the mTHP check added in __collapse_huge_page_swapin() forgets to unmap
> and unlock before returning, causing RCU imbalance warnings and lockups.
> I won't separate it out here, let me leave that to you.
Thanks I'll get both of these issues fixed!
>
> And I had other fixes to v4, which you've fixed differently in v5,
> I haven't looked up which patch: where khugepaged_collapse_single_pmd()
> does mmap_read_(un)lock() around collapse_pte_mapped_thp().  I dislike
> your special use of result SCAN_ANY_PROCESS there, because mmap_locked

This is what the revalidate func is doing with
khugepaged_test_exit_or_disable, so I tried to keep it consistent.
Sadly this was the cleanest solution I could come up with. If you don't
mind sharing what your solution to this was, I'd be happy to take a
look!

> is precisely the tool for that job, so just lock and unlock without
> setting *mmap_locked true (but I'd agree that mmap_locked is confusing,
> and offhand wouldn't want to assert exactly what it means - does it
> mean that mmap lock was *never* dropped, so "vma" is safe without
> revalidation? depends on where it's used perhaps).
I agree its use is confusing, and I like to think of it as ONLY the
indicator of it being locked/unlocked. I'm not sure if there were other
intended uses.
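
A minimal sketch of the alternative described above - take the mmap
read lock only around collapse_pte_mapped_thp() and leave *mmap_locked
untouched; the install_pmd argument here is just a placeholder:

	mmap_read_lock(mm);
	result = collapse_pte_mapped_thp(mm, addr, /* install_pmd */ true);
	mmap_read_unlock(mm);
	/* *mmap_locked stays false, so callers revalidate the vma */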
>
> Hugh
>
> ---
>  mm/khugepaged.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c1c637dbcb81..2c814c239d65 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1054,6 +1054,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>
>                 /* Dont swapin for mTHP collapse */
>                 if (order != HPAGE_PMD_ORDER) {
> +                       pte_unmap(pte);
> +                       mmap_read_unlock(mm);
>                         result = SCAN_EXCEED_SWAP_PTE;
>                         goto out;
>                 }
> @@ -1136,7 +1138,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  {
>         LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
> -       pte_t *pte, mthp_pte;
> +       pte_t *pte = NULL, mthp_pte;
>         pgtable_t pgtable;
>         struct folio *folio;
>         spinlock_t *pmd_ptl, *pte_ptl;
> @@ -1208,6 +1210,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         if (result != SCAN_SUCCEED)
>                 goto out_up_write;
>
> +       if (vma->vm_end < address + HPAGE_PMD_SIZE) {
> +               struct vm_area_struct *next_vma = find_vma(mm, vma->vm_end);
> +               /*
> +                * We must not clear *pmd if it is used by the following VMA.
> +                * Well, perhaps we could if it, and all following VMAs using
> +                * this same page table, share the same anon_vma, and so are
> +                * locked out together: but keep it simple for now (and this
> +                * code might better belong in hugepage_vma_revalidate()).
> +                */
> +               if (next_vma && next_vma->vm_start < address + HPAGE_PMD_SIZE) {
> +                       result = SCAN_ADDRESS_RANGE;
> +                       goto out_up_write;
> +               }
> +       }
> +
>         vma_start_write(vma);
>         anon_vma_lock_write(vma->anon_vma);
>
> @@ -1255,15 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
>         /*
>          * All pages are isolated and locked so anon_vma rmap
> -        * can't run anymore.
> -        */
> -       anon_vma_unlock_write(vma->anon_vma);
> +        * can't run anymore - IF the entire extent has been isolated.
> +        * anon_vma lock may cover a large area: better unlock a.s.a.p.
> +        */
> +       if (order == HPAGE_PMD_ORDER)
> +               anon_vma_unlock_write(vma->anon_vma);
>
>         result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>                                            vma, _address, pte_ptl,
>                                            &compound_pagelist, order);
>         if (unlikely(result != SCAN_SUCCEED))
> -               goto out_up_write;
> +               goto out_unlock_anon_vma;
>
>         /*
>          * The smp_wmb() inside __folio_mark_uptodate() ensures the
> @@ -1304,6 +1323,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         folio = NULL;
>
>         result = SCAN_SUCCEED;
> +out_unlock_anon_vma:
> +       if (order != HPAGE_PMD_ORDER)
> +               anon_vma_unlock_write(vma->anon_vma);
>  out_up_write:
>         if (pte)
>                 pte_unmap(pte);
> --
> 2.43.0
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-04-30 18:56     ` Nico Pache
@ 2025-05-01 23:03       ` Nico Pache
  2025-05-02  1:23         ` Baolin Wang
  2025-06-06 16:37       ` Dev Jain
  1 sibling, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-05-01 23:03 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On Wed, Apr 30, 2025 at 12:56 PM Nico Pache <npache@redhat.com> wrote:
>
> On Wed, Apr 30, 2025 at 4:08 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2025/4/29 02:12, Nico Pache wrote:
> > > khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> > > To add mTHP support we use this scan to instead record chunks of utilized
> > > sections of the PMD.
> > >
> > > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > > that represents chunks of utilized regions. We can then determine what
> > > mTHP size fits best and in the following patch, we set this bitmap while
> > > scanning the anon PMD.
> > >
> > > max_ptes_none is used as a scale to determine how "full" an order must
> > > be before being considered for collapse.
> > >
> > > When attempting to collapse an order that is set to "always",
> > > let's always collapse to that order in a greedy manner without
> > > considering the number of bits set.
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >   include/linux/khugepaged.h |  4 ++
> > >   mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
> > >   2 files changed, 89 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > index 1f46046080f5..18fe6eb5051d 100644
> > > --- a/include/linux/khugepaged.h
> > > +++ b/include/linux/khugepaged.h
> > > @@ -1,6 +1,10 @@
> > >   /* SPDX-License-Identifier: GPL-2.0 */
> > >   #ifndef _LINUX_KHUGEPAGED_H
> > >   #define _LINUX_KHUGEPAGED_H
> > > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> >
> > Still better to add some comments to explain explicitly why choose 2 as
> > the MIN_MTHP_ORDER.
> Ok i'll add a note that explicitly states that the min order of anon mTHPs is 2
> >
> > > +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
> > > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> > > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
> > >
> > >   extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index e21998a06253..6e67db86409a 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > >
> > >   static struct kmem_cache *mm_slot_cache __ro_after_init;
> > >
> > > +struct scan_bit_state {
> > > +     u8 order;
> > > +     u16 offset;
> > > +};
> > > +
> > >   struct collapse_control {
> > >       bool is_khugepaged;
> > >
> > > @@ -102,6 +107,18 @@ struct collapse_control {
> > >
> > >       /* nodemask for allocation fallback */
> > >       nodemask_t alloc_nmask;
> > > +
> > > +     /*
> > > +      * bitmap used to collapse mTHP sizes.
> > > +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> > > +      */
> > > +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > > +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > > +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > > +};
> > > +
> > > +struct collapse_control khugepaged_collapse_control = {
> > > +     .is_khugepaged = true,
> > >   };
> > >
> > >   /**
> > > @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
> > >       remove_wait_queue(&khugepaged_wait, &wait);
> > >   }
> > >
> > > -struct collapse_control khugepaged_collapse_control = {
> > > -     .is_khugepaged = true,
> > > -};
> > > -
> > >   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> > >   {
> > >       int i;
> > > @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> > >
> > >   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >                             int referenced, int unmapped,
> > > -                           struct collapse_control *cc)
> > > +                           struct collapse_control *cc, bool *mmap_locked,
> > > +                               u8 order, u16 offset)
> > >   {
> > >       LIST_HEAD(compound_pagelist);
> > >       pmd_t *pmd, _pmd;
> > > @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >        * The allocation can take potentially a long time if it involves
> > >        * sync compaction, and we do not need to hold the mmap_lock during
> > >        * that. We will recheck the vma after taking it again in write mode.
> > > +      * If collapsing mTHPs we may have already released the read_lock.
> > >        */
> > > -     mmap_read_unlock(mm);
> > > +     if (*mmap_locked) {
> > > +             mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > > +     }
> > >
> > >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > >       if (result != SCAN_SUCCEED)
> > > @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >   out_up_write:
> > >       mmap_write_unlock(mm);
> > >   out_nolock:
> > > +     *mmap_locked = false;
> > >       if (folio)
> > >               folio_put(folio);
> > >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > >       return result;
> > >   }
> > >
> > > +// Recursive function to consume the bitmap
> >
> > Nit: please use '/* Xxxx */' for comments in this patch.
> >
> > > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > > +                     int referenced, int unmapped, struct collapse_control *cc,
> > > +                     bool *mmap_locked, unsigned long enabled_orders)
> > > +{
> > > +     u8 order, next_order;
> > > +     u16 offset, mid_offset;
> > > +     int num_chunks;
> > > +     int bits_set, threshold_bits;
> > > +     int top = -1;
> > > +     int collapsed = 0;
> > > +     int ret;
> > > +     struct scan_bit_state state;
> > > +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > > +
> > > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > > +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> > > +
> > > +     while (top >= 0) {
> > > +             state = cc->mthp_bitmap_stack[top--];
> > > +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> > > +             offset = state.offset;
> > > +             num_chunks = 1 << (state.order);
> > > +             // Skip mTHP orders that are not enabled
> > > +             if (!test_bit(order, &enabled_orders))
> > > +                     goto next;
> > > +
> > > +             // copy the relavant section to a new bitmap
> > > +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > > +                               MTHP_BITMAP_SIZE);
> > > +
> > > +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > > +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> > > +                             >> (HPAGE_PMD_ORDER - state.order);
> > > +
> > > +             //Check if the region is "almost full" based on the threshold
> > > +             if (bits_set > threshold_bits || is_pmd_only
> > > +                     || test_bit(order, &huge_anon_orders_always)) {
> >
> > When testing this patch, I disabled the PMD-sized THP and enabled
> > 64K-sized mTHP, but it still attempts to collapse into a PMD-sized THP
> > (since bits_set > threshold_bits is true). This doesn't seem reasonable?
> We are still required to have PMD enabled for mTHP collapse to work.
> It's a limitation of the current khugepaged code (it currently only
> adds mm_slots when PMD is enabled).
> We've discussed this in the past and are looking for a proper way
> forward, but the solution becomes tricky.
>
> However I'm surprised that it still collapses due to the code below.
> I'll test this out later today.

Following up: you are correct, if I disable the PMD size within the
mTHP enabled settings (echo never >
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled) it still
collapses to PMDs. I believe the global variable takes precedence. I'm
not sure what the correct behavior is... I will look into it further.

>     +             if (!test_bit(order, &enabled_orders))
>     +                     goto next;
> >
> > > +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > > +                                     mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> > > +                     if (ret == SCAN_SUCCEED) {
> > > +                             collapsed += (1 << order);
> > > +                             continue;
> > > +                     }
> > > +             }
> > > +
> > > +next:
> > > +             if (state.order > 0) {
> > > +                     next_order = state.order - 1;
> > > +                     mid_offset = offset + (num_chunks / 2);
> > > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > > +                             { next_order, mid_offset };
> > > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > > +                             { next_order, offset };
> > > +                     }
> > > +     }
> > > +     return collapsed;
> > > +}
> > > +
> > >   static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >                                  struct vm_area_struct *vma,
> > >                                  unsigned long address, bool *mmap_locked,
> > > @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >       pte_unmap_unlock(pte, ptl);
> > >       if (result == SCAN_SUCCEED) {
> > >               result = collapse_huge_page(mm, address, referenced,
> > > -                                         unmapped, cc);
> > > -             /* collapse_huge_page will return with the mmap_lock released */
> > > -             *mmap_locked = false;
> > > +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > >       }
> > >   out:
> > >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> >


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-05-01 23:03       ` Nico Pache
@ 2025-05-02  1:23         ` Baolin Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Baolin Wang @ 2025-05-02  1:23 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett



On 2025/5/2 07:03, Nico Pache wrote:
> On Wed, Apr 30, 2025 at 12:56 PM Nico Pache <npache@redhat.com> wrote:
>>
>> On Wed, Apr 30, 2025 at 4:08 AM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 2025/4/29 02:12, Nico Pache wrote:
>>>> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
>>>> To add mTHP support we use this scan to instead record chunks of utilized
>>>> sections of the PMD.
>>>>
>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>>> that represents chunks of utilized regions. We can then determine what
>>>> mTHP size fits best and in the following patch, we set this bitmap while
>>>> scanning the anon PMD.
>>>>
>>>> max_ptes_none is used as a scale to determine how "full" an order must
>>>> be before being considered for collapse.
>>>>
>>>> When attempting to collapse an order whose sysfs setting is "always",
>>>> always collapse to that order in a greedy manner without considering
>>>> the number of bits set.
>>>>
>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>> ---
>>>>    include/linux/khugepaged.h |  4 ++
>>>>    mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>>>>    2 files changed, 89 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>>>> index 1f46046080f5..18fe6eb5051d 100644
>>>> --- a/include/linux/khugepaged.h
>>>> +++ b/include/linux/khugepaged.h
>>>> @@ -1,6 +1,10 @@
>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>    #ifndef _LINUX_KHUGEPAGED_H
>>>>    #define _LINUX_KHUGEPAGED_H
>>>> +#define KHUGEPAGED_MIN_MTHP_ORDER    2
>>>
>>> Still better to add some comments to explain explicitly why choose 2 as
>>> the MIN_MTHP_ORDER.
>> Ok i'll add a note that explicitly states that the min order of anon mTHPs is 2
>>>
>>>> +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
>>>> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
>>>> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>>>>
>>>>    extern unsigned int khugepaged_max_ptes_none __read_mostly;
>>>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index e21998a06253..6e67db86409a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>>>
>>>>    static struct kmem_cache *mm_slot_cache __ro_after_init;
>>>>
>>>> +struct scan_bit_state {
>>>> +     u8 order;
>>>> +     u16 offset;
>>>> +};
>>>> +
>>>>    struct collapse_control {
>>>>        bool is_khugepaged;
>>>>
>>>> @@ -102,6 +107,18 @@ struct collapse_control {
>>>>
>>>>        /* nodemask for allocation fallback */
>>>>        nodemask_t alloc_nmask;
>>>> +
>>>> +     /*
>>>> +      * bitmap used to collapse mTHP sizes.
>>>> +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
>>>> +      */
>>>> +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>>>> +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>>> +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
>>>> +};
>>>> +
>>>> +struct collapse_control khugepaged_collapse_control = {
>>>> +     .is_khugepaged = true,
>>>>    };
>>>>
>>>>    /**
>>>> @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
>>>>        remove_wait_queue(&khugepaged_wait, &wait);
>>>>    }
>>>>
>>>> -struct collapse_control khugepaged_collapse_control = {
>>>> -     .is_khugepaged = true,
>>>> -};
>>>> -
>>>>    static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>>>>    {
>>>>        int i;
>>>> @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>>>
>>>>    static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>                              int referenced, int unmapped,
>>>> -                           struct collapse_control *cc)
>>>> +                           struct collapse_control *cc, bool *mmap_locked,
>>>> +                               u8 order, u16 offset)
>>>>    {
>>>>        LIST_HEAD(compound_pagelist);
>>>>        pmd_t *pmd, _pmd;
>>>> @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>         * The allocation can take potentially a long time if it involves
>>>>         * sync compaction, and we do not need to hold the mmap_lock during
>>>>         * that. We will recheck the vma after taking it again in write mode.
>>>> +      * If collapsing mTHPs we may have already released the read_lock.
>>>>         */
>>>> -     mmap_read_unlock(mm);
>>>> +     if (*mmap_locked) {
>>>> +             mmap_read_unlock(mm);
>>>> +             *mmap_locked = false;
>>>> +     }
>>>>
>>>>        result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>>>        if (result != SCAN_SUCCEED)
>>>> @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>    out_up_write:
>>>>        mmap_write_unlock(mm);
>>>>    out_nolock:
>>>> +     *mmap_locked = false;
>>>>        if (folio)
>>>>                folio_put(folio);
>>>>        trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>>>>        return result;
>>>>    }
>>>>
>>>> +// Recursive function to consume the bitmap
>>>
>>> Nit: please use '/* Xxxx */' for comments in this patch.
>>>
>>>> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
>>>> +                     int referenced, int unmapped, struct collapse_control *cc,
>>>> +                     bool *mmap_locked, unsigned long enabled_orders)
>>>> +{
>>>> +     u8 order, next_order;
>>>> +     u16 offset, mid_offset;
>>>> +     int num_chunks;
>>>> +     int bits_set, threshold_bits;
>>>> +     int top = -1;
>>>> +     int collapsed = 0;
>>>> +     int ret;
>>>> +     struct scan_bit_state state;
>>>> +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
>>>> +
>>>> +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>>> +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
>>>> +
>>>> +     while (top >= 0) {
>>>> +             state = cc->mthp_bitmap_stack[top--];
>>>> +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
>>>> +             offset = state.offset;
>>>> +             num_chunks = 1 << (state.order);
>>>> +             // Skip mTHP orders that are not enabled
>>>> +             if (!test_bit(order, &enabled_orders))
>>>> +                     goto next;
>>>> +
>>>> +             // copy the relevant section to a new bitmap
>>>> +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
>>>> +                               MTHP_BITMAP_SIZE);
>>>> +
>>>> +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
>>>> +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
>>>> +                             >> (HPAGE_PMD_ORDER - state.order);
>>>> +
>>>> +             //Check if the region is "almost full" based on the threshold
>>>> +             if (bits_set > threshold_bits || is_pmd_only
>>>> +                     || test_bit(order, &huge_anon_orders_always)) {
>>>
>>> When testing this patch, I disabled the PMD-sized THP and enabled
>>> 64K-sized mTHP, but it still attempts to collapse into a PMD-sized THP
>>> (since bits_set > threshold_bits is true). This doesn't seem reasonable?
>> We are still required to have PMD enabled for mTHP collapse to work.
>> It's a limitation of the current khugepaged code (it currently only
>> adds mm_slots when PMD is enabled).
>> We've discussed this in the past and are looking for a proper way
>> forward, but the solution becomes tricky.
>>
>> However I'm surprised that it still collapses due to the code below.
>> I'll test this out later today.
> 
> Following up, you are correct, if I disable the PMD size within the
> mTHP enabled settings (echo never >
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled) it still
> collapses to PMDs. I believe the global variable takes precedent. I'm
> not sure what the correct behavior is... I will look into it further
> 
>>      +             if (!test_bit(order, &enabled_orders))
>>      +                     goto next;

IMO, we should respect the mTHP sysfs control interfaces and use the 
'TVA_ENFORCE_SYSFS' flag when determining allowable orders through 
thp_vma_allowable_orders().
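
Roughly something like the following (just a sketch, assuming the current
thp_vma_allowable_orders() signature and THP_ORDERS_ALL_ANON; where the
check slots into the scan path is of course up to the series):

	unsigned long enabled_orders;

	/*
	 * Respect the per-size sysfs knobs for this VMA rather than only
	 * the global setting, so hugepages-2048kB/enabled=never really
	 * does skip the PMD order.
	 */
	enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
						  TVA_ENFORCE_SYSFS,
						  THP_ORDERS_ALL_ANON);
	if (!enabled_orders)
		return SCAN_VMA_CHECK;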

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-01 22:29     ` Nico Pache
@ 2025-05-02  6:29       ` David Hildenbrand
  2025-05-02 12:50         ` Jann Horn
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-05-02  6:29 UTC (permalink / raw)
  To: Nico Pache, Jann Horn
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, baohua, baolin.wang,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On 02.05.25 00:29, Nico Pache wrote:
> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>
>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>> While scanning PMD ranges for potential collapse candidates, keep track
>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>> scan phase so we dont bailout early and miss potential mTHP candidates.
>>>
>>> After the scan is complete we will perform binary recursion on the
>>> bitmap to determine which mTHP size would be most efficient to collapse
>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>> determine how full a THP must be to be eligible.
>>>
>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>> pages, we dont perform the collapse.
>> [...]
>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>          vma_start_write(vma);
>>>          anon_vma_lock_write(vma->anon_vma);
>>>
>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -                               address + HPAGE_PMD_SIZE);
>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>> +                               _address + (PAGE_SIZE << order));
>>>          mmu_notifier_invalidate_range_start(&range);
>>>
>>>          pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> +
>>>          /*
>>>           * This removes any huge TLB entry from the CPU so we won't allow
>>>           * huge and small TLB entries for the same virtual address to
>>
>> It's not visible in this diff, but we're about to do a
>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>> entire page table, meaning it tears down 2MiB of address space; and it
>> assumes that the entire page table exclusively corresponds to the
>> current VMA.
>>
>> I think you'll need to ensure that the pmdp_collapse_flush() only
>> happens for full-size THP, and that mTHP only tears down individual
>> PTEs in the relevant range. (That code might get a bit messy, since
>> the existing THP code tears down PTEs in a detached page table, while
>> mTHP would have to do it in a still-attached page table.)
> Hi Jann!
> 
> I was under the impression that this is needed to prevent GUP-fast
> races (and potentially others).
> As you state here, conceptually the PMD case is, detach the PMD, do
> the collapse, then reinstall the PMD (similarly to how the system
> recovers from a failed PMD collapse). I tried to keep the current
> locking behavior as it seemed the easiest way to get it right (and not
> break anything). So I keep the PMD detaching and reinstalling for the
> mTHP case too. As Hugh points out I am releasing the anon lock too
> early. I will comment further on his response.
> 
> As I familiarize myself with the code more, I do see potential code
> improvements/cleanups and locking improvements, but I was going to
> leave those to a later series.

Right, the simplest approach on top of the current PMD collapse is to do 
exactly what we do in the PMD case, including the locking, which 
apparently is not completely the same yet :).

Instead of installing a PMD THP, we modify the page table and remap that.

Moving from the PMD lock to the PTE lock will not make a big change in 
practice for most cases: we already must disable essentially all page 
table walkers (vma lock, mmap lock in write, rmap lock in write).

The PMDP clear+flush is primarily to disable the last possible set of 
page table walkers: (1) HW modifications and (2) GUP-fast.

So after the PMDP clear+flush we know that (A) HW can not modify the 
pages concurrently and (B) GUP-fast cannot succeed anymore.

The issue with PTEP clear+flush is that we will have to remember all PTE 
values, to reset them if anything goes wrong. Using a single PMD value 
is arguably simpler. And then, the benefit vs. complexity is unclear.

Certainly something to look into later, but not a requirement for the 
first support.

The real challenge/benefit will be looking into avoiding taking all the 
heavyweight locks. Dev has already been thinking about that. For mTHP 
it might be easier than for THPs. Probably it will involve setting PTE 
migration entries whenever we drop the PTL, and dealing with the 
possibility of concurrent zapping of these migration entries.
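
(As a rough illustration of that extra bookkeeping, hypothetical code and
not something the series does, per-PTE clear+flush would look roughly like

	pte_t saved[HPAGE_PMD_NR];	/* would need an allocation in practice */
	int i;

	/* clear and flush each PTE, remembering the old values ... */
	for (i = 0; i < nr_pages; i++)
		saved[i] = ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
					      pte + i);
	flush_tlb_range(vma, addr, addr + nr_pages * PAGE_SIZE);

	/* ... so that on any failure every entry can be put back: */
	for (i = 0; i < nr_pages; i++)
		set_pte_at(mm, addr + i * PAGE_SIZE, pte + i, saved[i]);

whereas the PMD approach saves exactly one pmd_t and restores it with a
single pmd_populate().)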

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02  6:29       ` David Hildenbrand
@ 2025-05-02 12:50         ` Jann Horn
  2025-05-02 15:18           ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Jann Horn @ 2025-05-02 12:50 UTC (permalink / raw)
  To: David Hildenbrand, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, baohua, baolin.wang,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
> On 02.05.25 00:29, Nico Pache wrote:
> > On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
> >>
> >> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> >>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> >>> While scanning PMD ranges for potential collapse candidates, keep track
> >>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> >>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> >>> mTHPs are enabled we remove the restriction of max_ptes_none during the
> >>> scan phase so we dont bailout early and miss potential mTHP candidates.
> >>>
> >>> After the scan is complete we will perform binary recursion on the
> >>> bitmap to determine which mTHP size would be most efficient to collapse
> >>> to. max_ptes_none will be scaled by the attempted collapse order to
> >>> determine how full a THP must be to be eligible.
> >>>
> >>> If a mTHP collapse is attempted, but contains swapped out, or shared
> >>> pages, we dont perform the collapse.
> >> [...]
> >>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>          vma_start_write(vma);
> >>>          anon_vma_lock_write(vma->anon_vma);
> >>>
> >>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> >>> -                               address + HPAGE_PMD_SIZE);
> >>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> >>> +                               _address + (PAGE_SIZE << order));
> >>>          mmu_notifier_invalidate_range_start(&range);
> >>>
> >>>          pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >>> +
> >>>          /*
> >>>           * This removes any huge TLB entry from the CPU so we won't allow
> >>>           * huge and small TLB entries for the same virtual address to
> >>
> >> It's not visible in this diff, but we're about to do a
> >> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> >> entire page table, meaning it tears down 2MiB of address space; and it
> >> assumes that the entire page table exclusively corresponds to the
> >> current VMA.
> >>
> >> I think you'll need to ensure that the pmdp_collapse_flush() only
> >> happens for full-size THP, and that mTHP only tears down individual
> >> PTEs in the relevant range. (That code might get a bit messy, since
> >> the existing THP code tears down PTEs in a detached page table, while
> >> mTHP would have to do it in a still-attached page table.)
> > Hi Jann!
> >
> > I was under the impression that this is needed to prevent GUP-fast
> > races (and potentially others).

Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?

> > As you state here, conceptually the PMD case is, detach the PMD, do
> > the collapse, then reinstall the PMD (similarly to how the system
> > recovers from a failed PMD collapse). I tried to keep the current
> > locking behavior as it seemed the easiest way to get it right (and not
> > break anything). So I keep the PMD detaching and reinstalling for the
> > mTHP case too. As Hugh points out I am releasing the anon lock too
> > early. I will comment further on his response.

As I see it, you're not "keeping" the current locking behavior; you're
making a big implicit locking change by reusing a codepath designed
for PMD THP for mTHP, where the page table may not be exclusively
owned by one VMA.

> > As I familiarize myself with the code more, I do see potential code
> > improvements/cleanups and locking improvements, but I was going to
> > leave those to a later series.
>
> Right, the simplest approach on top of the current PMD collapse is to do
> exactly what we do in the PMD case, including the locking: which
> apparently is no completely the same yet :).
>
> Instead of installing a PMD THP, we modify the page table and remap that.
>
> Moving from the PMD lock to the PTE lock will not make a big change in
> practice for most cases: we already must disable essentially all page
> table walkers (vma lock, mmap lock in write, rmap lock in write).
>
> The PMDP clear+flush is primarily to disable the last possible set of
> page table walkers: (1) HW modifications and (2) GUP-fast.
>
> So after the PMDP clear+flush we know that (A) HW can not modify the
> pages concurrently and (B) GUP-fast cannot succeed anymore.
>
> The issue with PTEP clear+flush is that we will have to remember all PTE
> values, to reset them if anything goes wrong. Using a single PMD value
> is arguably simpler. And then, the benefit vs. complexity is unclear.
>
> Certainly something to look into later, but not a requirement for the
> first support,

As I understand it, one rule we currently have in MM is that an operation
that logically operates on one VMA (VMA 1) does not touch the page
tables of other VMAs (VMA 2) in any way, except that it may walk page
tables that cover address space that intersects with both VMA 1 and
VMA 2, and create such page tables if they are missing.

This proposed patch changes that, without explicitly discussing this
locking change.

Just as one example: I think this patch retracts a page table without
VMA-locking the relevant address space (we hold a VMA lock on VMA 1,
but not on VMA 2), and we then drop the PMD lock after (temporarily)
retracting the page table. At that point, I think a racing fault that
uses the VMA-locked fastpath can observe the empty PMD, and can
install a new page table? Then when collapse_huge_page() tries to
re-add the retracted page table, I think we'll get a BUG_ON(). Similarly,
with concurrent ftruncate() or such trying to zap PTEs, we can probably
end up not zapping PTEs that should have been zapped?
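
Spelled out, the interleaving I'm worried about looks roughly like this
(a sketch, assuming the mTHP path reuses the PMD retract/reinstall flow):

  1. khugepaged, collapsing an mTHP in VMA 1: pmdp_collapse_flush()
     clears the shared PMD, then the PMD lock is dropped.
  2. A fault on VMA 2 (same PMD range; only VMA 1 is write-locked) takes
     the VMA-locked fast path, sees pmd_none() and installs a fresh page
     table; likewise a concurrent zap on VMA 2 finds nothing to zap.
  3. khugepaged re-installs the retracted page table, colliding with the
     table installed in step 2 (BUG_ON()), or bringing back PTEs that the
     zap in step 2 should have cleared.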

> The real challenge/benefit will be looking into avoiding taking all the
> heavy weight locks. Dev has already been thinking about that. For mTHP
> it might be easier than for THPs. Probably it will involve setting PTE
> migration entries whenever we drop the PTL, and dealing with the
> possibility of concurrent zapping of these migration entries.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02 12:50         ` Jann Horn
@ 2025-05-02 15:18           ` David Hildenbrand
  2025-05-02 15:24             ` Lorenzo Stoakes
  2025-05-02 15:26             ` Jann Horn
  0 siblings, 2 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-05-02 15:18 UTC (permalink / raw)
  To: Jann Horn, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, baohua, baolin.wang,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On 02.05.25 14:50, Jann Horn wrote:
> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
>> On 02.05.25 00:29, Nico Pache wrote:
>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>>>
>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>>>> While scanning PMD ranges for potential collapse candidates, keep track
>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
>>>>>
>>>>> After the scan is complete we will perform binary recursion on the
>>>>> bitmap to determine which mTHP size would be most efficient to collapse
>>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>>> determine how full a THP must be to be eligible.
>>>>>
>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>>>> pages, we dont perform the collapse.
>>>> [...]
>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>>           vma_start_write(vma);
>>>>>           anon_vma_lock_write(vma->anon_vma);
>>>>>
>>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>>>> -                               address + HPAGE_PMD_SIZE);
>>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>>>> +                               _address + (PAGE_SIZE << order));
>>>>>           mmu_notifier_invalidate_range_start(&range);
>>>>>
>>>>>           pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>>>> +
>>>>>           /*
>>>>>            * This removes any huge TLB entry from the CPU so we won't allow
>>>>>            * huge and small TLB entries for the same virtual address to
>>>>
>>>> It's not visible in this diff, but we're about to do a
>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>>>> entire page table, meaning it tears down 2MiB of address space; and it
>>>> assumes that the entire page table exclusively corresponds to the
>>>> current VMA.
>>>>
>>>> I think you'll need to ensure that the pmdp_collapse_flush() only
>>>> happens for full-size THP, and that mTHP only tears down individual
>>>> PTEs in the relevant range. (That code might get a bit messy, since
>>>> the existing THP code tears down PTEs in a detached page table, while
>>>> mTHP would have to do it in a still-attached page table.)
>>> Hi Jann!
>>>
>>> I was under the impression that this is needed to prevent GUP-fast
>>> races (and potentially others).
> 
> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> 
>>> As you state here, conceptually the PMD case is, detach the PMD, do
>>> the collapse, then reinstall the PMD (similarly to how the system
>>> recovers from a failed PMD collapse). I tried to keep the current
>>> locking behavior as it seemed the easiest way to get it right (and not
>>> break anything). So I keep the PMD detaching and reinstalling for the
>>> mTHP case too. As Hugh points out I am releasing the anon lock too
>>> early. I will comment further on his response.
> 
> As I see it, you're not "keeping" the current locking behavior; you're
> making a big implicit locking change by reusing a codepath designed
> for PMD THP for mTHP, where the page table may not be exclusively
> owned by one VMA.

That is not the intention. The intention in this series (at least as we 
discussed) was to not do it across VMAs; that is considered the next 
logical step (which will be especially relevant on arm64 IMHO).

> 
>>> As I familiarize myself with the code more, I do see potential code
>>> improvements/cleanups and locking improvements, but I was going to
>>> leave those to a later series.
>>
>> Right, the simplest approach on top of the current PMD collapse is to do
>> exactly what we do in the PMD case, including the locking: which
>> apparently is no completely the same yet :).
>>
>> Instead of installing a PMD THP, we modify the page table and remap that.
>>
>> Moving from the PMD lock to the PTE lock will not make a big change in
>> practice for most cases: we already must disable essentially all page
>> table walkers (vma lock, mmap lock in write, rmap lock in write).
>>
>> The PMDP clear+flush is primarily to disable the last possible set of
>> page table walkers: (1) HW modifications and (2) GUP-fast.
>>
>> So after the PMDP clear+flush we know that (A) HW can not modify the
>> pages concurrently and (B) GUP-fast cannot succeed anymore.
>>
>> The issue with PTEP clear+flush is that we will have to remember all PTE
>> values, to reset them if anything goes wrong. Using a single PMD value
>> is arguably simpler. And then, the benefit vs. complexity is unclear.
>>
>> Certainly something to look into later, but not a requirement for the
>> first support,
> 
> As I understand, one rule we currently have in MM is that an operation
> that logically operates on one VMA (VMA 1) does not touch the page
> tables of other VMAs (VMA 2) in any way, except that it may walk page
> tables that cover address space that intersects with both VMA 1 and
> VMA 2, and create such page tables if they are missing.

Yes, absolutely. That must not happen. And I think I raised it as a 
problem in reply to one of Dev's series.

If this series does not respect that rule, it must be fixed.

> 
> This proposed patch changes that, without explicitly discussing this
> locking change.

Yes, that must not happen. We must not zap a PMD to temporarily replace 
it with a pmd_none() entry if any other sane page table walker could 
stumble over it.

This includes another VMA that is not write-locked that could span the PMD.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02 15:18           ` David Hildenbrand
@ 2025-05-02 15:24             ` Lorenzo Stoakes
  2025-05-02 15:39               ` David Hildenbrand
  2025-05-02 15:26             ` Jann Horn
  1 sibling, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-05-02 15:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, akpm, corbet, rostedt, mhiramat,
	mathieu.desnoyers, baohua, baolin.wang, ryan.roberts, willy,
	peterx, ziy, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, dev.jain, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, Liam.Howlett

On Fri, May 02, 2025 at 05:18:54PM +0200, David Hildenbrand wrote:
> On 02.05.25 14:50, Jann Horn wrote:
> > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
> > > On 02.05.25 00:29, Nico Pache wrote:
> > > > On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> > > > > > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > > > > > While scanning PMD ranges for potential collapse candidates, keep track
> > > > > > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > > > > > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > > > > > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > > > > > scan phase so we dont bailout early and miss potential mTHP candidates.
> > > > > >
> > > > > > After the scan is complete we will perform binary recursion on the
> > > > > > bitmap to determine which mTHP size would be most efficient to collapse
> > > > > > to. max_ptes_none will be scaled by the attempted collapse order to
> > > > > > determine how full a THP must be to be eligible.
> > > > > >
> > > > > > If a mTHP collapse is attempted, but contains swapped out, or shared
> > > > > > pages, we dont perform the collapse.
> > > > > [...]
> > > > > > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > > >           vma_start_write(vma);
> > > > > >           anon_vma_lock_write(vma->anon_vma);
> > > > > >
> > > > > > -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > > > -                               address + HPAGE_PMD_SIZE);
> > > > > > +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > > > > > +                               _address + (PAGE_SIZE << order));
> > > > > >           mmu_notifier_invalidate_range_start(&range);
> > > > > >
> > > > > >           pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > > > +
> > > > > >           /*
> > > > > >            * This removes any huge TLB entry from the CPU so we won't allow
> > > > > >            * huge and small TLB entries for the same virtual address to
> > > > >
> > > > > It's not visible in this diff, but we're about to do a
> > > > > pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > > > > entire page table, meaning it tears down 2MiB of address space; and it
> > > > > assumes that the entire page table exclusively corresponds to the
> > > > > current VMA.
> > > > >
> > > > > I think you'll need to ensure that the pmdp_collapse_flush() only
> > > > > happens for full-size THP, and that mTHP only tears down individual
> > > > > PTEs in the relevant range. (That code might get a bit messy, since
> > > > > the existing THP code tears down PTEs in a detached page table, while
> > > > > mTHP would have to do it in a still-attached page table.)
> > > > Hi Jann!
> > > >
> > > > I was under the impression that this is needed to prevent GUP-fast
> > > > races (and potentially others).
> >
> > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> >
> > > > As you state here, conceptually the PMD case is, detach the PMD, do
> > > > the collapse, then reinstall the PMD (similarly to how the system
> > > > recovers from a failed PMD collapse). I tried to keep the current
> > > > locking behavior as it seemed the easiest way to get it right (and not
> > > > break anything). So I keep the PMD detaching and reinstalling for the
> > > > mTHP case too. As Hugh points out I am releasing the anon lock too
> > > > early. I will comment further on his response.
> >
> > As I see it, you're not "keeping" the current locking behavior; you're
> > making a big implicit locking change by reusing a codepath designed
> > for PMD THP for mTHP, where the page table may not be exclusively
> > owned by one VMA.
>
> That is not the intention. The intention in this series (at least as we
> discussed) was to not do it across VMAs; that is considered the next logical
> step (which will be especially relevant on arm64 IMHO).
>
> >
> > > > As I familiarize myself with the code more, I do see potential code
> > > > improvements/cleanups and locking improvements, but I was going to
> > > > leave those to a later series.
> > >
> > > Right, the simplest approach on top of the current PMD collapse is to do
> > > exactly what we do in the PMD case, including the locking: which
> > > apparently is no completely the same yet :).
> > >
> > > Instead of installing a PMD THP, we modify the page table and remap that.
> > >
> > > Moving from the PMD lock to the PTE lock will not make a big change in
> > > practice for most cases: we already must disable essentially all page
> > > table walkers (vma lock, mmap lock in write, rmap lock in write).
> > >
> > > The PMDP clear+flush is primarily to disable the last possible set of
> > > page table walkers: (1) HW modifications and (2) GUP-fast.
> > >
> > > So after the PMDP clear+flush we know that (A) HW can not modify the
> > > pages concurrently and (B) GUP-fast cannot succeed anymore.
> > >
> > > The issue with PTEP clear+flush is that we will have to remember all PTE
> > > values, to reset them if anything goes wrong. Using a single PMD value
> > > is arguably simpler. And then, the benefit vs. complexity is unclear.
> > >
> > > Certainly something to look into later, but not a requirement for the
> > > first support,
> >
> > As I understand, one rule we currently have in MM is that an operation
> > that logically operates on one VMA (VMA 1) does not touch the page
> > tables of other VMAs (VMA 2) in any way, except that it may walk page
> > tables that cover address space that intersects with both VMA 1 and
> > VMA 2, and create such page tables if they are missing.
>
> Yes, absolutely. That must not happen. And I think I raised it as a problem
> in reply to one of Dev's series.
>
> If this series does not rely on that it must be fixed.
>
> >
> > This proposed patch changes that, without explicitly discussing this
> > locking change.
>
> Yes, that must not happen. We must not zap a PMD to temporarily replace it
> with a pmd_none() entry if any other sane page table walker could stumble
> over it.
>
> This includes another VMA that is not write-locked that could span the PMD.

I feel like we should document these restrictions somewhere :)

Perhaps in a new page table walker doc, or on the
https://origin.kernel.org/doc/html/latest/mm/process_addrs.html page.

Which sounds like I'm volunteering myself to do so, doesn't it...

[adds to todo...]

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02 15:18           ` David Hildenbrand
  2025-05-02 15:24             ` Lorenzo Stoakes
@ 2025-05-02 15:26             ` Jann Horn
  2025-05-02 15:30               ` Nico Pache
  1 sibling, 1 reply; 41+ messages in thread
From: Jann Horn @ 2025-05-02 15:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
	akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, baohua,
	baolin.wang, ryan.roberts, willy, peterx, ziy, wangkefeng.wang,
	usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
	kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
	catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
	surenb, zokeefe, hannes, rientjes, mhocko, rdunlap,
	lorenzo.stoakes, Liam.Howlett

On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.05.25 14:50, Jann Horn wrote:
> > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
> >> On 02.05.25 00:29, Nico Pache wrote:
> >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
> >>>>
> >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> >>>>> While scanning PMD ranges for potential collapse candidates, keep track
> >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
> >>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
> >>>>>
> >>>>> After the scan is complete we will perform binary recursion on the
> >>>>> bitmap to determine which mTHP size would be most efficient to collapse
> >>>>> to. max_ptes_none will be scaled by the attempted collapse order to
> >>>>> determine how full a THP must be to be eligible.
> >>>>>
> >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
> >>>>> pages, we dont perform the collapse.
> >>>> [...]
> >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>>>           vma_start_write(vma);
> >>>>>           anon_vma_lock_write(vma->anon_vma);
> >>>>>
> >>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> >>>>> -                               address + HPAGE_PMD_SIZE);
> >>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> >>>>> +                               _address + (PAGE_SIZE << order));
> >>>>>           mmu_notifier_invalidate_range_start(&range);
> >>>>>
> >>>>>           pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >>>>> +
> >>>>>           /*
> >>>>>            * This removes any huge TLB entry from the CPU so we won't allow
> >>>>>            * huge and small TLB entries for the same virtual address to
> >>>>
> >>>> It's not visible in this diff, but we're about to do a
> >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> >>>> entire page table, meaning it tears down 2MiB of address space; and it
> >>>> assumes that the entire page table exclusively corresponds to the
> >>>> current VMA.
> >>>>
> >>>> I think you'll need to ensure that the pmdp_collapse_flush() only
> >>>> happens for full-size THP, and that mTHP only tears down individual
> >>>> PTEs in the relevant range. (That code might get a bit messy, since
> >>>> the existing THP code tears down PTEs in a detached page table, while
> >>>> mTHP would have to do it in a still-attached page table.)
> >>> Hi Jann!
> >>>
> >>> I was under the impression that this is needed to prevent GUP-fast
> >>> races (and potentially others).
> >
> > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> >
> >>> As you state here, conceptually the PMD case is, detach the PMD, do
> >>> the collapse, then reinstall the PMD (similarly to how the system
> >>> recovers from a failed PMD collapse). I tried to keep the current
> >>> locking behavior as it seemed the easiest way to get it right (and not
> >>> break anything). So I keep the PMD detaching and reinstalling for the
> >>> mTHP case too. As Hugh points out I am releasing the anon lock too
> >>> early. I will comment further on his response.
> >
> > As I see it, you're not "keeping" the current locking behavior; you're
> > making a big implicit locking change by reusing a codepath designed
> > for PMD THP for mTHP, where the page table may not be exclusively
> > owned by one VMA.
>
> That is not the intention. The intention in this series (at least as we
> discussed) was to not do it across VMAs; that is considered the next
> logical step (which will be especially relevant on arm64 IMHO).

Ah, so for now this is supposed to only work for PTEs which are in a
PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
then try to collapse its contents to an order-2 mTHP page, that should
just not work?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02 15:26             ` Jann Horn
@ 2025-05-02 15:30               ` Nico Pache
  2025-05-02 15:42                 ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Nico Pache @ 2025-05-02 15:30 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, akpm, corbet, rostedt, mhiramat,
	mathieu.desnoyers, baohua, baolin.wang, ryan.roberts, willy,
	peterx, ziy, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, dev.jain, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 02.05.25 14:50, Jann Horn wrote:
> > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
> > >> On 02.05.25 00:29, Nico Pache wrote:
> > >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
> > >>>>
> > >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> > >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > >>>>> While scanning PMD ranges for potential collapse candidates, keep track
> > >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
> > >>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
> > >>>>>
> > >>>>> After the scan is complete we will perform binary recursion on the
> > >>>>> bitmap to determine which mTHP size would be most efficient to collapse
> > >>>>> to. max_ptes_none will be scaled by the attempted collapse order to
> > >>>>> determine how full a THP must be to be eligible.
> > >>>>>
> > >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
> > >>>>> pages, we dont perform the collapse.
> > >>>> [...]
> > >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >>>>>           vma_start_write(vma);
> > >>>>>           anon_vma_lock_write(vma->anon_vma);
> > >>>>>
> > >>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > >>>>> -                               address + HPAGE_PMD_SIZE);
> > >>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > >>>>> +                               _address + (PAGE_SIZE << order));
> > >>>>>           mmu_notifier_invalidate_range_start(&range);
> > >>>>>
> > >>>>>           pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > >>>>> +
> > >>>>>           /*
> > >>>>>            * This removes any huge TLB entry from the CPU so we won't allow
> > >>>>>            * huge and small TLB entries for the same virtual address to
> > >>>>
> > >>>> It's not visible in this diff, but we're about to do a
> > >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > >>>> entire page table, meaning it tears down 2MiB of address space; and it
> > >>>> assumes that the entire page table exclusively corresponds to the
> > >>>> current VMA.
> > >>>>
> > >>>> I think you'll need to ensure that the pmdp_collapse_flush() only
> > >>>> happens for full-size THP, and that mTHP only tears down individual
> > >>>> PTEs in the relevant range. (That code might get a bit messy, since
> > >>>> the existing THP code tears down PTEs in a detached page table, while
> > >>>> mTHP would have to do it in a still-attached page table.)
> > >>> Hi Jann!
> > >>>
> > >>> I was under the impression that this is needed to prevent GUP-fast
> > >>> races (and potentially others).
> > >
> > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> > >
> > >>> As you state here, conceptually the PMD case is, detach the PMD, do
> > >>> the collapse, then reinstall the PMD (similarly to how the system
> > >>> recovers from a failed PMD collapse). I tried to keep the current
> > >>> locking behavior as it seemed the easiest way to get it right (and not
> > >>> break anything). So I keep the PMD detaching and reinstalling for the
> > >>> mTHP case too. As Hugh points out I am releasing the anon lock too
> > >>> early. I will comment further on his response.
> > >
> > > As I see it, you're not "keeping" the current locking behavior; you're
> > > making a big implicit locking change by reusing a codepath designed
> > > for PMD THP for mTHP, where the page table may not be exclusively
> > > owned by one VMA.
> >
> > That is not the intention. The intention in this series (at least as we
> > discussed) was to not do it across VMAs; that is considered the next
> > logical step (which will be especially relevant on arm64 IMHO).
>
> Ah, so for now this is supposed to only work for PTEs which are in a
> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
> then try to collapse its contents to an order-2 mTHP page, that should
> just not work?
Correct! As I stated in reply to Hugh, the locking conditions explode
if we drop that requirement. A simple workaround we've considered is
only collapsing if a single VMA intersects a PMD. I can make sure this
is clearer in the cover letter + this patch.
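
For reference, the "fully covered by the VMA" restriction could be
expressed with a check along these lines (a hypothetical sketch, not the
posted code, similar in spirit to what transhuge_vma_suitable() already
does for the PMD case):

	unsigned long pmd_start = address & HPAGE_PMD_MASK;

	/*
	 * Only collapse when this VMA spans the whole PMD range, so the
	 * PMD retract/reinstall never touches page tables shared with a
	 * neighbouring VMA.
	 */
	if (vma->vm_start > pmd_start ||
	    vma->vm_end < pmd_start + HPAGE_PMD_SIZE)
		return SCAN_VMA_CHECK;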

 Cheers,
-- Nico
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02 15:24             ` Lorenzo Stoakes
@ 2025-05-02 15:39               ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-05-02 15:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jann Horn, Nico Pache, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, akpm, corbet, rostedt, mhiramat,
	mathieu.desnoyers, baohua, baolin.wang, ryan.roberts, willy,
	peterx, ziy, wangkefeng.wang, usamaarif642, sunnanyong,
	vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
	raquini, dev.jain, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, Liam.Howlett

On 02.05.25 17:24, Lorenzo Stoakes wrote:
> On Fri, May 02, 2025 at 05:18:54PM +0200, David Hildenbrand wrote:
>> On 02.05.25 14:50, Jann Horn wrote:
>>> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
>>>> On 02.05.25 00:29, Nico Pache wrote:
>>>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>>>>>
>>>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>>>>>> While scanning PMD ranges for potential collapse candidates, keep track
>>>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>>>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
>>>>>>>
>>>>>>> After the scan is complete we will perform binary recursion on the
>>>>>>> bitmap to determine which mTHP size would be most efficient to collapse
>>>>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>>>>> determine how full a THP must be to be eligible.
>>>>>>>
>>>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>>>>>> pages, we dont perform the collapse.
>>>>>> [...]
>>>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>>>>            vma_start_write(vma);
>>>>>>>            anon_vma_lock_write(vma->anon_vma);
>>>>>>>
>>>>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>>>>>> -                               address + HPAGE_PMD_SIZE);
>>>>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>>>>>> +                               _address + (PAGE_SIZE << order));
>>>>>>>            mmu_notifier_invalidate_range_start(&range);
>>>>>>>
>>>>>>>            pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>>>>>> +
>>>>>>>            /*
>>>>>>>             * This removes any huge TLB entry from the CPU so we won't allow
>>>>>>>             * huge and small TLB entries for the same virtual address to
>>>>>>
>>>>>> It's not visible in this diff, but we're about to do a
>>>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>>>>>> entire page table, meaning it tears down 2MiB of address space; and it
>>>>>> assumes that the entire page table exclusively corresponds to the
>>>>>> current VMA.
>>>>>>
>>>>>> I think you'll need to ensure that the pmdp_collapse_flush() only
>>>>>> happens for full-size THP, and that mTHP only tears down individual
>>>>>> PTEs in the relevant range. (That code might get a bit messy, since
>>>>>> the existing THP code tears down PTEs in a detached page table, while
>>>>>> mTHP would have to do it in a still-attached page table.)
>>>>> Hi Jann!
>>>>>
>>>>> I was under the impression that this is needed to prevent GUP-fast
>>>>> races (and potentially others).
>>>
>>> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
>>>
>>>>> As you state here, conceptually the PMD case is, detach the PMD, do
>>>>> the collapse, then reinstall the PMD (similarly to how the system
>>>>> recovers from a failed PMD collapse). I tried to keep the current
>>>>> locking behavior as it seemed the easiest way to get it right (and not
>>>>> break anything). So I keep the PMD detaching and reinstalling for the
>>>>> mTHP case too. As Hugh points out I am releasing the anon lock too
>>>>> early. I will comment further on his response.
>>>
>>> As I see it, you're not "keeping" the current locking behavior; you're
>>> making a big implicit locking change by reusing a codepath designed
>>> for PMD THP for mTHP, where the page table may not be exclusively
>>> owned by one VMA.
>>
>> That is not the intention. The intention in this series (at least as we
>> discussed) was to not do it across VMAs; that is considered the next logical
>> step (which will be especially relevant on arm64 IMHO).
>>
>>>
>>>>> As I familiarize myself with the code more, I do see potential code
>>>>> improvements/cleanups and locking improvements, but I was going to
>>>>> leave those to a later series.
>>>>
>>>> Right, the simplest approach on top of the current PMD collapse is to do
>>>> exactly what we do in the PMD case, including the locking: which
>>>> apparently is no completely the same yet :).
>>>>
>>>> Instead of installing a PMD THP, we modify the page table and remap that.
>>>>
>>>> Moving from the PMD lock to the PTE lock will not make a big change in
>>>> practice for most cases: we already must disable essentially all page
>>>> table walkers (vma lock, mmap lock in write, rmap lock in write).
>>>>
>>>> The PMDP clear+flush is primarily to disable the last possible set of
>>>> page table walkers: (1) HW modifications and (2) GUP-fast.
>>>>
>>>> So after the PMDP clear+flush we know that (A) HW can not modify the
>>>> pages concurrently and (B) GUP-fast cannot succeed anymore.
>>>>
>>>> The issue with PTEP clear+flush is that we will have to remember all PTE
>>>> values, to reset them if anything goes wrong. Using a single PMD value
>>>> is arguably simpler. And then, the benefit vs. complexity is unclear.
>>>>
>>>> Certainly something to look into later, but not a requirement for the
>>>> first support,
>>>
>>> As I understand, one rule we currently have in MM is that an operation
>>> that logically operates on one VMA (VMA 1) does not touch the page
>>> tables of other VMAs (VMA 2) in any way, except that it may walk page
>>> tables that cover address space that intersects with both VMA 1 and
>>> VMA 2, and create such page tables if they are missing.
>>
>> Yes, absolutely. That must not happen. And I think I raised it as a problem
>> in reply to one of Dev's series.
>>
>> If this series does not rely on that it must be fixed.
>>
>>>
>>> This proposed patch changes that, without explicitly discussing this
>>> locking change.
>>
>> Yes, that must not happen. We must not zap a PMD to temporarily replace it
>> with a pmd_none() entry if any other sane page table walker could stumble
>> over it.
>>
>> This includes another VMA that is not write-locked that could span the PMD.
> 
> I feel like we should document these restrictions somewhere :)
> 
> Perhaps in a new page table walker doc, or on the
> https://origin.kernel.org/doc/html/latest/mm/process_addrs.html page.
> 
> Which sounds like I'm volunteering myself to do so doesn't it...
> 
> [adds to todo...]

:) that would be nice.

Yeah, I mean this is very subtle, but essentially: unless you exclude 
all page table walkers (well, okay, HW and GUP-fast are a bit special), 
temporarily replacing a present PTE by pte_none() will cause trouble. 
Same for PMDs.

That's also one of the problems of looking into only using the PTE table 
lock and not any other heavy-weight locking in khugepaged. As soon as we 
temporarily unlock the PTE table, but have to temporarily unmap the PTEs 
(->pte_none()) as well, we need something else (e.g., migration entry) 
to tell other page table walkers to back off and wait for us to re-take 
the PTL lock and finish. Zapping must still be able to continue, at 
which point it gets hairy ...

... which is why I'm hoping that we can play with that once we have the 
basics running. With the intent of reusing the existing locking scheme 
-> single VMA spans the PMD.
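
(As a rough illustration of the "back off" idea, hypothetical and not
something this series does: before dropping the PTL, each present PTE
could be replaced by a migration entry instead of pte_none(), e.g.

	swp_entry_t entry;

	/* make other walkers wait/retry instead of seeing pte_none() */
	entry = make_readable_migration_entry(page_to_pfn(page));
	set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));

and after re-taking the PTL we either map the new (m)THP or restore the
old PTE, while coping with a concurrent zap that may already have
cleared the migration entry.)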

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 07/12] khugepaged: add mTHP support
  2025-05-02 15:30               ` Nico Pache
@ 2025-05-02 15:42                 ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-05-02 15:42 UTC (permalink / raw)
  To: Nico Pache, Jann Horn
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, baohua, baolin.wang,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, dev.jain, anshuman.khandual, catalin.marinas,
	tiwai, will, dave.hansen, jack, cl, jglisse, surenb, zokeefe,
	hannes, rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett

On 02.05.25 17:30, Nico Pache wrote:
> On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@google.com> wrote:
>>
>> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 02.05.25 14:50, Jann Horn wrote:
>>>> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
>>>>> On 02.05.25 00:29, Nico Pache wrote:
>>>>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>>>>>>
>>>>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>>>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>>>>>>> While scanning PMD ranges for potential collapse candidates, keep track
>>>>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>>>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>>>>>>> scan phase so we don't bail out early and miss potential mTHP candidates.
>>>>>>>>
>>>>>>>> After the scan is complete we will perform binary recursion on the
>>>>>>>> bitmap to determine which mTHP size would be most efficient to collapse
>>>>>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>>>>>> determine how full a THP must be to be eligible.
>>>>>>>>
>>>>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>>>>>>> pages, we don't perform the collapse.
>>>>>>> [...]
>>>>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>>>>>            vma_start_write(vma);
>>>>>>>>            anon_vma_lock_write(vma->anon_vma);
>>>>>>>>
>>>>>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>>>>>>> -                               address + HPAGE_PMD_SIZE);
>>>>>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>>>>>>> +                               _address + (PAGE_SIZE << order));
>>>>>>>>            mmu_notifier_invalidate_range_start(&range);
>>>>>>>>
>>>>>>>>            pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>>>>>>> +
>>>>>>>>            /*
>>>>>>>>             * This removes any huge TLB entry from the CPU so we won't allow
>>>>>>>>             * huge and small TLB entries for the same virtual address to
>>>>>>>
>>>>>>> It's not visible in this diff, but we're about to do a
>>>>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>>>>>>> entire page table, meaning it tears down 2MiB of address space; and it
>>>>>>> assumes that the entire page table exclusively corresponds to the
>>>>>>> current VMA.
>>>>>>>
>>>>>>> I think you'll need to ensure that the pmdp_collapse_flush() only
>>>>>>> happens for full-size THP, and that mTHP only tears down individual
>>>>>>> PTEs in the relevant range. (That code might get a bit messy, since
>>>>>>> the existing THP code tears down PTEs in a detached page table, while
>>>>>>> mTHP would have to do it in a still-attached page table.)
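
A quick user-space illustration of the size difference described here 
(purely illustrative numbers, assuming a 64-bit build with 4K pages, so 
HPAGE_PMD_ORDER == 9): a PMD-level teardown always covers the full 2MiB 
page table, while an mTHP collapse only needs the PTEs of its own, much 
smaller range.

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PMD_ORDER  9
#define PMD_SIZE   (PAGE_SIZE << PMD_ORDER)
#define PMD_MASK   (~(PMD_SIZE - 1))

int main(void)
{
	unsigned long addr = 0x7f0000251000UL; /* arbitrary example address */
	int order = 4;                         /* 64K mTHP on 4K pages */

	unsigned long mthp_start = addr & ~((PAGE_SIZE << order) - 1);
	unsigned long mthp_end   = mthp_start + (PAGE_SIZE << order);
	unsigned long pmd_start  = addr & PMD_MASK;
	unsigned long pmd_end    = pmd_start + PMD_SIZE;

	printf("mTHP range: %#lx-%#lx (%lu KiB)\n",
	       mthp_start, mthp_end, (mthp_end - mthp_start) >> 10);
	printf("PMD  range: %#lx-%#lx (%lu KiB)\n",
	       pmd_start, pmd_end, (pmd_end - pmd_start) >> 10);
	return 0;
}
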
>>>>>> Hi Jann!
>>>>>>
>>>>>> I was under the impression that this is needed to prevent GUP-fast
>>>>>> races (and potentially others).
>>>>
>>>> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
>>>>
>>>>>> As you state here, conceptually the PMD case is, detach the PMD, do
>>>>>> the collapse, then reinstall the PMD (similarly to how the system
>>>>>> recovers from a failed PMD collapse). I tried to keep the current
>>>>>> locking behavior as it seemed the easiest way to get it right (and not
>>>>>> break anything). So I keep the PMD detaching and reinstalling for the
>>>>>> mTHP case too. As Hugh points out I am releasing the anon lock too
>>>>>> early. I will comment further on his response.
>>>>
>>>> As I see it, you're not "keeping" the current locking behavior; you're
>>>> making a big implicit locking change by reusing a codepath designed
>>>> for PMD THP for mTHP, where the page table may not be exclusively
>>>> owned by one VMA.
>>>
>>> That is not the intention. The intention in this series (at least as we
>>> discussed) was to not do it across VMAs; that is considered the next
>>> logical step (which will be especially relevant on arm64 IMHO).
>>
>> Ah, so for now this is supposed to only work for PTEs which are in a
>> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
>> then try to collapse its contents to an order-2 mTHP page, that should
>> just not work?
> Correct! As I stated in reply to Hugh, the locking conditions explode
> if we drop that requirement.

Right. Adding to that, one could evaluate how much we would gain by 
trying to lock, say, up to $MAGIC_NUMBER related VMAs.

Of course, if no other VMA spans the PMD, and the VMA only covers it 
partially, it is likely still fine as long as we hold the mmap lock in 
write mode.

But probably, looking into a different locking scheme would be 
beneficial at this point.
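
To spell out the restriction we are talking about, a minimal sketch 
(user-space, hypothetical names, nothing taken from the kernel) of the 
check that rules out Jann's 16KiB-VMA example: only proceed when a 
single VMA fully covers the PMD range.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define PMD_SIZE  (PAGE_SIZE << 9)   /* 2 MiB with 4K pages */
#define PMD_MASK  (~(PMD_SIZE - 1))

struct vma {
	unsigned long start;
	unsigned long end;
};

/* true only if the PMD containing @addr lies entirely inside @vma */
static bool vma_spans_pmd(const struct vma *vma, unsigned long addr)
{
	unsigned long pmd_start = addr & PMD_MASK;

	return vma->start <= pmd_start && vma->end >= pmd_start + PMD_SIZE;
}

int main(void)
{
	struct vma small = { 0x400000, 0x404000 };  /* 16 KiB VMA */
	struct vma big   = { 0x400000, 0x800000 };  /* 4 MiB VMA  */

	printf("16K VMA spans its PMD: %d\n", vma_spans_pmd(&small, 0x401000));
	printf("4M  VMA spans its PMD: %d\n", vma_spans_pmd(&big,   0x401000));
	return 0;
}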

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-04-30 18:56     ` Nico Pache
  2025-05-01 23:03       ` Nico Pache
@ 2025-06-06 16:37       ` Dev Jain
  2025-06-07 12:48         ` Nico Pache
  1 sibling, 1 reply; 41+ messages in thread
From: Dev Jain @ 2025-06-06 16:37 UTC (permalink / raw)
  To: Nico Pache, Baolin Wang
  Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
	corbet, rostedt, mhiramat, mathieu.desnoyers, david, baohua,
	ryan.roberts, willy, peterx, ziy, wangkefeng.wang, usamaarif642,
	sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
	aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
	will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
	rientjes, mhocko, rdunlap, lorenzo.stoakes, Liam.Howlett


On 01/05/25 12:26 am, Nico Pache wrote:
> On Wed, Apr 30, 2025 at 4:08 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>> On 2025/4/29 02:12, Nico Pache wrote:
>>> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
>>> To add mTHP support we use this scan to instead record chunks of utilized
>>> sections of the PMD.
>>>
>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>> that represents chunks of utilized regions. We can then determine what
>>> mTHP size fits best and in the following patch, we set this bitmap while
>>> scanning the anon PMD.
>>>
>>> max_ptes_none is used as a scale to determine how "full" an order must
>>> be before being considered for collapse.
>>>
>>> When attempting to collapse an order that is set to "always", collapse
>>> to that order greedily, without considering the number of bits set.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>    include/linux/khugepaged.h |  4 ++
>>>    mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>>>    2 files changed, 89 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>>> index 1f46046080f5..18fe6eb5051d 100644
>>> --- a/include/linux/khugepaged.h
>>> +++ b/include/linux/khugepaged.h
>>> @@ -1,6 +1,10 @@
>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>    #ifndef _LINUX_KHUGEPAGED_H
>>>    #define _LINUX_KHUGEPAGED_H
>>> +#define KHUGEPAGED_MIN_MTHP_ORDER    2
>> Still better to add some comments to explain explicitly why choose 2 as
>> the MIN_MTHP_ORDER.
> OK, I'll add a note that explicitly states that the min order of anon mTHPs is 2
>>> +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
>>> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
>>> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>>>
>>>    extern unsigned int khugepaged_max_ptes_none __read_mostly;
>>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index e21998a06253..6e67db86409a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>>
>>>    static struct kmem_cache *mm_slot_cache __ro_after_init;
>>>
>>> +struct scan_bit_state {
>>> +     u8 order;
>>> +     u16 offset;
>>> +};
>>> +
>>>    struct collapse_control {
>>>        bool is_khugepaged;
>>>
>>> @@ -102,6 +107,18 @@ struct collapse_control {
>>>
>>>        /* nodemask for allocation fallback */
>>>        nodemask_t alloc_nmask;
>>> +
>>> +     /*
>>> +      * bitmap used to collapse mTHP sizes.
>>> +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
>>> +      */
>>> +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>>> +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>> +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
>>> +};
>>> +
>>> +struct collapse_control khugepaged_collapse_control = {
>>> +     .is_khugepaged = true,
>>>    };
>>>
>>>    /**
>>> @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
>>>        remove_wait_queue(&khugepaged_wait, &wait);
>>>    }
>>>
>>> -struct collapse_control khugepaged_collapse_control = {
>>> -     .is_khugepaged = true,
>>> -};
>>> -
>>>    static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>>>    {
>>>        int i;
>>> @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>>
>>>    static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                              int referenced, int unmapped,
>>> -                           struct collapse_control *cc)
>>> +                           struct collapse_control *cc, bool *mmap_locked,
>>> +                               u8 order, u16 offset)
>>>    {
>>>        LIST_HEAD(compound_pagelist);
>>>        pmd_t *pmd, _pmd;
>>> @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * The allocation can take potentially a long time if it involves
>>>         * sync compaction, and we do not need to hold the mmap_lock during
>>>         * that. We will recheck the vma after taking it again in write mode.
>>> +      * If collapsing mTHPs we may have already released the read_lock.
>>>         */
>>> -     mmap_read_unlock(mm);
>>> +     if (*mmap_locked) {
>>> +             mmap_read_unlock(mm);
>>> +             *mmap_locked = false;
>>> +     }
>>>
>>>        result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>>        if (result != SCAN_SUCCEED)
>>> @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>    out_up_write:
>>>        mmap_write_unlock(mm);
>>>    out_nolock:
>>> +     *mmap_locked = false;
>>>        if (folio)
>>>                folio_put(folio);
>>>        trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>>>        return result;
>>>    }
>>>
>>> +// Recursive function to consume the bitmap
>> Nit: please use '/* Xxxx */' for comments in this patch.
>>
>>> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
>>> +                     int referenced, int unmapped, struct collapse_control *cc,
>>> +                     bool *mmap_locked, unsigned long enabled_orders)
>>> +{
>>> +     u8 order, next_order;
>>> +     u16 offset, mid_offset;
>>> +     int num_chunks;
>>> +     int bits_set, threshold_bits;
>>> +     int top = -1;
>>> +     int collapsed = 0;
>>> +     int ret;
>>> +     struct scan_bit_state state;
>>> +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
>>> +
>>> +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>> +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
>>> +
>>> +     while (top >= 0) {
>>> +             state = cc->mthp_bitmap_stack[top--];
>>> +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
>>> +             offset = state.offset;
>>> +             num_chunks = 1 << (state.order);
>>> +             // Skip mTHP orders that are not enabled
>>> +             if (!test_bit(order, &enabled_orders))
>>> +                     goto next;
>>> +
>>> +             // copy the relavant section to a new bitmap
>>> +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
>>> +                               MTHP_BITMAP_SIZE);
>>> +
>>> +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
>>> +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
>>> +                             >> (HPAGE_PMD_ORDER - state.order);
>>> +
>>> +             //Check if the region is "almost full" based on the threshold
>>> +             if (bits_set > threshold_bits || is_pmd_only
>>> +                     || test_bit(order, &huge_anon_orders_always)) {
>> When testing this patch, I disabled the PMD-sized THP and enabled
>> 64K-sized mTHP, but it still attempts to collapse into a PMD-sized THP
>> (since bits_set > threshold_bits is true). This doesn't seem reasonable?
> We are still required to have PMD enabled for mTHP collapse to work.
> It's a limitation of the current khugepaged code (it currently only
> adds mm_slots when PMD is enabled).
> We've discussed this in the past and are looking for a proper way
> forward, but the solution becomes tricky.

Not sure if this is still a problem, but does this patch solve
it?

https://lore.kernel.org/all/20250211111326.14295-12-dev.jain@arm.com/

>
> However I'm surprised that it still collapses due to the code below.
> I'll test this out later today.
>      +             if (!test_bit(order, &enabled_orders))
>      +                     goto next;
>>> +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
>>> +                                     mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
>>> +                     if (ret == SCAN_SUCCEED) {
>>> +                             collapsed += (1 << order);
>>> +                             continue;
>>> +                     }
>>> +             }
>>> +
>>> +next:
>>> +             if (state.order > 0) {
>>> +                     next_order = state.order - 1;
>>> +                     mid_offset = offset + (num_chunks / 2);
>>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>> +                             { next_order, mid_offset };
>>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>> +                             { next_order, offset };
>>> +                     }
>>> +     }
>>> +     return collapsed;
>>> +}
>>> +
>>>    static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                                   struct vm_area_struct *vma,
>>>                                   unsigned long address, bool *mmap_locked,
>>> @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>        pte_unmap_unlock(pte, ptl);
>>>        if (result == SCAN_SUCCEED) {
>>>                result = collapse_huge_page(mm, address, referenced,
>>> -                                         unmapped, cc);
>>> -             /* collapse_huge_page will return with the mmap_lock released */
>>> -             *mmap_locked = false;
>>> +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>>>        }
>>>    out:
>>>        trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
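
For readers following along, here is a small user-space model of the 
bitmap walk quoted above (not the kernel code: collapse is stubbed out 
to always succeed, the is_pmd_only and "always" shortcuts are dropped, 
and it assumes 4K pages so HPAGE_PMD_ORDER == 9 with the patch's 
KHUGEPAGED_MIN_MTHP_ORDER == 2). With max_ptes_none == 255 the 
threshold formula works out to "more than 64 of 128 chunks" for a PMD 
collapse and "more than 2 of 4 chunks" for a 64K (order-4) region.

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)
#define MIN_MTHP_ORDER  2
#define NR_CHUNKS       (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER)) /* 128 */

static int max_ptes_none = 255;
static bool chunk_used[NR_CHUNKS];          /* one flag per 4-PTE chunk */

struct state { int order; int offset; };    /* order relative to MIN */

static int bits_set(int offset, int nr)
{
	int i, n = 0;

	for (i = 0; i < nr; i++)
		n += chunk_used[offset + i];
	return n;
}

static int scan_bitmap(unsigned long enabled_orders)
{
	struct state stack[NR_CHUNKS];
	int top = -1, collapsed = 0;

	stack[++top] = (struct state){ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };

	while (top >= 0) {
		struct state s = stack[top--];
		int order = s.order + MIN_MTHP_ORDER;
		int nr = 1 << s.order;
		int threshold = (HPAGE_PMD_NR - max_ptes_none - 1)
				>> (HPAGE_PMD_ORDER - s.order);

		if ((enabled_orders & (1UL << order)) &&
		    bits_set(s.offset, nr) > threshold) {
			/* the kernel calls collapse_huge_page() here */
			printf("collapse order %d at chunk %d (%d/%d set, threshold %d)\n",
			       order, s.offset, bits_set(s.offset, nr), nr, threshold);
			collapsed += 1 << order;
			continue;          /* do not recurse into this region */
		}
		if (s.order > 0) {         /* split in half and try smaller orders */
			stack[++top] = (struct state){ s.order - 1, s.offset + nr / 2 };
			stack[++top] = (struct state){ s.order - 1, s.offset };
		}
	}
	return collapsed;
}

int main(void)
{
	int i;

	/* only the first quarter of the PMD is populated (32 chunks = 128 PTEs) */
	for (i = 0; i < 32; i++)
		chunk_used[i] = true;

	/* 64K (order 4) and 2M (order 9) enabled */
	printf("collapsed the equivalent of %d PTEs\n",
	       scan_bitmap((1UL << 4) | (1UL << 9)));
	return 0;
}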

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-06-06 16:37       ` Dev Jain
@ 2025-06-07 12:48         ` Nico Pache
  0 siblings, 0 replies; 41+ messages in thread
From: Nico Pache @ 2025-06-07 12:48 UTC (permalink / raw)
  To: Dev Jain
  Cc: Baolin Wang, linux-mm, linux-doc, linux-kernel,
	linux-trace-kernel, akpm, corbet, rostedt, mhiramat,
	mathieu.desnoyers, david, baohua, ryan.roberts, willy, peterx,
	ziy, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
	thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
	anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
	jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
	rdunlap, lorenzo.stoakes, Liam.Howlett

On Fri, Jun 6, 2025 at 10:38 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 01/05/25 12:26 am, Nico Pache wrote:
> > On Wed, Apr 30, 2025 at 4:08 AM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >>
> >> On 2025/4/29 02:12, Nico Pache wrote:
> >>> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
> >>> To add mTHP support we use this scan to instead record chunks of utilized
> >>> sections of the PMD.
> >>>
> >>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> >>> that represents chunks of utilized regions. We can then determine what
> >>> mTHP size fits best and in the following patch, we set this bitmap while
> >>> scanning the anon PMD.
> >>>
> >>> max_ptes_none is used as a scale to determine how "full" an order must
> >>> be before being considered for collapse.
> >>>
> >>> When attempting to collapse an order that is set to "always", collapse
> >>> to that order greedily, without considering the number of bits set.
> >>>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>> ---
> >>>    include/linux/khugepaged.h |  4 ++
> >>>    mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
> >>>    2 files changed, 89 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> >>> index 1f46046080f5..18fe6eb5051d 100644
> >>> --- a/include/linux/khugepaged.h
> >>> +++ b/include/linux/khugepaged.h
> >>> @@ -1,6 +1,10 @@
> >>>    /* SPDX-License-Identifier: GPL-2.0 */
> >>>    #ifndef _LINUX_KHUGEPAGED_H
> >>>    #define _LINUX_KHUGEPAGED_H
> >>> +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> >> Still better to add some comments to explain explicitly why choose 2 as
> >> the MIN_MTHP_ORDER.
> > OK, I'll add a note that explicitly states that the min order of anon mTHPs is 2
> >>> +#define KHUGEPAGED_MIN_MTHP_NR       (1<<KHUGEPAGED_MIN_MTHP_ORDER)
> >>> +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> >>> +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
> >>>
> >>>    extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index e21998a06253..6e67db86409a 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>>
> >>>    static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>>
> >>> +struct scan_bit_state {
> >>> +     u8 order;
> >>> +     u16 offset;
> >>> +};
> >>> +
> >>>    struct collapse_control {
> >>>        bool is_khugepaged;
> >>>
> >>> @@ -102,6 +107,18 @@ struct collapse_control {
> >>>
> >>>        /* nodemask for allocation fallback */
> >>>        nodemask_t alloc_nmask;
> >>> +
> >>> +     /*
> >>> +      * bitmap used to collapse mTHP sizes.
> >>> +      * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> >>> +      */
> >>> +     DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> >>> +     DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >>> +     struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> >>> +};
> >>> +
> >>> +struct collapse_control khugepaged_collapse_control = {
> >>> +     .is_khugepaged = true,
> >>>    };
> >>>
> >>>    /**
> >>> @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
> >>>        remove_wait_queue(&khugepaged_wait, &wait);
> >>>    }
> >>>
> >>> -struct collapse_control khugepaged_collapse_control = {
> >>> -     .is_khugepaged = true,
> >>> -};
> >>> -
> >>>    static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >>>    {
> >>>        int i;
> >>> @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >>>
> >>>    static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>                              int referenced, int unmapped,
> >>> -                           struct collapse_control *cc)
> >>> +                           struct collapse_control *cc, bool *mmap_locked,
> >>> +                               u8 order, u16 offset)
> >>>    {
> >>>        LIST_HEAD(compound_pagelist);
> >>>        pmd_t *pmd, _pmd;
> >>> @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>         * The allocation can take potentially a long time if it involves
> >>>         * sync compaction, and we do not need to hold the mmap_lock during
> >>>         * that. We will recheck the vma after taking it again in write mode.
> >>> +      * If collapsing mTHPs we may have already released the read_lock.
> >>>         */
> >>> -     mmap_read_unlock(mm);
> >>> +     if (*mmap_locked) {
> >>> +             mmap_read_unlock(mm);
> >>> +             *mmap_locked = false;
> >>> +     }
> >>>
> >>>        result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >>>        if (result != SCAN_SUCCEED)
> >>> @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>    out_up_write:
> >>>        mmap_write_unlock(mm);
> >>>    out_nolock:
> >>> +     *mmap_locked = false;
> >>>        if (folio)
> >>>                folio_put(folio);
> >>>        trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >>>        return result;
> >>>    }
> >>>
> >>> +// Recursive function to consume the bitmap
> >> Nit: please use '/* Xxxx */' for comments in this patch.
> >>
> >>> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> >>> +                     int referenced, int unmapped, struct collapse_control *cc,
> >>> +                     bool *mmap_locked, unsigned long enabled_orders)
> >>> +{
> >>> +     u8 order, next_order;
> >>> +     u16 offset, mid_offset;
> >>> +     int num_chunks;
> >>> +     int bits_set, threshold_bits;
> >>> +     int top = -1;
> >>> +     int collapsed = 0;
> >>> +     int ret;
> >>> +     struct scan_bit_state state;
> >>> +     bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> >>> +
> >>> +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> >>> +             { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> >>> +
> >>> +     while (top >= 0) {
> >>> +             state = cc->mthp_bitmap_stack[top--];
> >>> +             order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> >>> +             offset = state.offset;
> >>> +             num_chunks = 1 << (state.order);
> >>> +             // Skip mTHP orders that are not enabled
> >>> +             if (!test_bit(order, &enabled_orders))
> >>> +                     goto next;
> >>> +
> >>> +             // copy the relavant section to a new bitmap
> >>> +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> >>> +                               MTHP_BITMAP_SIZE);
> >>> +
> >>> +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> >>> +             threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> >>> +                             >> (HPAGE_PMD_ORDER - state.order);
> >>> +
> >>> +             //Check if the region is "almost full" based on the threshold
> >>> +             if (bits_set > threshold_bits || is_pmd_only
> >>> +                     || test_bit(order, &huge_anon_orders_always)) {
> >> When testing this patch, I disabled the PMD-sized THP and enabled
> >> 64K-sized mTHP, but it still attempts to collapse into a PMD-sized THP
> >> (since bits_set > threshold_bits is true). This doesn't seem reasonable?
> > We are still required to have PMD enabled for mTHP collapse to work.
> > It's a limitation of the current khugepaged code (it currently only
> > adds mm_slots when PMD is enabled).
> > We've discussed this in the past and are looking for a proper way
> > forward, but the solution becomes tricky.
>
> Not sure if this is still a problem, but does this patch solve
> it?
>
> https://lore.kernel.org/all/20250211111326.14295-12-dev.jain@arm.com/

Hi Dev,

Baolin sent out a patch to do something similar to what you did here
based on my changes. I was going to keep the original behavior of
activating khugepaged only if the PMD size is enabled, and make that
change separately (outside this series), but I've gone ahead and
applied/tested Baolin's patch.

Sorry, I had forgotten you already had a solution for this.

Cheers,
-- Nico
>
> >
> > However I'm surprised that it still collapses due to the code below.
> > I'll test this out later today.
> >      +             if (!test_bit(order, &enabled_orders))
> >      +                     goto next;
> >>> +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> >>> +                                     mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> >>> +                     if (ret == SCAN_SUCCEED) {
> >>> +                             collapsed += (1 << order);
> >>> +                             continue;
> >>> +                     }
> >>> +             }
> >>> +
> >>> +next:
> >>> +             if (state.order > 0) {
> >>> +                     next_order = state.order - 1;
> >>> +                     mid_offset = offset + (num_chunks / 2);
> >>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> >>> +                             { next_order, mid_offset };
> >>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> >>> +                             { next_order, offset };
> >>> +                     }
> >>> +     }
> >>> +     return collapsed;
> >>> +}
> >>> +
> >>>    static int khugepaged_scan_pmd(struct mm_struct *mm,
> >>>                                   struct vm_area_struct *vma,
> >>>                                   unsigned long address, bool *mmap_locked,
> >>> @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >>>        pte_unmap_unlock(pte, ptl);
> >>>        if (result == SCAN_SUCCEED) {
> >>>                result = collapse_huge_page(mm, address, referenced,
> >>> -                                         unmapped, cc);
> >>> -             /* collapse_huge_page will return with the mmap_lock released */
> >>> -             *mmap_locked = false;
> >>> +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >>>        }
> >>>    out:
> >>>        trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2025-06-07 12:48 UTC | newest]

Thread overview: 41+ messages (links below jump to each message in the thread)
2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-04-29 13:41   ` Zi Yan
2025-04-30  7:47   ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-04-30  7:56   ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 04/12] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-04-28 18:12 ` [PATCH v5 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-04-30 10:08   ` Baolin Wang
2025-04-30 18:56     ` Nico Pache
2025-05-01 23:03       ` Nico Pache
2025-05-02  1:23         ` Baolin Wang
2025-06-06 16:37       ` Dev Jain
2025-06-07 12:48         ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
2025-04-30 20:51   ` Jann Horn
2025-05-01 22:29     ` Nico Pache
2025-05-02  6:29       ` David Hildenbrand
2025-05-02 12:50         ` Jann Horn
2025-05-02 15:18           ` David Hildenbrand
2025-05-02 15:24             ` Lorenzo Stoakes
2025-05-02 15:39               ` David Hildenbrand
2025-05-02 15:26             ` Jann Horn
2025-05-02 15:30               ` Nico Pache
2025-05-02 15:42                 ` David Hildenbrand
2025-05-01 12:58   ` Hugh Dickins
2025-05-01 16:15     ` Andrew Morton
2025-05-01 22:30     ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-04-30 10:09   ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-04-30 10:12   ` Baolin Wang
2025-04-30 18:43     ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-04-28 18:12 ` [PATCH v5 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-04-29  5:10   ` Bagas Sanjaya
2025-04-29 16:38   ` Christoph Lameter (Ampere)
2025-04-29 17:15     ` David Hildenbrand
