* [PATCH v10 00/13] khugepaged: mTHP support
@ 2025-08-19 13:41 Nico Pache
2025-08-19 13:41 ` [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
` (15 more replies)
0 siblings, 16 replies; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.
To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of
pages (defined by KHUGEPAGED_MIN_MTHP_ORDER) that are utilized. After the
PMD scan is done, we do binary recursion on the bitmap to find the optimal
mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
during the scan, to make sure we account for the whole PMD range. When no
mTHP size is enabled, the legacy behavior of khugepaged is maintained.
max_ptes_none will be scaled by the attempted collapse order to determine
how full an mTHP must be to be eligible for the collapse to occur. If an
mTHP collapse is attempted but contains swapped-out or shared pages, we
don't perform the collapse. It is now also possible to collapse to mTHPs
without requiring the PMD THP size to be enabled.
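For reference, a minimal sketch of how that scaling works; the series
open-codes this shift in mm/khugepaged.c, and the helper name below is
only for illustration:

	/*
	 * Scale the global max_ptes_none knob to the attempted collapse
	 * order: an order-N collapse tolerates at most this many
	 * none/zero PTEs out of the (1 << N) PTEs it covers.
	 */
	static inline int scaled_max_ptes_none(unsigned int order)
	{
		return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
	}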
With the default max_ptes_none=511, the code should keep most of its
original behavior. When enabling multiple adjacent (m)THP sizes we need to
set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
experience collapse "creep" and constantly promote mTHPs to the next
available size. This is due to the fact that a collapse will introduce at
least 2x the number of pages, and on a future scan will satisfy the
promotion condition once again.
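To make the creep concrete, a worked example assuming 4K pages
(HPAGE_PMD_NR = 512) and the order-scaled max_ptes_none sketched above:

	max_ptes_none = 511: an order-4 (16-page) collapse needs at least
	16 - (511 >> 5) = 1 present page, and an order-5 (32-page) collapse
	needs 32 - (511 >> 4) = 1. A freshly collapsed order-4 region alone
	therefore satisfies the order-5 threshold on the next scan, and so
	on up to PMD order.

	max_ptes_none = 255: order-4 needs 16 - (255 >> 5) = 9 present
	pages and order-5 needs 32 - (255 >> 4) = 17, so a newly collapsed
	order-4 region (16 pages) is not enough on its own to trigger the
	next promotion.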
Patch 1: Refactor/rename hpage_collapse
Patch 2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5: Generalize khugepaged functions for arbitrary orders
Patch 6-8: The mTHP patches
Patch 9-10: Allow khugepaged to operate without PMD enabled
Patch 11-12: Tracing/stats
Patch 13: Documentation
---------
Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
while monitoring a number of stats and tracepoints. The code is
available here[1] (Run in legacy mode for these changes and set mthp
sizes to inherit)
The summary from my testing was that there was no significant
regression noticed through this test. In some cases my changes had
better collapse latencies and were able to scan more pages in the same
amount of time/work, but for the most part the results were consistent.
- Redis testing. I tested these changes along with my defer changes
(see the follow-up post [4] for more details). We've decided to get the mTHP
changes merged first before attempting the defer series.
- some basic testing on 64k page size.
- lots of general use.
V10 Changes:
- Fixed a bug where the bitmap tracking was off by one, leading to weird behavior
in some test cases.
- Track mTHP stats for PMD order too (Baolin)
- indentation cleanup (David)
- add review/ack tags
- Improve the control flow, readability, and result handling in
collapse_scan_bitmap (Baolin)
- Indentation nits/cleanup (David)
- Converted u8 orders to unsigned int to be consistent with other folio
callers (David)
- Handled conflicts with Dev's work on PTE batching
- Changed SWAP/SHARED restriction comments to a TODO comment (David)
- Squashed main mTHP patch and the introduce bitmap patch (David)
- Other small nits
V9 Changes: [3]
- Drop madvise_collapse support [2]. Further discussion needed.
- Add documentation entries for new stats (Baolin)
- Fix missing stat update (MTHP_STAT_COLLAPSE_EXCEED_SWAP) that was
accidentally dropped in v7 (Baolin)
- Fix mishandled conflict noted in v8 (merged into wrong commit)
- change rename from khugepaged to collapse (Dev)
V8 Changes:
- Fix mishandled conflict with shmem config changes (Baolin)
- Add Baolin's patches for allowing collapse without PMD enabled
- Add additional patch for allowing madvise_collapse without PMD enabled
- Documentation nits (Randy)
- Simplify SCAN_ANY_PROCESS lock jumbling (Liam)
- Add a BUG_ON to the mTHP collapse similar to PMD (Dev)
- Remove doc comment about khugepaged PMD only limitation (Dev)
- Change revalidation function to accept multiple orders
- Handled conflicts introduced by Lorenzo's madvise changes
V7 (RESEND)
V6 Changes:
- Don't release the anon_vma_lock early (like in the PMD case), as not all
pages are isolated.
- Define the PTE as NULL to avoid an uninitialized condition
- minor nits and newline cleanup
- make sure to unmap and unlock the pte for the swapin case
- change the revalidation to always check the PMD order (as this will make
sure that no other VMA spans it)
V5 Changes:
- switched the order of patches 1 and 2
- fixed some edge cases on the unified madvise_collapse and khugepaged
- Explained the "creep" some more in the docs
- fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
- fix potential highmem issue caused by an early unmap of the PTE
V4 Changes:
- Rebased onto mm-unstable
- small changes to Documentation
V3 Changes:
- corrected legacy behavior for khugepaged and madvise_collapse
- added proper mTHP stat tracking
- Minor changes to prevent a nested lock on non-split-lock arches
- Took Dev's version of alloc_charge_folio as it has the proper stats
- Skip cases where trying to collapse to a lower order would still fail
- Fixed cases where the bitmap was not being updated properly
- Moved Documentation update to this series instead of the defer set
- Minor bugs discovered during testing and review
- Minor "nit" cleanup
V2 Changes:
- Minor bug fixes discovered during review and testing
- removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k page size.
- Updated trace events to include collapsing order info.
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
- No longer require a chunk to be fully utilized before setting the bit.
Use the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps
prevent some of the "creep" that was discovered in v1.
A big thanks to everyone who has reviewed, tested, and participated in
the development process. It's been a great experience working with all of
you on this long endeavour.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/all/23b8ad10-cd1f-45df-a25c-78d01c8af44f@redhat.com/
[3] - https://lore.kernel.org/lkml/20250714003207.113275-1-npache@redhat.com/
[4] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
Baolin Wang (2):
khugepaged: enable collapsing mTHPs even when PMD THPs are disabled
khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
Dev Jain (1):
khugepaged: generalize alloc_charge_folio()
Nico Pache (10):
khugepaged: rename hpage_collapse_* to collapse_*
introduce collapse_single_pmd to unify khugepaged and madvise_collapse
khugepaged: generalize hugepage_vma_revalidate for mTHP support
khugepaged: generalize __collapse_huge_page_* for mTHP support
khugepaged: add mTHP support
khugepaged: skip collapsing mTHP to smaller orders
khugepaged: avoid unnecessary mTHP collapse attempts
khugepaged: improve tracepoints for mTHP orders
khugepaged: add per-order mTHP khugepaged stats
Documentation: mm: update the admin guide for mTHP collapse
Documentation/admin-guide/mm/transhuge.rst | 44 +-
include/linux/huge_mm.h | 5 +
include/linux/khugepaged.h | 4 +
include/trace/events/huge_memory.h | 34 +-
mm/huge_memory.c | 11 +
mm/khugepaged.c | 552 +++++++++++++++------
6 files changed, 468 insertions(+), 182 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_*
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-20 10:42 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
` (14 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
The hpage_collapse_* functions are used by both madvise_collapse and
khugepaged. Remove the unnecessary hpage prefix to shorten the function
names.
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 73 ++++++++++++++++++++++++-------------------------
1 file changed, 36 insertions(+), 37 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3d4f116e14b..0e7bbadf03ee 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
kmem_cache_destroy(mm_slot_cache);
}
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int collapse_test_exit(struct mm_struct *mm)
{
return atomic_read(&mm->mm_users) == 0;
}
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
{
- return hpage_collapse_test_exit(mm) ||
+ return collapse_test_exit(mm) ||
mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
}
@@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
int wakeup;
/* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+ VM_BUG_ON_MM(collapse_test_exit(mm), mm);
if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
return;
@@ -502,7 +502,7 @@ void __khugepaged_exit(struct mm_struct *mm)
} else if (mm_slot) {
/*
* This is required to serialize against
- * hpage_collapse_test_exit() (which is guaranteed to run
+ * collapse_test_exit() (which is guaranteed to run
* under mmap sem read mode). Stop here (after we return all
* pagetables will be destroyed) until khugepaged has finished
* working on the pagetables under the mmap_lock.
@@ -592,7 +592,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
- /* See hpage_collapse_scan_pmd(). */
+ /* See collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
@@ -848,7 +848,7 @@ struct collapse_control khugepaged_collapse_control = {
.is_khugepaged = true,
};
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool collapse_scan_abort(int nid, struct collapse_control *cc)
{
int i;
@@ -883,7 +883,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
}
#ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
{
int nid, target_node = 0, max_value = 0;
@@ -902,7 +902,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
return target_node;
}
#else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
{
return 0;
}
@@ -923,7 +923,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
TVA_FORCED_COLLAPSE;
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(collapse_test_exit_or_disable(mm)))
return SCAN_ANY_PROCESS;
*vmap = vma = find_vma(mm, address);
@@ -996,7 +996,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
/*
* Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if collapse_scan_pmd believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1082,7 +1082,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
- int node = hpage_collapse_find_target_node(cc);
+ int node = collapse_find_target_node(cc);
struct folio *folio;
folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1268,10 +1268,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
return result;
}
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, bool *mmap_locked,
- struct collapse_control *cc)
+static int collapse_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, bool *mmap_locked,
+ struct collapse_control *cc)
{
pmd_t *pmd;
pte_t *pte, *_pte;
@@ -1382,7 +1382,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
+ if (collapse_scan_abort(node, cc)) {
result = SCAN_SCAN_ABORT;
goto out_unmap;
}
@@ -1451,7 +1451,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit(mm)) {
+ if (collapse_test_exit(mm)) {
/* free mm_slot */
hash_del(&slot->hash);
list_del(&slot->mm_node);
@@ -1753,7 +1753,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
continue;
- if (hpage_collapse_test_exit(mm))
+ if (collapse_test_exit(mm))
continue;
/*
* When a vma is registered with uffd-wp, we cannot recycle
@@ -2275,9 +2275,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
- struct file *file, pgoff_t start,
- struct collapse_control *cc)
+static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+ struct file *file, pgoff_t start,
+ struct collapse_control *cc)
{
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
@@ -2332,7 +2332,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
}
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
+ if (collapse_scan_abort(node, cc)) {
result = SCAN_SCAN_ABORT;
folio_put(folio);
break;
@@ -2382,7 +2382,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
+static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
struct collapse_control *cc)
__releases(&khugepaged_mm_lock)
__acquires(&khugepaged_mm_lock)
@@ -2420,7 +2420,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
goto breakouterloop_mmap_lock;
progress++;
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(collapse_test_exit_or_disable(mm)))
goto breakouterloop;
vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2428,7 +2428,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
unsigned long hstart, hend;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ if (unlikely(collapse_test_exit_or_disable(mm))) {
progress++;
break;
}
@@ -2449,7 +2449,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
bool mmap_locked = true;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(collapse_test_exit_or_disable(mm)))
goto breakouterloop;
VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2462,12 +2462,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
mmap_read_unlock(mm);
mmap_locked = false;
- *result = hpage_collapse_scan_file(mm,
+ *result = collapse_scan_file(mm,
khugepaged_scan.address, file, pgoff, cc);
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
- if (hpage_collapse_test_exit_or_disable(mm))
+ if (collapse_test_exit_or_disable(mm))
goto breakouterloop;
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
@@ -2476,7 +2476,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
mmap_read_unlock(mm);
}
} else {
- *result = hpage_collapse_scan_pmd(mm, vma,
+ *result = collapse_scan_pmd(mm, vma,
khugepaged_scan.address, &mmap_locked, cc);
}
@@ -2509,7 +2509,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
* Release the current mm_slot if this mm is about to die, or
* if we scanned all vmas of this mm.
*/
- if (hpage_collapse_test_exit(mm) || !vma) {
+ if (collapse_test_exit(mm) || !vma) {
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
@@ -2563,8 +2563,8 @@ static void khugepaged_do_scan(struct collapse_control *cc)
pass_through_head++;
if (khugepaged_has_work() &&
pass_through_head < 2)
- progress += khugepaged_scan_mm_slot(pages - progress,
- &result, cc);
+ progress += collapse_scan_mm_slot(pages - progress,
+ &result, cc);
else
progress = pages;
spin_unlock(&khugepaged_mm_lock);
@@ -2805,12 +2805,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_read_unlock(mm);
mmap_locked = false;
- result = hpage_collapse_scan_file(mm, addr, file, pgoff,
- cc);
+ result = collapse_scan_file(mm, addr, file, pgoff, cc);
fput(file);
} else {
- result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
+ result = collapse_scan_pmd(mm, vma, addr,
+ &mmap_locked, cc);
}
if (!mmap_locked)
*lock_dropped = true;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
2025-08-19 13:41 ` [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-20 11:21 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
` (13 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.
Create collapse_single_pmd to increase code reuse and provide a single
entry point for these two users.
Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. This introduces a minor behavioral change
that most likely fixes an undiscovered bug: the current implementation of
khugepaged tests collapse_test_exit_or_disable before calling
collapse_pte_mapped_thp, but we weren't doing so in the madvise_collapse
case. By unifying these two callers, madvise_collapse now also performs
this check.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 94 ++++++++++++++++++++++++++-----------------------
1 file changed, 49 insertions(+), 45 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0e7bbadf03ee..b7b98aebb670 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2382,6 +2382,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
return result;
}
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int collapse_single_pmd(unsigned long addr,
+ struct vm_area_struct *vma, bool *mmap_locked,
+ struct collapse_control *cc)
+{
+ int result = SCAN_FAIL;
+ struct mm_struct *mm = vma->vm_mm;
+
+ if (!vma_is_anonymous(vma)) {
+ struct file *file = get_file(vma->vm_file);
+ pgoff_t pgoff = linear_page_index(vma, addr);
+
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ result = collapse_scan_file(mm, addr, file, pgoff, cc);
+ fput(file);
+ if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+ mmap_read_lock(mm);
+ *mmap_locked = true;
+ if (collapse_test_exit_or_disable(mm)) {
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ result = SCAN_ANY_PROCESS;
+ goto end;
+ }
+ result = collapse_pte_mapped_thp(mm, addr,
+ !cc->is_khugepaged);
+ if (result == SCAN_PMD_MAPPED)
+ result = SCAN_SUCCEED;
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ }
+ } else {
+ result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
+ }
+ if (cc->is_khugepaged && result == SCAN_SUCCEED)
+ ++khugepaged_pages_collapsed;
+end:
+ return result;
+}
+
static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
struct collapse_control *cc)
__releases(&khugepaged_mm_lock)
@@ -2455,34 +2499,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
- if (!vma_is_anonymous(vma)) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma,
- khugepaged_scan.address);
-
- mmap_read_unlock(mm);
- mmap_locked = false;
- *result = collapse_scan_file(mm,
- khugepaged_scan.address, file, pgoff, cc);
- fput(file);
- if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
- mmap_read_lock(mm);
- if (collapse_test_exit_or_disable(mm))
- goto breakouterloop;
- *result = collapse_pte_mapped_thp(mm,
- khugepaged_scan.address, false);
- if (*result == SCAN_PMD_MAPPED)
- *result = SCAN_SUCCEED;
- mmap_read_unlock(mm);
- }
- } else {
- *result = collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
- }
-
- if (*result == SCAN_SUCCEED)
- ++khugepaged_pages_collapsed;
+ *result = collapse_single_pmd(khugepaged_scan.address,
+ vma, &mmap_locked, cc);
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
progress += HPAGE_PMD_NR;
@@ -2799,34 +2818,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_assert_locked(mm);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
- if (!vma_is_anonymous(vma)) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma, addr);
- mmap_read_unlock(mm);
- mmap_locked = false;
- result = collapse_scan_file(mm, addr, file, pgoff, cc);
- fput(file);
- } else {
- result = collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
- }
+ result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
if (!mmap_locked)
*lock_dropped = true;
-handle_result:
switch (result) {
case SCAN_SUCCEED:
case SCAN_PMD_MAPPED:
++thps;
break;
- case SCAN_PTE_MAPPED_HUGEPAGE:
- BUG_ON(mmap_locked);
- mmap_read_lock(mm);
- result = collapse_pte_mapped_thp(mm, addr, true);
- mmap_read_unlock(mm);
- goto handle_result;
/* Whitelisted set of results where continuing OK */
+ case SCAN_PTE_MAPPED_HUGEPAGE:
case SCAN_PMD_NULL:
case SCAN_PTE_NON_PRESENT:
case SCAN_PTE_UFFD_WP:
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
2025-08-19 13:41 ` [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2025-08-19 13:41 ` [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-20 13:23 ` Lorenzo Stoakes
2025-08-24 1:37 ` Wei Yang
2025-08-19 13:41 ` [PATCH v10 04/13] khugepaged: generalize alloc_charge_folio() Nico Pache
` (12 subsequent siblings)
15 siblings, 2 replies; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
For khugepaged to support different mTHP orders, we must generalize
hugepage_vma_revalidate() to check that the PMD range is not shared by
another VMA and that the requested orders are enabled.
To ensure madvise_collapse can support working on mTHP orders without the
PMD order enabled, we need to convert hugepage_vma_revalidate to take a
bitmap of orders.
No functional change in this patch.
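For reference, the resulting call patterns; the BIT(order) form is how
the mTHP collapse path later in this series invokes it, while PMD-only
callers keep passing BIT(HPAGE_PMD_ORDER):

	/* PMD collapse and madvise_collapse: behaviour unchanged */
	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
					 BIT(HPAGE_PMD_ORDER));

	/* mTHP collapse (added later in this series): a single order */
	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
					 BIT(order));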
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7b98aebb670..2d192ec961d2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -917,7 +917,7 @@ static int collapse_find_target_node(struct collapse_control *cc)
static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
bool expect_anon,
struct vm_area_struct **vmap,
- struct collapse_control *cc)
+ struct collapse_control *cc, unsigned long orders)
{
struct vm_area_struct *vma;
enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
@@ -930,9 +930,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!vma)
return SCAN_VMA_NULL;
+ /* Always check the PMD order to ensure it's not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+ BIT(HPAGE_PMD_ORDER));
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+ BIT(HPAGE_PMD_ORDER));
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_read_lock(mm);
mmap_locked = true;
result = hugepage_vma_revalidate(mm, addr, false, &vma,
- cc);
+ cc, BIT(HPAGE_PMD_ORDER));
if (result != SCAN_SUCCEED) {
last_fail = result;
goto out_nolock;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 04/13] khugepaged: generalize alloc_charge_folio()
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (2 preceding siblings ...)
2025-08-19 13:41 ` [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-20 13:28 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
` (11 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
From: Dev Jain <dev.jain@arm.com>
Pass order to alloc_charge_folio() and update mTHP statistics.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Documentation/admin-guide/mm/transhuge.rst | 8 ++++++++
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/khugepaged.c | 17 +++++++++++------
4 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index a16a04841b96..7ccb93e22852 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -630,6 +630,14 @@ anon_fault_fallback_charge
instead falls back to using huge pages with lower orders or
small pages even though the allocation was successful.
+collapse_alloc
+ is incremented every time a huge page is successfully allocated for a
+ khugepaged collapse.
+
+collapse_alloc_failed
+ is incremented every time a huge page allocation fails during a
+ khugepaged collapse.
+
zswpout
is incremented every time a huge page is swapped out to zswap in one
piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1ac0d06fb3c1..4ada5d1f7297 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,6 +128,8 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_COLLAPSE_ALLOC,
+ MTHP_STAT_COLLAPSE_ALLOC_FAILED,
MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPIN,
MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aac5f0a2cb54..20d005c2c61f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -621,6 +621,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -686,6 +688,8 @@ static struct attribute *any_stats_attrs[] = {
#endif
&split_attr.attr,
&split_failed_attr.attr,
+ &collapse_alloc_attr.attr,
+ &collapse_alloc_failed_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2d192ec961d2..77e0d8ee59a0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1079,21 +1079,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
- struct collapse_control *cc)
+ struct collapse_control *cc, unsigned int order)
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
int node = collapse_find_target_node(cc);
struct folio *folio;
- folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+ folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
if (!folio) {
*foliop = NULL;
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
return SCAN_ALLOC_HUGE_PAGE_FAIL;
}
- count_vm_event(THP_COLLAPSE_ALLOC);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
folio_put(folio);
*foliop = NULL;
@@ -1130,7 +1135,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*/
mmap_read_unlock(mm);
- result = alloc_charge_folio(&folio, mm, cc);
+ result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
@@ -1863,7 +1868,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
- result = alloc_charge_folio(&new_folio, mm, cc);
+ result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (3 preceding siblings ...)
2025-08-19 13:41 ` [PATCH v10 04/13] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-20 14:22 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 06/13] khugepaged: add " Nico Pache
` (10 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
Generalize the order of the __collapse_huge_page_* functions
to support future mTHP collapse.
mTHP collapse can suffer from inconsistent behavior and memory waste
"creep", so disable swap-in and shared-page support for mTHP collapse.
No functional changes in this patch.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 62 ++++++++++++++++++++++++++++++++++---------------
1 file changed, 43 insertions(+), 19 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 77e0d8ee59a0..074101d03c9d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -551,15 +551,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long address,
pte_t *pte,
struct collapse_control *cc,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ unsigned int order)
{
struct page *page = NULL;
struct folio *folio = NULL;
pte_t *_pte;
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
bool writable = false;
+ int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + (1 << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (pte_none(pteval) || (pte_present(pteval) &&
@@ -567,7 +569,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= scaled_max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -595,8 +597,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
/* See collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
- if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ /*
+ * TODO: Support shared pages without leading to further
+ * mTHP collapses. Currently bringing in new pages via
+ * shared may cause a future higher order collapse on a
+ * rescan of the same range.
+ */
+ if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
+ shared > khugepaged_max_ptes_shared)) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out;
@@ -697,15 +705,16 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
struct vm_area_struct *vma,
unsigned long address,
spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ unsigned int order)
{
- unsigned long end = address + HPAGE_PMD_SIZE;
+ unsigned long end = address + (PAGE_SIZE << order);
struct folio *src, *tmp;
pte_t pteval;
pte_t *_pte;
unsigned int nr_ptes;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+ for (_pte = pte; _pte < pte + (1 << order); _pte += nr_ptes,
address += nr_ptes * PAGE_SIZE) {
nr_ptes = 1;
pteval = ptep_get(_pte);
@@ -761,7 +770,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
pmd_t *pmd,
pmd_t orig_pmd,
struct vm_area_struct *vma,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ unsigned int order)
{
spinlock_t *pmd_ptl;
@@ -778,7 +788,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
* Release both raw and compound pages isolated
* in __collapse_huge_page_isolate.
*/
- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+ release_pte_pages(pte, pte + (1 << order), compound_pagelist);
}
/*
@@ -799,7 +809,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
unsigned long address, spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, unsigned int order)
{
unsigned int i;
int result = SCAN_SUCCEED;
@@ -807,7 +817,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
- for (i = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0; i < (1 << order); i++) {
pte_t pteval = ptep_get(pte + i);
struct page *page = folio_page(folio, i);
unsigned long src_addr = address + i * PAGE_SIZE;
@@ -826,10 +836,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
if (likely(result == SCAN_SUCCEED))
__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
- compound_pagelist);
+ compound_pagelist, order);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
- compound_pagelist);
+ compound_pagelist, order);
return result;
}
@@ -1005,11 +1015,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
- int referenced)
+ int referenced, unsigned int order)
{
int swapped_in = 0;
vm_fault_t ret = 0;
- unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ unsigned long address, end = haddr + (PAGE_SIZE << order);
int result;
pte_t *pte = NULL;
spinlock_t *ptl;
@@ -1040,6 +1050,19 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
if (!is_swap_pte(vmf.orig_pte))
continue;
+ /*
+ * TODO: Support swapin without leading to further mTHP
+ * collapses. Currently bringing in new pages via swapin may
+ * cause a future higher order collapse on a rescan of the same
+ * range.
+ */
+ if (order != HPAGE_PMD_ORDER) {
+ pte_unmap(pte);
+ mmap_read_unlock(mm);
+ result = SCAN_EXCEED_SWAP_PTE;
+ goto out;
+ }
+
vmf.pte = pte;
vmf.ptl = ptl;
ret = do_swap_page(&vmf);
@@ -1160,7 +1183,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* that case. Continuing to collapse causes inconsistency.
*/
result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced);
+ referenced, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
@@ -1208,7 +1231,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
+ &compound_pagelist,
+ HPAGE_PMD_ORDER);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
@@ -1238,7 +1262,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, address, pte_ptl,
- &compound_pagelist);
+ &compound_pagelist, HPAGE_PMD_ORDER);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
goto out_up_write;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 06/13] khugepaged: add mTHP support
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (4 preceding siblings ...)
2025-08-19 13:41 ` [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-20 18:29 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
` (9 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
mTHPs are enabled we remove the restriction of max_ptes_none during the
scan phase so we don't bail out early and miss potential mTHP candidates.
A new function collapse_scan_bitmap is used to perform binary recursion on
the bitmap and determine the best eligible order for the collapse.
A stack struct is used instead of traditional recursion. max_ptes_none
will be scaled by the attempted collapse order to determine how "full" an
order must be before being considered for collapse.
Once we determine which mTHP sizes fit best in that PMD range, a collapse
is attempted. A minimum collapse order of 2 is used as this is the lowest
order supported by anon memory.
For orders configured with "always", we perform greedy collapsing
to that order without considering bit density.
If an mTHP collapse is attempted, but contains swapped-out or shared
pages, we don't perform the collapse. This is because adding new entries
can lead to new none pages, and these may lead to constant promotion into
a higher order (m)THP. A similar issue can occur with "max_ptes_none >
HPAGE_PMD_NR/2" due to the fact that a collapse will introduce at least 2x
the number of pages, and on a future scan will satisfy the promotion
condition once again.
For non-PMD collapse we must leave the anon VMA write locked until after
we collapse the mTHP: in the PMD case all the pages are isolated, but in
the non-PMD case this is not true, and we must keep the lock to prevent
changes to the VMA from occurring.
Currently madvise_collapse is not supported and will only attempt PMD
collapse.
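As a worked example of the threshold used while consuming the bitmap
(assuming 4K pages, so HPAGE_PMD_NR = 512 and, with
KHUGEPAGED_MIN_MTHP_ORDER = 2, 128 bitmap bits per PMD range, one per
4-PTE chunk; max_ptes_none = 255 for the numbers below):

	threshold_bits(order) = (512 - 255 - 1)
		>> (HPAGE_PMD_ORDER - (order - KHUGEPAGED_MIN_MTHP_ORDER))

	order 9 (2M):   256 >> 2 = 64 -> needs more than 64 of 128 chunks set
	order 6 (256K): 256 >> 5 = 8  -> needs more than  8 of  16 chunks set
	order 4 (64K):  256 >> 7 = 2  -> needs more than  2 of   4 chunks set

A region that fails its threshold (and is neither the only enabled order
nor an "always" order) is split in half and both halves are pushed onto
the stack to be examined at the next lower order; orders that are not
enabled are skipped and split further.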
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/linux/khugepaged.h | 4 +
mm/khugepaged.c | 236 +++++++++++++++++++++++++++++--------
2 files changed, 188 insertions(+), 52 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..d12cdb9ef3ba 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,6 +1,10 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_KHUGEPAGED_H
#define _LINUX_KHUGEPAGED_H
+#define KHUGEPAGED_MIN_MTHP_ORDER 2
+#define KHUGEPAGED_MIN_MTHP_NR (1 << KHUGEPAGED_MIN_MTHP_ORDER)
+#define MAX_MTHP_BITMAP_SIZE (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+#define MTHP_BITMAP_SIZE (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
#include <linux/mm.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 074101d03c9d..1ad7e00d3fd6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
+struct scan_bit_state {
+ u8 order;
+ u16 offset;
+};
+
struct collapse_control {
bool is_khugepaged;
@@ -102,6 +107,18 @@ struct collapse_control {
/* nodemask for allocation fallback */
nodemask_t alloc_nmask;
+
+ /*
+ * bitmap used to collapse mTHP sizes.
+ * 1 bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
+ */
+ DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+ DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
+ struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
+};
+
+struct collapse_control khugepaged_collapse_control = {
+ .is_khugepaged = true,
};
/**
@@ -854,10 +871,6 @@ static void khugepaged_alloc_sleep(void)
remove_wait_queue(&khugepaged_wait, &wait);
}
-struct collapse_control khugepaged_collapse_control = {
- .is_khugepaged = true,
-};
-
static bool collapse_scan_abort(int nid, struct collapse_control *cc)
{
int i;
@@ -1136,17 +1149,19 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
int referenced, int unmapped,
- struct collapse_control *cc)
+ struct collapse_control *cc, bool *mmap_locked,
+ unsigned int order, unsigned long offset)
{
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
- pte_t *pte;
+ pte_t *pte = NULL, mthp_pte;
pgtable_t pgtable;
struct folio *folio;
spinlock_t *pmd_ptl, *pte_ptl;
int result = SCAN_FAIL;
struct vm_area_struct *vma;
struct mmu_notifier_range range;
+ unsigned long _address = address + offset * PAGE_SIZE;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -1155,16 +1170,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* The allocation can take potentially a long time if it involves
* sync compaction, and we do not need to hold the mmap_lock during
* that. We will recheck the vma after taking it again in write mode.
+ * If collapsing mTHPs we may have already released the read_lock.
*/
- mmap_read_unlock(mm);
+ if (*mmap_locked) {
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ }
- result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+ result = alloc_charge_folio(&folio, mm, cc, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
- BIT(HPAGE_PMD_ORDER));
+ *mmap_locked = true;
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1182,13 +1201,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* released when it fails. So we jump out_nolock directly in
* that case. Continuing to collapse causes inconsistency.
*/
- result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced, HPAGE_PMD_ORDER);
+ result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+ referenced, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
mmap_read_unlock(mm);
+ *mmap_locked = false;
/*
* Prevent all access to pagetables with the exception of
* gup_fast later handled by the ptep_clear_flush and the VM
@@ -1198,8 +1218,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
- BIT(HPAGE_PMD_ORDER));
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -1210,11 +1229,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
anon_vma_lock_write(vma->anon_vma);
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
- address + HPAGE_PMD_SIZE);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+ _address + (PAGE_SIZE << order));
mmu_notifier_invalidate_range_start(&range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
/*
* This removes any huge TLB entry from the CPU so we won't allow
* huge and small TLB entries for the same virtual address to
@@ -1228,19 +1248,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();
- pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
if (pte) {
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist,
- HPAGE_PMD_ORDER);
+ result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+ &compound_pagelist, order);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
}
if (unlikely(result != SCAN_SUCCEED)) {
- if (pte)
- pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1255,17 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
}
/*
- * All pages are isolated and locked so anon_vma rmap
- * can't run anymore.
+ * For PMD collapse all pages are isolated and locked so anon_vma
+ * rmap can't run anymore
*/
- anon_vma_unlock_write(vma->anon_vma);
+ if (order == HPAGE_PMD_ORDER)
+ anon_vma_unlock_write(vma->anon_vma);
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
- vma, address, pte_ptl,
- &compound_pagelist, HPAGE_PMD_ORDER);
- pte_unmap(pte);
+ vma, _address, pte_ptl,
+ &compound_pagelist, order);
if (unlikely(result != SCAN_SUCCEED))
- goto out_up_write;
+ goto out_unlock_anon_vma;
/*
* The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1273,33 +1290,115 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* write.
*/
__folio_mark_uptodate(folio);
- pgtable = pmd_pgtable(_pmd);
-
- _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
- _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
- folio_add_lru_vma(folio, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, address, pmd, _pmd);
- update_mmu_cache_pmd(vma, address, pmd);
- deferred_split_folio(folio, false);
- spin_unlock(pmd_ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ pgtable = pmd_pgtable(_pmd);
+ _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
+ spin_unlock(pmd_ptl);
+ } else { /* mTHP collapse */
+ mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+ mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ folio_ref_add(folio, (1 << order) - 1);
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+ update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ spin_unlock(pmd_ptl);
+ }
folio = NULL;
result = SCAN_SUCCEED;
+out_unlock_anon_vma:
+ if (order != HPAGE_PMD_ORDER)
+ anon_vma_unlock_write(vma->anon_vma);
out_up_write:
+ if (pte)
+ pte_unmap(pte);
mmap_write_unlock(mm);
out_nolock:
+ *mmap_locked = false;
if (folio)
folio_put(folio);
trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
return result;
}
+/* Recursive function to consume the bitmap */
+static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
+ int referenced, int unmapped, struct collapse_control *cc,
+ bool *mmap_locked, unsigned long enabled_orders)
+{
+ u8 order, next_order;
+ u16 offset, mid_offset;
+ int num_chunks;
+ int bits_set, threshold_bits;
+ int top = -1;
+ int collapsed = 0;
+ int ret;
+ struct scan_bit_state state;
+ bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
+ cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+ { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
+
+ while (top >= 0) {
+ state = cc->mthp_bitmap_stack[top--];
+ order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
+ offset = state.offset;
+ num_chunks = 1 << (state.order);
+ /* Skip mTHP orders that are not enabled */
+ if (!test_bit(order, &enabled_orders))
+ goto next_order;
+
+ /* Copy the relevant section to a new bitmap */
+ bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
+ MTHP_BITMAP_SIZE);
+
+ bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
+ threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
+ >> (HPAGE_PMD_ORDER - state.order);
+
+ /* Check if the region is "almost full" based on the threshold */
+ if (bits_set > threshold_bits || is_pmd_only
+ || test_bit(order, &huge_anon_orders_always)) {
+ ret = collapse_huge_page(mm, address, referenced, unmapped,
+ cc, mmap_locked, order,
+ offset * KHUGEPAGED_MIN_MTHP_NR);
+ if (ret == SCAN_SUCCEED) {
+ collapsed += (1 << order);
+ continue;
+ }
+ }
+
+next_order:
+ if (state.order > 0) {
+ next_order = state.order - 1;
+ mid_offset = offset + (num_chunks / 2);
+ cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+ { next_order, mid_offset };
+ cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+ { next_order, offset };
+ }
+ }
+ return collapsed;
+}
+
static int collapse_scan_pmd(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, bool *mmap_locked,
@@ -1307,31 +1406,60 @@ static int collapse_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
+ int i;
int result = SCAN_FAIL, referenced = 0;
int none_or_zero = 0, shared = 0;
struct page *page = NULL;
struct folio *folio = NULL;
unsigned long _address;
+ unsigned long enabled_orders;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
+ bool is_pmd_only;
bool writable = false;
-
+ int chunk_none_count = 0;
+ int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
+ unsigned long tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, address, &pmd);
if (result != SCAN_SUCCEED)
goto out;
+ bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+ bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
+ if (cc->is_khugepaged)
+ enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ tva_flags, THP_ORDERS_ALL_ANON);
+ else
+ enabled_orders = BIT(HPAGE_PMD_ORDER);
+
+ is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte) {
result = SCAN_PMD_NULL;
goto out;
}
- for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
- _pte++, _address += PAGE_SIZE) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ /*
+ * We are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. If
+ * there are utilized pages in this chunk, keep track of it in the
+ * bitmap for mTHP collapsing.
+ */
+ if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
+ if (i > 0 && chunk_none_count <= scaled_none)
+ bitmap_set(cc->mthp_bitmap,
+ (i - 1) / KHUGEPAGED_MIN_MTHP_NR, 1);
+ chunk_none_count = 0;
+ }
+
+ _pte = pte + i;
+ _address = address + i * PAGE_SIZE;
pte_t pteval = ptep_get(_pte);
if (is_swap_pte(pteval)) {
++unmapped;
@@ -1354,10 +1482,11 @@ static int collapse_scan_pmd(struct mm_struct *mm,
}
}
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ ++chunk_none_count;
++none_or_zero;
if (!userfaultfd_armed(vma) &&
- (!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ (!cc->is_khugepaged || !is_pmd_only ||
+ none_or_zero <= khugepaged_max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -1453,6 +1582,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
address)))
referenced++;
}
+
if (!writable) {
result = SCAN_PAGE_RO;
} else if (cc->is_khugepaged &&
@@ -1465,10 +1595,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
- result = collapse_huge_page(mm, address, referenced,
- unmapped, cc);
- /* collapse_huge_page will return with the mmap_lock released */
- *mmap_locked = false;
+ result = collapse_scan_bitmap(mm, address, referenced, unmapped, cc,
+ mmap_locked, enabled_orders);
+ if (result > 0)
+ result = SCAN_SUCCEED;
+ else
+ result = SCAN_FAIL;
}
out:
trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
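To make the collapse_scan_bitmap() hunk above easier to follow, here is a
minimal user-space sketch of just the traversal it performs. It is not kernel
code: HPAGE_PMD_ORDER = 9 (4K base pages) and a minimum mTHP order of 2 are
assumed purely for illustration, and every collapse attempt is pretended to
fail so the region is always split. The real function additionally consults
the populated-chunk bitmap, the enabled-orders mask and the max_ptes_none
threshold before each attempt.

#include <stdio.h>

#define PMD_ORDER_DEMO		9	/* assumed: 2M PMD with 4K base pages */
#define MIN_MTHP_ORDER_DEMO	2	/* assumed: smallest tracked chunk is 4 pages */

struct demo_state {
	int order;	/* order relative to MIN_MTHP_ORDER_DEMO */
	int offset;	/* offset in MIN_MTHP_ORDER_DEMO-sized chunks */
};

int main(void)
{
	struct demo_state stack[16];	/* deeper than this walk can ever need */
	int top = -1;

	/* Start with the whole PMD range, exactly like the kernel code. */
	stack[++top] = (struct demo_state){ PMD_ORDER_DEMO - MIN_MTHP_ORDER_DEMO, 0 };

	while (top >= 0) {
		struct demo_state s = stack[top--];
		int order = s.order + MIN_MTHP_ORDER_DEMO;
		int num_chunks = 1 << s.order;

		printf("attempt order %d at chunk offset %d\n", order, s.offset);

		/* Attempt "failed": push both halves, left half on top. */
		if (s.order > 0) {
			stack[++top] = (struct demo_state){ s.order - 1, s.offset + num_chunks / 2 };
			stack[++top] = (struct demo_state){ s.order - 1, s.offset };
		}
	}
	return 0;
}

Running it prints order 9 at offset 0, then order 8 at offset 0, and so on
down to order 2, before moving on to the right-hand halves: a depth-first,
left-first walk over all candidate mTHP regions within one PMD.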
* [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (5 preceding siblings ...)
2025-08-19 13:41 ` [PATCH v10 06/13] khugepaged: add " Nico Pache
@ 2025-08-19 13:41 ` Nico Pache
2025-08-21 12:05 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 08/13] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
` (8 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:41 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
if it's ok to collapse to a smaller mTHP size (like in the case of a
partially mapped folio).
This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
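To make the guarded situation concrete (the orders here are illustrative, not
taken from the patch): suppose the PTEs being isolated are backed by an
existing order-4 (64K) folio while only an order-2 (16K) collapse is being
attempted. A simplified sketch of what the new check refuses to do:

	/*
	 * Hypothetical case: folio_order(folio) == 4 but order == 2.
	 * Collapsing would copy 4 of the folio's 16 pages into a new
	 * folio and unmap them from the old one, leaving the old folio
	 * partially mapped, so bail out instead.
	 */
	if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order)
		return SCAN_PTE_MAPPED_HUGEPAGE;	/* the patch uses goto out */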
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ad7e00d3fd6..6a4cf7e4a7cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -611,6 +611,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ /*
+ * TODO: In some cases of partially-mapped folios, we'd actually
+ * want to collapse.
+ */
+ if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+ result = SCAN_PTE_MAPPED_HUGEPAGE;
+ goto out;
+ }
+
/* See collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 08/13] khugepaged: avoid unnecessary mTHP collapse attempts
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (6 preceding siblings ...)
2025-08-19 13:41 ` [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-08-19 13:42 ` Nico Pache
2025-08-20 10:38 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 09/13] khugepaged: enable collapsing mTHPs even when PMD THPs are disabled Nico Pache
` (7 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:42 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.
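The switch statement below buckets each scan result into one of three
dispositions; roughly (the enum is only an illustration, not part of the
patch):

	enum mthp_collapse_next_step {
		NEXT_REGION,	/* e.g. SCAN_SUCCEED, SCAN_PAGE_RO: done with this
				 * region, move to the next offset at this order */
		LOWER_ORDER,	/* e.g. SCAN_EXCEED_NONE_PTE, SCAN_ALLOC_HUGE_PAGE_FAIL:
				 * a smaller order in the same region may still work */
		STOP_SCANNING,	/* everything else, e.g. the mm is going away */
	};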
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6a4cf7e4a7cc..7d9b5100bea1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1389,10 +1389,39 @@ static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
ret = collapse_huge_page(mm, address, referenced, unmapped,
cc, mmap_locked, order,
offset * KHUGEPAGED_MIN_MTHP_NR);
- if (ret == SCAN_SUCCEED) {
+
+ /*
+ * Analyze failure reason to determine next action:
+ * - goto next_order: try smaller orders in same region
+ * - continue: try other regions at same order
+ * - break: stop all attempts (system-wide failure)
+ */
+ switch (ret) {
+ /* Cases where we should continue to the next region */
+ case SCAN_SUCCEED:
collapsed += (1 << order);
+ case SCAN_PAGE_RO:
+ case SCAN_PTE_MAPPED_HUGEPAGE:
continue;
+ /* Cases where lower orders might still succeed */
+ case SCAN_LACK_REFERENCED_PAGE:
+ case SCAN_EXCEED_NONE_PTE:
+ case SCAN_EXCEED_SWAP_PTE:
+ case SCAN_EXCEED_SHARED_PTE:
+ case SCAN_PAGE_LOCK:
+ case SCAN_PAGE_COUNT:
+ case SCAN_PAGE_LRU:
+ case SCAN_PAGE_NULL:
+ case SCAN_DEL_PAGE_LRU:
+ case SCAN_PTE_NON_PRESENT:
+ case SCAN_PTE_UFFD_WP:
+ case SCAN_ALLOC_HUGE_PAGE_FAIL:
+ goto next_order;
+ /* All other cases should stop collapse attempts */
+ default:
+ break;
}
+ break;
}
next_order:
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 09/13] khugepaged: enable collapsing mTHPs even when PMD THPs are disabled
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (7 preceding siblings ...)
2025-08-19 13:42 ` [PATCH v10 08/13] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-08-19 13:42 ` Nico Pache
2025-08-21 13:35 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
` (6 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:42 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
From: Baolin Wang <baolin.wang@linux.alibaba.com>
We have now allowed mTHP collapse, but thp_vma_allowable_order() still only
checks if the PMD-sized mTHP is allowed to collapse. This prevents scanning
and collapsing of 64K mTHP when only 64K mTHP is enabled. Thus, we should
modify the checks to allow all large orders of anonymous mTHP.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7d9b5100bea1..2cadd07341de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -491,7 +491,11 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+ unsigned long orders = vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
+
+ if (thp_vma_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED,
+ orders))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -2671,6 +2675,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
vma_iter_init(&vmi, mm, khugepaged_scan.address);
for_each_vma(vmi, vma) {
+ unsigned long orders = vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
unsigned long hstart, hend;
cond_resched();
@@ -2678,7 +2684,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_KHUGEPAGED, orders)) {
skip:
progress++;
continue;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (8 preceding siblings ...)
2025-08-19 13:42 ` [PATCH v10 09/13] khugepaged: enable collapsing mTHPs even when PMD THPs are disabled Nico Pache
@ 2025-08-19 13:42 ` Nico Pache
2025-08-21 14:18 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 11/13] khugepaged: improve tracepoints for mTHP orders Nico Pache
` (5 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:42 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
From: Baolin Wang <baolin.wang@linux.alibaba.com>
When only non-PMD-sized mTHP is enabled (such as only 64K mTHP enabled),
we should also allow kicking khugepaged to attempt scanning and collapsing
64K mTHP. Modify hugepage_pmd_enabled() to support mTHP collapse, and
while we are at it, rename it to make the function name clearer.
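As an illustrative configuration (not from the patch, and leaving the shmem
and file-backed special cases aside): with the top-level
/sys/kernel/mm/transparent_hugepage/enabled set to "never" but
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled set to "always",
huge_anon_orders_always has the 64K order bit set, so the reworked
hugepage_enabled() returns true and khugepaged can be started, whereas the
old test_bit(PMD_ORDER, ...) checks would all have failed.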
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2cadd07341de..81d2ffd56ab9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -430,7 +430,7 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
}
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
@@ -442,11 +442,11 @@ static bool hugepage_pmd_enabled(void)
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+ if (READ_ONCE(huge_anon_orders_always))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+ if (READ_ONCE(huge_anon_orders_madvise))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ if (READ_ONCE(huge_anon_orders_inherit) &&
hugepage_global_enabled())
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -490,7 +490,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
- hugepage_pmd_enabled()) {
+ hugepage_enabled()) {
unsigned long orders = vma_is_anonymous(vma) ?
THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
@@ -2762,7 +2762,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
}
static int khugepaged_wait_event(void)
@@ -2835,7 +2835,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (hugepage_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2866,7 +2866,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!hugepage_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2916,7 +2916,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (hugepage_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2942,7 +2942,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (hugepage_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 11/13] khugepaged: improve tracepoints for mTHP orders
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (9 preceding siblings ...)
2025-08-19 13:42 ` [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
@ 2025-08-19 13:42 ` Nico Pache
2025-08-21 14:24 ` Lorenzo Stoakes
2025-08-19 14:16 ` [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats Nico Pache
` (4 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 13:42 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
Add the order to the tracepoints to give better insight into which order
khugepaged is operating on.
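With the new field, a line from the mm_collapse_huge_page tracepoint would
look roughly like the following (the mm pointer and the values are made up;
the format string and the SCAN_STATUS symbolic names come from the code):

	mm_collapse_huge_page: mm=00000000abcd1234, isolated=1, status=succeeded order=4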
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
mm/khugepaged.c | 10 +++++----
2 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 2305df6cb485..56aa8c3b011b 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -92,34 +92,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
TRACE_EVENT(mm_collapse_huge_page,
- TP_PROTO(struct mm_struct *mm, int isolated, int status),
+ TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
- TP_ARGS(mm, isolated, status),
+ TP_ARGS(mm, isolated, status, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, isolated)
__field(int, status)
+ __field(unsigned int, order)
),
TP_fast_assign(
__entry->mm = mm;
__entry->isolated = isolated;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("mm=%p, isolated=%d, status=%s",
+ TP_printk("mm=%p, isolated=%d, status=%s order=%u",
__entry->mm,
__entry->isolated,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_isolate,
TP_PROTO(struct folio *folio, int none_or_zero,
- int referenced, bool writable, int status),
+ int referenced, bool writable, int status, unsigned int order),
- TP_ARGS(folio, none_or_zero, referenced, writable, status),
+ TP_ARGS(folio, none_or_zero, referenced, writable, status, order),
TP_STRUCT__entry(
__field(unsigned long, pfn)
@@ -127,6 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__field(int, referenced)
__field(bool, writable)
__field(int, status)
+ __field(unsigned int, order)
),
TP_fast_assign(
@@ -135,27 +139,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__entry->referenced = referenced;
__entry->writable = writable;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
+ TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s order=%u",
__entry->pfn,
__entry->none_or_zero,
__entry->referenced,
__entry->writable,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_swapin,
- TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+ TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+ unsigned int order),
- TP_ARGS(mm, swapped_in, referenced, ret),
+ TP_ARGS(mm, swapped_in, referenced, ret, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, swapped_in)
__field(int, referenced)
__field(int, ret)
+ __field(unsigned int, order)
),
TP_fast_assign(
@@ -163,13 +171,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
__entry->swapped_in = swapped_in;
__entry->referenced = referenced;
__entry->ret = ret;
+ __entry->order = order;
),
- TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+ TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
__entry->mm,
__entry->swapped_in,
__entry->referenced,
- __entry->ret)
+ __entry->ret,
+ __entry->order)
);
TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 81d2ffd56ab9..c13bc583a368 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -721,13 +721,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
} else {
result = SCAN_SUCCEED;
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, writable, result);
+ referenced, writable, result,
+ order);
return result;
}
out:
release_pte_pages(pte, _pte, compound_pagelist);
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, writable, result);
+ referenced, writable, result, order);
return result;
}
@@ -1123,7 +1124,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
result = SCAN_SUCCEED;
out:
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+ order);
return result;
}
@@ -1348,7 +1350,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*mmap_locked = false;
if (folio)
folio_put(folio);
- trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+ trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
return result;
}
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (10 preceding siblings ...)
2025-08-19 13:42 ` [PATCH v10 11/13] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-08-19 14:16 ` Nico Pache
2025-08-21 14:47 ` Lorenzo Stoakes
2025-08-19 14:17 ` [PATCH v10 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
` (3 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 14:16 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
With mTHP support in place, let's add per-order mTHP stats for exceeding
the NONE, SWAP, and SHARED limits.
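Like the other per-order counters in anon_stats_attrs, these should surface
under each mTHP size directory; assuming the existing mTHP stats layout, for
the 64K size they would be read from:

	/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_none_pte
	/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_swap_pte
	/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_shared_pte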
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 7 +++++++
mm/khugepaged.c | 16 +++++++++++++---
4 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7ccb93e22852..b85547ac4fe9 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -705,6 +705,23 @@ nr_anon_partially_mapped
an anonymous THP as "partially mapped" and count it here, even though it
is not actually partially mapped anymore.
+collapse_exceed_swap_pte
+ The number of anonymous THP collapse attempts which contained at least
+ one swap PTE. Currently khugepaged does not support collapsing mTHP
+ regions that contain a swap PTE.
+
+collapse_exceed_none_pte
+ The number of anonymous THP collapse attempts which exceeded the none
+ PTE threshold. With mTHP collapse, a bitmap is used to gather the state
+ of a PMD region and is then recursively checked from largest to smallest
+ order against the scaled max_ptes_none count. This counter indicates
+ that the next enabled order will be checked.
+
+collapse_exceed_shared_pte
+ The number of anonymous THP collapse attempts which contained at least
+ one shared PTE. Currently khugepaged does not support collapsing mTHP
+ regions that contain a shared PTE.
+
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4ada5d1f7297..6f1593d0b4b5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
MTHP_STAT_SPLIT_DEFERRED,
MTHP_STAT_NR_ANON,
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+ MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+ MTHP_STAT_COLLAPSE_EXCEED_NONE,
+ MTHP_STAT_COLLAPSE_EXCEED_SHARED,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 20d005c2c61f..9f0470c3e983 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -639,6 +639,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
static struct attribute *anon_stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -655,6 +659,9 @@ static struct attribute *anon_stats_attrs[] = {
&split_deferred_attr.attr,
&nr_anon_attr.attr,
&nr_anon_partially_mapped_attr.attr,
+ &collapse_exceed_swap_pte_attr.attr,
+ &collapse_exceed_none_pte_attr.attr,
+ &collapse_exceed_shared_pte_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c13bc583a368..5a3386043f39 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -594,7 +594,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
goto out;
}
}
@@ -633,10 +635,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
* shared may cause a future higher order collapse on a
* rescan of the same range.
*/
- if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared)) {
+ if (order != HPAGE_PMD_ORDER) {
+ result = SCAN_EXCEED_SHARED_PTE;
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+ goto out;
+ }
+
+ if (cc->is_khugepaged &&
+ shared > khugepaged_max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
goto out;
}
}
@@ -1084,6 +1093,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
* range.
*/
if (order != HPAGE_PMD_ORDER) {
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
pte_unmap(pte);
mmap_read_unlock(mm);
result = SCAN_EXCEED_SWAP_PTE;
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v10 13/13] Documentation: mm: update the admin guide for mTHP collapse
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (11 preceding siblings ...)
2025-08-19 14:16 ` [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats Nico Pache
@ 2025-08-19 14:17 ` Nico Pache
2025-08-21 15:03 ` Lorenzo Stoakes
2025-08-19 21:55 ` [PATCH v10 00/13] khugepaged: mTHP support Andrew Morton
` (2 subsequent siblings)
15 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-19 14:17 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd,
Bagas Sanjaya
Now that we can collapse to mTHPs, let's update the admin guide to reflect
these changes and provide proper guidance on how to utilize it.
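A worked example of the max_ptes_none scaling discussed in the hunk below may
help. This is a sketch only: it assumes 4K base pages (HPAGE_PMD_ORDER = 9,
HPAGE_PMD_NR = 512) and simply mirrors the shift the collapse code applies
per order.

	/* Illustrative helper, not part of the patch. */
	unsigned int scaled_max_ptes_none(unsigned int max_ptes_none, unsigned int order)
	{
		return max_ptes_none >> (9 /* HPAGE_PMD_ORDER */ - order);
	}

	/*
	 * max_ptes_none = 511: order 4 (64K) -> 15, so one populated page
	 * qualifies a 16-page region; after collapsing, the 32-page order-5
	 * region (threshold 31) qualifies automatically on rescan ("creep").
	 * max_ptes_none = 255: order 4 -> 7, order 5 -> 15, so a freshly
	 * collapsed 64K region (16 populated + 16 empty PTEs) no longer
	 * qualifies for the next order by itself.
	 */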
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index b85547ac4fe9..1f9e6a32052c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -149,6 +149,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.
+khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size
+to determine collapses. When using mTHPs it's recommended to set
+max_ptes_none low, ideally less than HPAGE_PMD_NR / 2 (255 with a 4K page
+size). This will prevent undesired "creep" behavior that leads to
+continuously collapsing to the largest mTHP size; when we collapse, we are
+bringing in new non-zero pages that will, on a subsequent scan, cause the
+max_ptes_none check of the +1 order to always be satisfied. By limiting
+max_ptes_none to less than half the PTEs of an order, we make sure we don't
+cause this feedback loop. max_ptes_shared and max_ptes_swap have no effect
+when collapsing to a mTHP, and mTHP collapse will fail on shared or swapped
+out pages.
+
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular
@@ -264,11 +276,6 @@ support the following arguments::
Khugepaged controls
-------------------
-.. note::
- khugepaged currently only searches for opportunities to collapse to
- PMD-sized THP and no attempt is made to collapse to other THP
- sizes.
-
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
--
2.50.1
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (12 preceding siblings ...)
2025-08-19 14:17 ` [PATCH v10 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-08-19 21:55 ` Andrew Morton
2025-08-20 15:55 ` Nico Pache
2025-08-21 15:01 ` Lorenzo Stoakes
2025-09-01 16:21 ` David Hildenbrand
15 siblings, 1 reply; 75+ messages in thread
From: Andrew Morton @ 2025-08-19 21:55 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, baohua,
willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap, hughd
On Tue, 19 Aug 2025 07:41:52 -0600 Nico Pache <npache@redhat.com> wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
>
> ...
>
> - I created a test script that I used to push khugepaged to its limits
> while monitoring a number of stats and tracepoints. The code is
> available here[1] (Run in legacy mode for these changes and set mthp
> sizes to inherit)
Could this be turned into something in tools/testing/selftests/mm/?
> V10 Changes:
I'll add this to mm-new, thanks. I'll suppress the usual emails.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 08/13] khugepaged: avoid unnecessary mTHP collapse attempts
2025-08-19 13:42 ` [PATCH v10 08/13] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2025-08-20 10:38 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 10:38 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:42:00AM -0600, Nico Pache wrote:
> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 31 ++++++++++++++++++++++++++++++-
> 1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6a4cf7e4a7cc..7d9b5100bea1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1389,10 +1389,39 @@ static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> ret = collapse_huge_page(mm, address, referenced, unmapped,
> cc, mmap_locked, order,
> offset * KHUGEPAGED_MIN_MTHP_NR);
> - if (ret == SCAN_SUCCEED) {
> +
> + /*
> + * Analyze failure reason to determine next action:
> + * - goto next_order: try smaller orders in same region
> + * - continue: try other regions at same order
> + * - break: stop all attempts (system-wide failure)
> + */
> + switch (ret) {
> + /* Cases where we should continue to the next region */
> + case SCAN_SUCCEED:
> collapsed += (1 << order);
Yeah, as the bot noticed (and clang locally), this needs a break or fallthrough.
> + case SCAN_PAGE_RO:
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> continue;
> + /* Cases where lower orders might still succeed */
> + case SCAN_LACK_REFERENCED_PAGE:
> + case SCAN_EXCEED_NONE_PTE:
> + case SCAN_EXCEED_SWAP_PTE:
> + case SCAN_EXCEED_SHARED_PTE:
> + case SCAN_PAGE_LOCK:
> + case SCAN_PAGE_COUNT:
> + case SCAN_PAGE_LRU:
> + case SCAN_PAGE_NULL:
> + case SCAN_DEL_PAGE_LRU:
> + case SCAN_PTE_NON_PRESENT:
> + case SCAN_PTE_UFFD_WP:
> + case SCAN_ALLOC_HUGE_PAGE_FAIL:
> + goto next_order;
> + /* All other cases should stop collapse attempts */
> + default:
> + break;
Wouldn't it be better to not have this so the compiler asserts that you
have all cases listed here?
> }
> + break;
> }
>
> next_order:
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_*
2025-08-19 13:41 ` [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2025-08-20 10:42 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 10:42 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:53AM -0600, Nico Pache wrote:
> The hpage_collapse functions describe functions used by madvise_collapse
> and khugepaged. remove the unnecessary hpage prefix to shorten the
> function name.
>
Not a big deal, but you missed a comment in mm/mremap.c:
* Now new_pte is none, so hpage_collapse_scan_file() path can not find
In move_ptes().
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
Apart from nit above LGTM so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/khugepaged.c | 73 ++++++++++++++++++++++++-------------------------
> 1 file changed, 36 insertions(+), 37 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d3d4f116e14b..0e7bbadf03ee 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
> kmem_cache_destroy(mm_slot_cache);
> }
>
> -static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> +static inline int collapse_test_exit(struct mm_struct *mm)
> {
> return atomic_read(&mm->mm_users) == 0;
> }
>
> -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> +static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
> {
> - return hpage_collapse_test_exit(mm) ||
> + return collapse_test_exit(mm) ||
> mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
> }
>
> @@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
> int wakeup;
>
> /* __khugepaged_exit() must not run from under us */
> - VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> + VM_BUG_ON_MM(collapse_test_exit(mm), mm);
> if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
> return;
>
> @@ -502,7 +502,7 @@ void __khugepaged_exit(struct mm_struct *mm)
> } else if (mm_slot) {
> /*
> * This is required to serialize against
> - * hpage_collapse_test_exit() (which is guaranteed to run
> + * collapse_test_exit() (which is guaranteed to run
> * under mmap sem read mode). Stop here (after we return all
> * pagetables will be destroyed) until khugepaged has finished
> * working on the pagetables under the mmap_lock.
> @@ -592,7 +592,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> - /* See hpage_collapse_scan_pmd(). */
> + /* See collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> if (cc->is_khugepaged &&
> @@ -848,7 +848,7 @@ struct collapse_control khugepaged_collapse_control = {
> .is_khugepaged = true,
> };
>
> -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> +static bool collapse_scan_abort(int nid, struct collapse_control *cc)
> {
> int i;
>
> @@ -883,7 +883,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> }
>
> #ifdef CONFIG_NUMA
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int collapse_find_target_node(struct collapse_control *cc)
> {
> int nid, target_node = 0, max_value = 0;
>
> @@ -902,7 +902,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> return target_node;
> }
> #else
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int collapse_find_target_node(struct collapse_control *cc)
> {
> return 0;
> }
> @@ -923,7 +923,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> TVA_FORCED_COLLAPSE;
>
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> + if (unlikely(collapse_test_exit_or_disable(mm)))
> return SCAN_ANY_PROCESS;
>
> *vmap = vma = find_vma(mm, address);
> @@ -996,7 +996,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>
> /*
> * Bring missing pages in from swap, to complete THP collapse.
> - * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> *
> * Called and returns without pte mapped or spinlocks held.
> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> @@ -1082,7 +1082,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> {
> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> GFP_TRANSHUGE);
> - int node = hpage_collapse_find_target_node(cc);
> + int node = collapse_find_target_node(cc);
> struct folio *folio;
>
> folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> @@ -1268,10 +1268,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> return result;
> }
>
> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - unsigned long address, bool *mmap_locked,
> - struct collapse_control *cc)
> +static int collapse_scan_pmd(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address, bool *mmap_locked,
> + struct collapse_control *cc)
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> @@ -1382,7 +1382,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> * hit record.
> */
> node = folio_nid(folio);
> - if (hpage_collapse_scan_abort(node, cc)) {
> + if (collapse_scan_abort(node, cc)) {
> result = SCAN_SCAN_ABORT;
> goto out_unmap;
> }
> @@ -1451,7 +1451,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>
> lockdep_assert_held(&khugepaged_mm_lock);
>
> - if (hpage_collapse_test_exit(mm)) {
> + if (collapse_test_exit(mm)) {
> /* free mm_slot */
> hash_del(&slot->hash);
> list_del(&slot->mm_node);
> @@ -1753,7 +1753,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> continue;
>
> - if (hpage_collapse_test_exit(mm))
> + if (collapse_test_exit(mm))
> continue;
> /*
> * When a vma is registered with uffd-wp, we cannot recycle
> @@ -2275,9 +2275,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> - struct file *file, pgoff_t start,
> - struct collapse_control *cc)
> +static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> + struct file *file, pgoff_t start,
> + struct collapse_control *cc)
> {
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> @@ -2332,7 +2332,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> }
>
> node = folio_nid(folio);
> - if (hpage_collapse_scan_abort(node, cc)) {
> + if (collapse_scan_abort(node, cc)) {
> result = SCAN_SCAN_ABORT;
> folio_put(folio);
> break;
> @@ -2382,7 +2382,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> -static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> +static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> struct collapse_control *cc)
> __releases(&khugepaged_mm_lock)
> __acquires(&khugepaged_mm_lock)
> @@ -2420,7 +2420,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> goto breakouterloop_mmap_lock;
>
> progress++;
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> + if (unlikely(collapse_test_exit_or_disable(mm)))
> goto breakouterloop;
>
> vma_iter_init(&vmi, mm, khugepaged_scan.address);
> @@ -2428,7 +2428,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> unsigned long hstart, hend;
>
> cond_resched();
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> + if (unlikely(collapse_test_exit_or_disable(mm))) {
> progress++;
> break;
> }
> @@ -2449,7 +2449,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> bool mmap_locked = true;
>
> cond_resched();
> - if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> + if (unlikely(collapse_test_exit_or_disable(mm)))
> goto breakouterloop;
>
> VM_BUG_ON(khugepaged_scan.address < hstart ||
> @@ -2462,12 +2462,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>
> mmap_read_unlock(mm);
> mmap_locked = false;
> - *result = hpage_collapse_scan_file(mm,
> + *result = collapse_scan_file(mm,
> khugepaged_scan.address, file, pgoff, cc);
> fput(file);
> if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> mmap_read_lock(mm);
> - if (hpage_collapse_test_exit_or_disable(mm))
> + if (collapse_test_exit_or_disable(mm))
> goto breakouterloop;
> *result = collapse_pte_mapped_thp(mm,
> khugepaged_scan.address, false);
> @@ -2476,7 +2476,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> mmap_read_unlock(mm);
> }
> } else {
> - *result = hpage_collapse_scan_pmd(mm, vma,
> + *result = collapse_scan_pmd(mm, vma,
> khugepaged_scan.address, &mmap_locked, cc);
> }
>
> @@ -2509,7 +2509,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * Release the current mm_slot if this mm is about to die, or
> * if we scanned all vmas of this mm.
> */
> - if (hpage_collapse_test_exit(mm) || !vma) {
> + if (collapse_test_exit(mm) || !vma) {
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
> @@ -2563,8 +2563,8 @@ static void khugepaged_do_scan(struct collapse_control *cc)
> pass_through_head++;
> if (khugepaged_has_work() &&
> pass_through_head < 2)
> - progress += khugepaged_scan_mm_slot(pages - progress,
> - &result, cc);
> + progress += collapse_scan_mm_slot(pages - progress,
> + &result, cc);
> else
> progress = pages;
> spin_unlock(&khugepaged_mm_lock);
> @@ -2805,12 +2805,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
> mmap_read_unlock(mm);
> mmap_locked = false;
> - result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> - cc);
> + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> fput(file);
> } else {
> - result = hpage_collapse_scan_pmd(mm, vma, addr,
> - &mmap_locked, cc);
> + result = collapse_scan_pmd(mm, vma, addr,
> + &mmap_locked, cc);
> }
> if (!mmap_locked)
> *lock_dropped = true;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2025-08-19 13:41 ` [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2025-08-20 11:21 ` Lorenzo Stoakes
2025-08-20 16:35 ` Nico Pache
0 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 11:21 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:54AM -0600, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
>
> Create collapse_single_pmd to increase code reuse and create an entry
> point to these two users.
>
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that is most likely an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 94 ++++++++++++++++++++++++++-----------------------
> 1 file changed, 49 insertions(+), 45 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0e7bbadf03ee..b7b98aebb670 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2382,6 +2382,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int collapse_single_pmd(unsigned long addr,
> + struct vm_area_struct *vma, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + int result = SCAN_FAIL;
You assign result in all branches, so this can be uninitialised.
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if (!vma_is_anonymous(vma)) {
> + struct file *file = get_file(vma->vm_file);
> + pgoff_t pgoff = linear_page_index(vma, addr);
> +
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> + fput(file);
> + if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> + mmap_read_lock(mm);
> + *mmap_locked = true;
> + if (collapse_test_exit_or_disable(mm)) {
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + result = SCAN_ANY_PROCESS;
> + goto end;
Don't love that in e.g. collapse_scan_mm_slot() we are using the mmap lock being
disabled as in effect an error code.
Is SCAN_ANY_PROCESS correct here? Because in collapse_scan_mm_slot() you'd
previously:
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
if (collapse_test_exit_or_disable(mm))
goto breakouterloop;
...
}
But now you're setting result = SCAN_ANY_PROCESS rather than
SCAN_PTE_MAPPED_HUGEPAGE in this instance?
You don't mention that you're changing this, or at least explicitly enough,
the commit message should state that you're changing this and explain why
it's ok.
This whole file is horrid, and it's kinda an aside, but I really wish we
had some comment going through each of the scan_result cases and explaining
what each one meant.
Also I think:
return SCAN_ANY_PROCESS;
Is better than:
result = SCAN_ANY_PROCESS;
goto end;
...
end:
return result;
> + }
> + result = collapse_pte_mapped_thp(mm, addr,
> + !cc->is_khugepaged);
Hm another change here, in the original code in collapse_scan_mm_slot()
this is:
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
Presumably collapse_scan_mm_slot() is only ever invoked with
cc->is_khugepaged?
Maybe worth adding a VM_WARN_ON_ONCE(!cc->is_khugepaged) at the top of
collapse_scan_mm_slot() to assert this (and other places where your change
assumes this to be the case).
> + if (result == SCAN_PMD_MAPPED)
> + result = SCAN_SUCCEED;
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + }
> + } else {
> + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> + }
> + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> + ++khugepaged_pages_collapsed;
Similarly, presumably because collapse_scan_mm_slot() is only ever invoked in
the khugepaged case, this didn't have the cc->is_khugepaged check?
> +end:
> + return result;
> +}
There's a LOT of nesting going on here, I think we can simplify this a
lot. If we make the change I noted above re: returning SCAN_ANY_PROCESS, we
can move the end label up a bit and avoid a ton of nesting, e.g.:
static int collapse_single_pmd(unsigned long addr,
struct vm_area_struct *vma, bool *mmap_locked,
struct collapse_control *cc)
{
struct mm_struct *mm = vma->vm_mm;
struct file *file;
pgoff_t pgoff;
int result;
if (vma_is_anonymous(vma)) {
result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
goto end;
}
file = get_file(vma->vm_file);
pgoff = linear_page_index(vma, addr);
mmap_read_unlock(mm);
*mmap_locked = false;
result = collapse_scan_file(mm, addr, file, pgoff, cc);
fput(file);
if (result != SCAN_PTE_MAPPED_HUGEPAGE)
goto end;
mmap_read_lock(mm);
*mmap_locked = true;
if (collapse_test_exit_or_disable(mm)) {
mmap_read_unlock(mm);
*mmap_locked = false;
return SCAN_ANY_PROCESS;
}
result = collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
if (result == SCAN_PMD_MAPPED)
result = SCAN_SUCCEED;
mmap_read_unlock(mm);
*mmap_locked = false;
end:
if (cc->is_khugepaged && result == SCAN_SUCCEED)
++khugepaged_pages_collapsed;
return result;
}
(untested, thrown together so do double check!)
> +
> static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> struct collapse_control *cc)
> __releases(&khugepaged_mm_lock)
> @@ -2455,34 +2499,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> VM_BUG_ON(khugepaged_scan.address < hstart ||
> khugepaged_scan.address + HPAGE_PMD_SIZE >
> hend);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma,
> - khugepaged_scan.address);
> -
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> - *result = collapse_scan_file(mm,
> - khugepaged_scan.address, file, pgoff, cc);
> - fput(file);
> - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> - mmap_read_lock(mm);
> - if (collapse_test_exit_or_disable(mm))
> - goto breakouterloop;
> - *result = collapse_pte_mapped_thp(mm,
> - khugepaged_scan.address, false);
> - if (*result == SCAN_PMD_MAPPED)
> - *result = SCAN_SUCCEED;
> - mmap_read_unlock(mm);
> - }
> - } else {
> - *result = collapse_scan_pmd(mm, vma,
> - khugepaged_scan.address, &mmap_locked, cc);
> - }
> -
> - if (*result == SCAN_SUCCEED)
> - ++khugepaged_pages_collapsed;
>
> + *result = collapse_single_pmd(khugepaged_scan.address,
> + vma, &mmap_locked, cc);
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> progress += HPAGE_PMD_NR;
> @@ -2799,34 +2818,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> mmap_assert_locked(mm);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
>
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> - fput(file);
> - } else {
> - result = collapse_scan_pmd(mm, vma, addr,
> - &mmap_locked, cc);
> - }
> + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> +
Ack the fact you noted the behaviour change re:
collapse_test_exit_or_disable() that seems fine.
> if (!mmap_locked)
> *lock_dropped = true;
>
> -handle_result:
> switch (result) {
> case SCAN_SUCCEED:
> case SCAN_PMD_MAPPED:
> ++thps;
> break;
> - case SCAN_PTE_MAPPED_HUGEPAGE:
> - BUG_ON(mmap_locked);
> - mmap_read_lock(mm);
> - result = collapse_pte_mapped_thp(mm, addr, true);
> - mmap_read_unlock(mm);
> - goto handle_result;
One thing that differs with the new code here is we filter SCAN_PMD_MAPPED to
SCAN_SUCCEED.
I was about to say 'but ++thps - is this correct' but now I realise this
was looping back on itself with a goto to do just that... ugh ye gads.
Anwyay that's fine because it doesn't change anything.
Re: switch statement in general, again would be good to always have each
scan possibility in switch statements, but perhaps given so many not
practical :)
(that way the compiler warns on missing a newly added enum val)
> /* Whitelisted set of results where continuing OK */
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> case SCAN_PMD_NULL:
> case SCAN_PTE_NON_PRESENT:
> case SCAN_PTE_UFFD_WP:
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-19 13:41 ` [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-08-20 13:23 ` Lorenzo Stoakes
2025-08-20 15:40 ` Nico Pache
2025-08-24 1:37 ` Wei Yang
1 sibling, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 13:23 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:55AM -0600, Nico Pache wrote:
> For khugepaged to support different mTHP orders, we must generalize this
> to check if the PMD is not shared by another VMA and the order is enabled.
>
> To ensure madvise_collapse can support working on mTHP orders without the
> PMD order enabled, we need to convert hugepage_vma_revalidate to take a
> bitmap of orders.
>
> No functional change in this patch.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
LGTM (modulo nit/query below) so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/khugepaged.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b7b98aebb670..2d192ec961d2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -917,7 +917,7 @@ static int collapse_find_target_node(struct collapse_control *cc)
> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> bool expect_anon,
> struct vm_area_struct **vmap,
> - struct collapse_control *cc)
> + struct collapse_control *cc, unsigned long orders)
> {
> struct vm_area_struct *vma;
> enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> @@ -930,9 +930,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> if (!vma)
> return SCAN_VMA_NULL;
>
> + /* Always check the PMD order to insure its not shared by another VMA */
NIT: ensure not insure.
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> return SCAN_ADDRESS_RANGE;
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
> @@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> + BIT(HPAGE_PMD_ORDER));
Shouldn't this be PMD order? Seems equivalent.
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> + BIT(HPAGE_PMD_ORDER));
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> mmap_read_lock(mm);
> mmap_locked = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
> - cc);
> + cc, BIT(HPAGE_PMD_ORDER));
> if (result != SCAN_SUCCEED) {
> last_fail = result;
> goto out_nolock;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 04/13] khugepaged: generalize alloc_charge_folio()
2025-08-19 13:41 ` [PATCH v10 04/13] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2025-08-20 13:28 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 13:28 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:56AM -0600, Nico Pache wrote:
> From: Dev Jain <dev.jain@arm.com>
>
> Pass order to alloc_charge_folio() and update mTHP statistics.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
LGTM so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 8 ++++++++
> include/linux/huge_mm.h | 2 ++
> mm/huge_memory.c | 4 ++++
> mm/khugepaged.c | 17 +++++++++++------
> 4 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index a16a04841b96..7ccb93e22852 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -630,6 +630,14 @@ anon_fault_fallback_charge
> instead falls back to using huge pages with lower orders or
> small pages even though the allocation was successful.
>
> +collapse_alloc
> + is incremented every time a huge page is successfully allocated for a
> + khugepaged collapse.
> +
> +collapse_alloc_failed
> + is incremented every time a huge page allocation fails during a
> + khugepaged collapse.
> +
> zswpout
> is incremented every time a huge page is swapped out to zswap in one
> piece without splitting.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1ac0d06fb3c1..4ada5d1f7297 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -128,6 +128,8 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_ALLOC,
> MTHP_STAT_ANON_FAULT_FALLBACK,
> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> + MTHP_STAT_COLLAPSE_ALLOC,
> + MTHP_STAT_COLLAPSE_ALLOC_FAILED,
> MTHP_STAT_ZSWPOUT,
> MTHP_STAT_SWPIN,
> MTHP_STAT_SWPIN_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aac5f0a2cb54..20d005c2c61f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -621,6 +621,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
> DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
> DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
> @@ -686,6 +688,8 @@ static struct attribute *any_stats_attrs[] = {
> #endif
> &split_attr.attr,
> &split_failed_attr.attr,
> + &collapse_alloc_attr.attr,
> + &collapse_alloc_failed_attr.attr,
> NULL,
> };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2d192ec961d2..77e0d8ee59a0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1079,21 +1079,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> }
>
> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> - struct collapse_control *cc)
> + struct collapse_control *cc, unsigned int order)
> {
> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> GFP_TRANSHUGE);
> int node = collapse_find_target_node(cc);
> struct folio *folio;
>
> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> if (!folio) {
> *foliop = NULL;
> - count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> + if (order == HPAGE_PMD_ORDER)
> + count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> return SCAN_ALLOC_HUGE_PAGE_FAIL;
> }
>
> - count_vm_event(THP_COLLAPSE_ALLOC);
> + if (order == HPAGE_PMD_ORDER)
> + count_vm_event(THP_COLLAPSE_ALLOC);
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
> +
> if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
> folio_put(folio);
> *foliop = NULL;
> @@ -1130,7 +1135,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> */
> mmap_read_unlock(mm);
>
> - result = alloc_charge_folio(&folio, mm, cc);
> + result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> @@ -1863,7 +1868,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>
> - result = alloc_charge_folio(&new_folio, mm, cc);
> + result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out;
>
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support
2025-08-19 13:41 ` [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2025-08-20 14:22 ` Lorenzo Stoakes
2025-09-01 16:15 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 14:22 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:57AM -0600, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
>
> mTHP collapse can suffer from inconsistent behavior, and memory waste
> "creep". Disable swapin and shared support for mTHP collapse.
>
> No functional changes in this patch.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 62 ++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 43 insertions(+), 19 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 77e0d8ee59a0..074101d03c9d 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -551,15 +551,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long address,
> pte_t *pte,
> struct collapse_control *cc,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist,
> + unsigned int order)
I think it's better if we keep the output var as the last in the order. It's a
bit weird to have the order specified here.
> {
> struct page *page = NULL;
> struct folio *folio = NULL;
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> bool writable = false;
> + int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
This is a weird formulation, I guess we have to go with it to keep things
consistent-ish, but it's like we have a value for this that is reliant on the
order always being PMD and then sort of awkwardly adjusting for MTHP.
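(To make the scaling concrete: assuming 4K pages so HPAGE_PMD_ORDER is 9, the
default max_ptes_none of 511 scales to 511 >> (9 - 4) = 15 for an order-4
candidate, i.e. at most 15 of its 16 PTEs may be none.)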
I guess we're stuck with it though since we have:
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
I guess a more sane version of this would be a ratio or something...
Anyway probably out of scope here.
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_pte = pte; _pte < pte + (1 << order);
Hmm is this correct? I think shifting an int is probably a bad idea even if we
can get away with it for even PUD order atm (though... 64KB ARM hm), wouldn't
_BITUL(order) be better?
Might also be worth putting into a separate local var.
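Something along these lines maybe (untested sketch, nr_ptes is just an
illustrative name):

	const unsigned long nr_ptes = _BITUL(order);

	for (_pte = pte; _pte < pte + nr_ptes;
	     _pte++, address += PAGE_SIZE) {
		pte_t pteval = ptep_get(_pte);
		/* ... the existing per-PTE checks stay unchanged ... */
	}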
> _pte++, address += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -567,7 +569,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> + none_or_zero <= scaled_max_ptes_none)) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -595,8 +597,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> /* See collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> - if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> + /*
> + * TODO: Support shared pages without leading to further
> + * mTHP collapses. Currently bringing in new pages via
> + * shared may cause a future higher order collapse on a
> + * rescan of the same range.
> + */
Can we document this if not already in a subsequent commit? :) Thanks
> + if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> + shared > khugepaged_max_ptes_shared)) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> goto out;
> @@ -697,15 +705,16 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> struct vm_area_struct *vma,
> unsigned long address,
> spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist,
> + unsigned int order)
> {
> - unsigned long end = address + HPAGE_PMD_SIZE;
> + unsigned long end = address + (PAGE_SIZE << order);
> struct folio *src, *tmp;
> pte_t pteval;
> pte_t *_pte;
> unsigned int nr_ptes;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> + for (_pte = pte; _pte < pte + (1 << order); _pte += nr_ptes,
Same comment as above re: 1 << order.
> address += nr_ptes * PAGE_SIZE) {
> nr_ptes = 1;
> pteval = ptep_get(_pte);
> @@ -761,7 +770,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> pmd_t *pmd,
> pmd_t orig_pmd,
> struct vm_area_struct *vma,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist,
> + unsigned int order)
Same comment as above re: parameter ordering.
> {
> spinlock_t *pmd_ptl;
>
> @@ -778,7 +788,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> * Release both raw and compound pages isolated
> * in __collapse_huge_page_isolate.
> */
> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> + release_pte_pages(pte, pte + (1 << order), compound_pagelist);
> }
>
> /*
> @@ -799,7 +809,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> unsigned long address, spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist, unsigned int order)
Same comment as before re: parameter ordering
> {
> unsigned int i;
> int result = SCAN_SUCCEED;
> @@ -807,7 +817,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> /*
> * Copying pages' contents is subject to memory poison at any iteration.
> */
> - for (i = 0; i < HPAGE_PMD_NR; i++) {
> + for (i = 0; i < (1 << order); i++) {
Same comment as before about 1 << order
> pte_t pteval = ptep_get(pte + i);
> struct page *page = folio_page(folio, i);
> unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -826,10 +836,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>
> if (likely(result == SCAN_SUCCEED))
> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> - compound_pagelist);
> + compound_pagelist, order);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> - compound_pagelist);
> + compound_pagelist, order);
>
> return result;
> }
> @@ -1005,11 +1015,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> static int __collapse_huge_page_swapin(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long haddr, pmd_t *pmd,
> - int referenced)
> + int referenced, unsigned int order)
> {
> int swapped_in = 0;
> vm_fault_t ret = 0;
> - unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> + unsigned long address, end = haddr + (PAGE_SIZE << order);
> int result;
> pte_t *pte = NULL;
> spinlock_t *ptl;
> @@ -1040,6 +1050,19 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> if (!is_swap_pte(vmf.orig_pte))
> continue;
>
> + /*
> + * TODO: Support swapin without leading to further mTHP
> + * collapses. Currently bringing in new pages via swapin may
> + * cause a future higher order collapse on a rescan of the same
> + * range.
> + */
> + if (order != HPAGE_PMD_ORDER) {
> + pte_unmap(pte);
> + mmap_read_unlock(mm);
> + result = SCAN_EXCEED_SWAP_PTE;
> + goto out;
> + }
> +
> vmf.pte = pte;
> vmf.ptl = ptl;
> ret = do_swap_page(&vmf);
> @@ -1160,7 +1183,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * that case. Continuing to collapse causes inconsistency.
> */
> result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced);
> + referenced, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1208,7 +1231,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - &compound_pagelist);
> + &compound_pagelist,
> + HPAGE_PMD_ORDER);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_PMD_NULL;
> @@ -1238,7 +1262,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, address, pte_ptl,
> - &compound_pagelist);
> + &compound_pagelist, HPAGE_PMD_ORDER);
> pte_unmap(pte);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-20 13:23 ` Lorenzo Stoakes
@ 2025-08-20 15:40 ` Nico Pache
2025-08-21 3:41 ` Wei Yang
0 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-20 15:40 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Wed, Aug 20, 2025 at 7:25 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 19, 2025 at 07:41:55AM -0600, Nico Pache wrote:
> > For khugepaged to support different mTHP orders, we must generalize this
> > to check if the PMD is not shared by another VMA and the order is enabled.
> >
> > To ensure madvise_collapse can support working on mTHP orders without the
> > PMD order enabled, we need to convert hugepage_vma_revalidate to take a
> > bitmap of orders.
> >
> > No functional change in this patch.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> LGTM (modulo nit/query below) so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks :)
>
> > ---
> > mm/khugepaged.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b7b98aebb670..2d192ec961d2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -917,7 +917,7 @@ static int collapse_find_target_node(struct collapse_control *cc)
> > static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > bool expect_anon,
> > struct vm_area_struct **vmap,
> > - struct collapse_control *cc)
> > + struct collapse_control *cc, unsigned long orders)
> > {
> > struct vm_area_struct *vma;
> > enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> > @@ -930,9 +930,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > if (!vma)
> > return SCAN_VMA_NULL;
> >
> > + /* Always check the PMD order to insure its not shared by another VMA */
>
> NIT: ensure not insure.
ack, I'll fix that!
>
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> > return SCAN_ADDRESS_RANGE;
> > - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> > + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
> > return SCAN_VMA_CHECK;
> > /*
> > * Anon VMA expected, the address may be unmapped then
> > @@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > + BIT(HPAGE_PMD_ORDER));
>
> Shouldn't this be PMD order? Seems equivalent.
Yeah I'm actually not sure why we have both... they seem to be the
same thing, but perhaps there is some reason for having two...
>
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > @@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > + BIT(HPAGE_PMD_ORDER));
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > @@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > mmap_read_lock(mm);
> > mmap_locked = true;
> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > - cc);
> > + cc, BIT(HPAGE_PMD_ORDER));
> > if (result != SCAN_SUCCEED) {
> > last_fail = result;
> > goto out_nolock;
> > --
> > 2.50.1
> >
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-19 21:55 ` [PATCH v10 00/13] khugepaged: mTHP support Andrew Morton
@ 2025-08-20 15:55 ` Nico Pache
0 siblings, 0 replies; 75+ messages in thread
From: Nico Pache @ 2025-08-20 15:55 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, baohua,
willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap, hughd
On Tue, Aug 19, 2025 at 3:55 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 19 Aug 2025 07:41:52 -0600 Nico Pache <npache@redhat.com> wrote:
>
> > The following series provides khugepaged with the capability to collapse
> > anonymous memory regions to mTHPs.
> >
> > ...
> >
> > - I created a test script that I used to push khugepaged to its limits
> > while monitoring a number of stats and tracepoints. The code is
> > available here[1] (Run in legacy mode for these changes and set mthp
> > sizes to inherit)
>
> Could this be turned into something in tools/testing/selftests/mm/?
Yep! I was actually working on some selftests for this before I hit a
weird bug during some of my testing, which took precedence over that.
I was planning on sending a separate series for the testing. One of
the pain points was that selftests helpers were set up for PMDs, but I
think Baolin just cleaned that up in his khugepaged mTHP shmem series
(still need to review). So it should be a lot easier to implement now.
>
> > V10 Changes:
>
> I'll add this to mm-new, thanks. I'll suppress the usual emails.
Thank you :)
-- Nico
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2025-08-20 11:21 ` Lorenzo Stoakes
@ 2025-08-20 16:35 ` Nico Pache
2025-08-22 10:21 ` Lorenzo Stoakes
0 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-20 16:35 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Wed, Aug 20, 2025 at 5:22 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 19, 2025 at 07:41:54AM -0600, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create collapse_single_pmd to increase code reuse and create an entry
> > point to these two users.
> >
> > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > collapse_single_pmd function. This introduces a minor behavioral change
> > that is most likely an undiscovered bug. The current implementation of
> > khugepaged tests collapse_test_exit_or_disable before calling
> > collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> > case. By unifying these two callers madvise_collapse now also performs
> > this check.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 94 ++++++++++++++++++++++++++-----------------------
> > 1 file changed, 49 insertions(+), 45 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 0e7bbadf03ee..b7b98aebb670 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2382,6 +2382,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static int collapse_single_pmd(unsigned long addr,
> > + struct vm_area_struct *vma, bool *mmap_locked,
> > + struct collapse_control *cc)
> > +{
> > + int result = SCAN_FAIL;
>
> You assign result in all branches, so this can be uninitialised.
ack, thanks.
>
> > + struct mm_struct *mm = vma->vm_mm;
> > +
> > + if (!vma_is_anonymous(vma)) {
> > + struct file *file = get_file(vma->vm_file);
> > + pgoff_t pgoff = linear_page_index(vma, addr);
> > +
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > + fput(file);
> > + if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > + mmap_read_lock(mm);
> > + *mmap_locked = true;
> > + if (collapse_test_exit_or_disable(mm)) {
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + result = SCAN_ANY_PROCESS;
> > + goto end;
>
> Don't love that in e.g. collapse_scan_mm_slot() we are using the mmap lock being
> disabled as in effect an error code.
>
> Is SCAN_ANY_PROCESS correct here? Because in collapse_scan_mm_slot() you'd
> previously:
https://lore.kernel.org/lkml/a881ed65-351a-469f-b625-a3066d0f1d5c@linux.alibaba.com/
Baolin brought up a good point a while back that if
collapse_test_exit_or_disable returns true we will be breaking out of
the loop and should change the return value to indicate this. So to
combine the madvise breakout and the scan_slot breakout we drop the
lock and return SCAN_ANY_PROCESS.
>
> if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> mmap_read_lock(mm);
> if (collapse_test_exit_or_disable(mm))
> goto breakouterloop;
> ...
> }
>
> But now you're setting result = SCAN_ANY_PROCESS rather than
> SCAN_PTE_MAPPED_HUGEPAGE in this instance?
>
> You don't mention that you're changing this, or at least explicitly enough,
> the commit message should state that you're changing this and explain why
> it's ok.
I do state it but perhaps I need to be more verbose! I will update the
message to state that we are also changing the result value.
>
> This whole file is horrid, and it's kinda an aside, but I really wish we
> had some comment going through each of the scan_result cases and explaining
> what each one meant.
Yeah it's been a huge pain to have to investigate what everything is
supposed to mean, and I often have to go searching to confirm things.
include/trace/events/huge_memory.h has a "good" summary of them
>
> Also I think:
>
> return SCAN_ANY_PROCESS;
>
> Is better than:
>
> result = SCAN_ANY_PROCESS;
> goto end;
I agree! I will change that :)
> ...
> end:
> return result;
>
> > + }
> > + result = collapse_pte_mapped_thp(mm, addr,
> > + !cc->is_khugepaged);
>
> Hm another change here, in the original code in collapse_scan_mm_slot()
> this is:
>
> *result = collapse_pte_mapped_thp(mm,
> khugepaged_scan.address, false);
>
> Presumably collapse_scan_mm_slot() is only ever invoked with
> cc->is_khugepaged?
Correct, but the madvise_collapse calls this with true, hence why it
now depends on the is_khugepaged variable. No functional change here.
>
> Maybe worth adding a VM_WARN_ON_ONCE(!cc->is_khugepaged) at the top of
> collapse_scan_mm_slot() to assert this (and other places where your change
> assumes this to be the case).
Ok I will investigate doing that but it would take a huge mistake to
hit that assertion.
>
>
> > + if (result == SCAN_PMD_MAPPED)
> > + result = SCAN_SUCCEED;
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + }
> > + } else {
> > + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > + }
> > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > + ++khugepaged_pages_collapsed;
>
> Similarly, presumably because collapse_scan_mm_slot() is only ever invoked in
> the khugepaged case, this didn't have the cc->is_khugepaged check?
Correct, we only increment this when it's khugepaged, so we need to
guard it so madvise_collapse won't increment this.
>
> > +end:
> > + return result;
> > +}
>
> There's a LOT of nesting going on here, I think we can simplify this a
> lot. If we make the change I noted above re: returning SCAN_ANY_PROCESS, we
> can move the end label up a bit and avoid a ton of nesting, e.g.:
Ah I like this much more, I will try to implement/test it.
>
> static int collapse_single_pmd(unsigned long addr,
> struct vm_area_struct *vma, bool *mmap_locked,
> struct collapse_control *cc)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct file *file;
> pgoff_t pgoff;
> int result;
>
> if (vma_is_anonymous(vma)) {
> result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> goto end;
> }
>
> file = get_file(vma->vm_file);
> pgoff = linear_page_index(vma, addr);
>
> mmap_read_unlock(mm);
> *mmap_locked = false;
> result = collapse_scan_file(mm, addr, file, pgoff, cc);
> fput(file);
> if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> goto end;
>
> mmap_read_lock(mm);
> *mmap_locked = true;
> if (collapse_test_exit_or_disable(mm)) {
> mmap_read_unlock(mm);
> *mmap_locked = false;
> return SCAN_ANY_PROCESS;
> }
> result = collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> if (result == SCAN_PMD_MAPPED)
> result = SCAN_SUCCEED;
> mmap_read_unlock(mm);
> *mmap_locked = false;
>
> end:
> if (cc->is_khugepaged && result == SCAN_SUCCEED)
> ++khugepaged_pages_collapsed;
>
> return result;
> }
>
> (untested, thrown together so do double check!)
>
> > +
> > static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> > struct collapse_control *cc)
> > __releases(&khugepaged_mm_lock)
> > @@ -2455,34 +2499,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > hend);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma,
> > - khugepaged_scan.address);
> > -
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > - *result = collapse_scan_file(mm,
> > - khugepaged_scan.address, file, pgoff, cc);
> > - fput(file);
> > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > - mmap_read_lock(mm);
> > - if (collapse_test_exit_or_disable(mm))
> > - goto breakouterloop;
> > - *result = collapse_pte_mapped_thp(mm,
> > - khugepaged_scan.address, false);
> > - if (*result == SCAN_PMD_MAPPED)
> > - *result = SCAN_SUCCEED;
> > - mmap_read_unlock(mm);
> > - }
> > - } else {
> > - *result = collapse_scan_pmd(mm, vma,
> > - khugepaged_scan.address, &mmap_locked, cc);
> > - }
> > -
> > - if (*result == SCAN_SUCCEED)
> > - ++khugepaged_pages_collapsed;
> >
> > + *result = collapse_single_pmd(khugepaged_scan.address,
> > + vma, &mmap_locked, cc);
> > /* move to next address */
> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > progress += HPAGE_PMD_NR;
> > @@ -2799,34 +2818,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > mmap_assert_locked(mm);
> > memset(cc->node_load, 0, sizeof(cc->node_load));
> > nodes_clear(cc->alloc_nmask);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma, addr);
> >
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > - fput(file);
> > - } else {
> > - result = collapse_scan_pmd(mm, vma, addr,
> > - &mmap_locked, cc);
> > - }
> > + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > +
>
> Ack the fact you noted the behaviour change re:
> collapse_test_exit_or_disable() that seems fine.
>
> > if (!mmap_locked)
> > *lock_dropped = true;
> >
> > -handle_result:
> > switch (result) {
> > case SCAN_SUCCEED:
> > case SCAN_PMD_MAPPED:
> > ++thps;
> > break;
> > - case SCAN_PTE_MAPPED_HUGEPAGE:
> > - BUG_ON(mmap_locked);
> > - mmap_read_lock(mm);
> > - result = collapse_pte_mapped_thp(mm, addr, true);
> > - mmap_read_unlock(mm);
> > - goto handle_result;
>
> One thing that differs with new code her is we filter SCAN_PMD_MAPPED to
> SCAN_SUCCEED.
>
> I was about to say 'but ++thps - is this correct' but now I realise this
> was looping back on itself with a goto to do just that... ugh ye gads.
>
Anyway that's fine because it doesn't change anything.
>
> Re: switch statement in general, again would be good to always have each
> scan possibility in switch statements, but perhaps given so many not
> practical :)
Yeah it may be worth investigating for future changes I have for
khugepaged (including the new switch statement I implement later and
you commented on)
>
> (that way the compiler warns on missing a newly added enum val)
>
> > /* Whitelisted set of results where continuing OK */
> > + case SCAN_PTE_MAPPED_HUGEPAGE:
> > case SCAN_PMD_NULL:
> > case SCAN_PTE_NON_PRESENT:
> > case SCAN_PTE_UFFD_WP:
> > --
Thanks for the review :)
-- Nico
> > 2.50.1
> >
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 06/13] khugepaged: add mTHP support
2025-08-19 13:41 ` [PATCH v10 06/13] khugepaged: add " Nico Pache
@ 2025-08-20 18:29 ` Lorenzo Stoakes
2025-09-02 20:12 ` Nico Pache
0 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-20 18:29 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:58AM -0600, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning PMD ranges for potential collapse candidates, keep track
> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> mTHPs are enabled we remove the restriction of max_ptes_none during the
> scan phase so we don't bailout early and miss potential mTHP candidates.
>
> A new function collapse_scan_bitmap is used to perform binary recursion on
> the bitmap and determine the best eligible order for the collapse.
> A stack struct is used instead of traditional recursion. max_ptes_none
> will be scaled by the attempted collapse order to determine how "full" an
> order must be before being considered for collapse.
>
> Once we determine what mTHP sizes fits best in that PMD range a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory.
>
> For orders configured with "always", we perform greedy collapsing
> to that order without considering bit density.
>
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we don't perform the collapse. This is because adding new entries
> can lead to new none pages, and these may lead to constant promotion into
> a higher order (m)THP. A similar issue can occur with "max_ptes_none >
> HPAGE_PMD_NR/2" due to the fact that a collapse will introduce at least 2x
> the number of pages, and on a future scan will satisfy the promotion
> condition once again.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the non-PMD case this is not true, and we must keep the lock to prevent
> changes to the VMA from occurring.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
Yes I think this has to remain the case unfortunately as we override
sysfs-specified orders for MADV_COLLAPSE and there's no sensible way to
determine what order we ought to be using.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
You've gone from small incremental changes to a huge one here... for the
sake of reviewer sanity at least, any chance of breaking this up?
> ---
> include/linux/khugepaged.h | 4 +
> mm/khugepaged.c | 236 +++++++++++++++++++++++++++++--------
> 2 files changed, 188 insertions(+), 52 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70cff..d12cdb9ef3ba 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #ifndef _LINUX_KHUGEPAGED_H
> #define _LINUX_KHUGEPAGED_H
> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
I guess this makes sense as by definition 2 pages is the least it could
possibly be.
> +#define KHUGEPAGED_MIN_MTHP_NR (1 << KHUGEPAGED_MIN_MTHP_ORDER)
Surely KHUGEPAGED_MIN_NR_MTHP_PTES would be more meaningful?
> +#define MAX_MTHP_BITMAP_SIZE (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
This is confusing - size of what?
If it's number of bits surely this should be ilog2(MAX_PTRS_PER_PTE) -
KHUGEPAGED_MIN_MTHP_ORDER?
This seems to be more so 'the maximum value that could contain the bits', right?
I think this is just wrong though, see below at DECLARE_BITMAP() stuff.
> +#define MTHP_BITMAP_SIZE (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
Hard to know how this relates to MAX_MTHP_BITMAP_SIZE?
I guess this is the current bitmap size indicating all that is possible,
but if these are all #define's what is this accomplishing?
For all - please do not do (1 << xxx)! This can lead to sign-extension bugs at least
in theory, use _BITUL(...), it's neater too.
NIT but the whitespace is all screwed up here, e.g. between the
KHUGEPAGED_MIN_MTHP_ORDER and KHUGEPAGED_MIN_MTHP_NR definitions.
>
> #include <linux/mm.h>
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 074101d03c9d..1ad7e00d3fd6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +struct scan_bit_state {
> + u8 order;
> + u16 offset;
> +};
> +
> struct collapse_control {
> bool is_khugepaged;
>
> @@ -102,6 +107,18 @@ struct collapse_control {
>
> /* nodemask for allocation fallback */
> nodemask_t alloc_nmask;
> +
> + /*
> + * bitmap used to collapse mTHP sizes.
> + * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
I'm not sure what this '1bit = xxx' comment means?
> + */
> + DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
Hmm this seems wrong.
DECLARE_BITMAP(..., val) is expressed as:
#define DECLARE_BITMAP(name,bits) \
unsigned long name[BITS_TO_LONGS(bits)]
So the 2nd param should be number of bits.
But MAX_MTHP_BITMAP_SIZE is:
(1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
So typically:
(1 << (9 - 2)) = 128
And BITS_TO_LONGS is defined as:
__KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
So essentially this will be 128 / 8 on a 64-bit system so 16 bytes to
store... 7 bits?
Unless I'm missing something here?
> + DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
Same comment as above obviously. But also this is kind of horrible, why are
we putting a copy of this entire bitmap on the stack every time we declare
a cc?
> + struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> + .is_khugepaged = true,
> };
Why are we moving this here?
>
> /**
> @@ -854,10 +871,6 @@ static void khugepaged_alloc_sleep(void)
> remove_wait_queue(&khugepaged_wait, &wait);
> }
>
> -struct collapse_control khugepaged_collapse_control = {
> - .is_khugepaged = true,
> -};
> -
> static bool collapse_scan_abort(int nid, struct collapse_control *cc)
> {
> int i;
> @@ -1136,17 +1149,19 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>
> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> int referenced, int unmapped,
> - struct collapse_control *cc)
> + struct collapse_control *cc, bool *mmap_locked,
> + unsigned int order, unsigned long offset)
> {
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte = NULL, mthp_pte;
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> int result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + unsigned long _address = address + offset * PAGE_SIZE;
This name is really horrible. Please name it sensibly.
It feels like address ought to be consistently the base of the THP or mTHP
we wish to collapse, and if we need something PMD aligned for some reason
we should rename _that_ to e.g. pmd_address.
Orrr it could be mthp_address...
Perhaps we could just figure that out here and pass only the
address... aligning to PMD boundary shouldn't be hard/costly.
But it may indicate we need further refactorisation so we don't need to
paper over cracks + pass around a PMD address to do things when that may
not be where the (m)THP range begins.
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1155,16 +1170,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * The allocation can take potentially a long time if it involves
> * sync compaction, and we do not need to hold the mmap_lock during
> * that. We will recheck the vma after taking it again in write mode.
> + * If collapsing mTHPs we may have already released the read_lock.
> */
> - mmap_read_unlock(mm);
> + if (*mmap_locked) {
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + }
>
> - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> + result = alloc_charge_folio(&folio, mm, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - BIT(HPAGE_PMD_ORDER));
> + *mmap_locked = true;
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
I mean this is kind of going back to previous commits, but it's really ugly
to pass a BIT(xxx) here, is that really necessary? Can't we just pass in
the order?
It's also inconsistent with other calls like
e.g. __collapse_huge_page_swapin() below which passes the order.
Same goes obv. for all such invocations.
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1182,13 +1201,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * released when it fails. So we jump out_nolock directly in
> * that case. Continuing to collapse causes inconsistency.
> */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> + referenced, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
>
> mmap_read_unlock(mm);
> + *mmap_locked = false;
> /*
> * Prevent all access to pagetables with the exception of
> * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1198,8 +1218,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - BIT(HPAGE_PMD_ORDER));
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -1210,11 +1229,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> anon_vma_lock_write(vma->anon_vma);
>
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> - address + HPAGE_PMD_SIZE);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> + _address + (PAGE_SIZE << order));
This _address is horrible. That really does have to change.
> mmu_notifier_invalidate_range_start(&range);
>
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
Odd whitespace...
> /*
> * This removes any huge TLB entry from the CPU so we won't allow
> * huge and small TLB entries for the same virtual address to
> @@ -1228,19 +1248,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
I see we already have a 'convention' of _ prefix on the pmd param, but two
wrongs don't make a right...
> if (pte) {
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - &compound_pagelist,
> - HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> + &compound_pagelist, order);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_PMD_NULL;
> }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - if (pte)
> - pte_unmap(pte);
Why are we removing this?
> spin_lock(pmd_ptl);
> BUG_ON(!pmd_none(*pmd));
> /*
> @@ -1255,17 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> }
>
> /*
> - * All pages are isolated and locked so anon_vma rmap
> - * can't run anymore.
> + * For PMD collapse all pages are isolated and locked so anon_vma
> + * rmap can't run anymore
> */
> - anon_vma_unlock_write(vma->anon_vma);
> + if (order == HPAGE_PMD_ORDER)
> + anon_vma_unlock_write(vma->anon_vma);
Hmm this is introducing a horrible new way for things to go wrong. And
there's now a whole host of terrible error paths that can go wrong very
easily around rmap locks and yeah, no way we cannot do it this way.
rmap locks are VERY sensitive and the ordering of the locking matters a
great deal (see top of mm/rmap.c). So we have to be SO careful here.
I suggest you simply have a boolean 'anon_vma_locked' or something like
this, and get rid of these horrible additional code paths and the second
order == HPAGE_PMD_ORDER check.
We'll track whether or not the lock is held and thereby needs releasing
that way instead.
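Roughly something like this (a rough sketch only, helper name invented):

	/* Drop the anon_vma write lock exactly once, wherever we happen to be. */
	static void collapse_anon_vma_unlock(struct vm_area_struct *vma,
					     bool *anon_vma_locked)
	{
		if (*anon_vma_locked) {
			anon_vma_unlock_write(vma->anon_vma);
			*anon_vma_locked = false;
		}
	}

i.e. set anon_vma_locked = true right after anon_vma_lock_write(), call this
in the PMD-order path once all pages are isolated, and call it again on the
error/exit paths, so there's a single place deciding whether the lock still
needs releasing.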
Also, and very importantly - are you 100% sure you can't possibly have a
deadlock or issue beyond this point if you don't release the rmap lock?
This is veeeery important, as there can be implicit assumptions around
whether or not one can acquire these locks and you basically have to audit
ALL code over which this lock is held.
I'm speaking from hard experience here having bumped into this in various
attempts at work relating to this stuff...
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> - vma, address, pte_ptl,
> - &compound_pagelist, HPAGE_PMD_ORDER);
> - pte_unmap(pte);
> + vma, _address, pte_ptl,
> + &compound_pagelist, order);
> if (unlikely(result != SCAN_SUCCEED))
> - goto out_up_write;
> + goto out_unlock_anon_vma;
See above...
>
> /*
> * The smp_wmb() inside __folio_mark_uptodate() ensures the
> @@ -1273,33 +1290,115 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * write.
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> - _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> - _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> - spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> - folio_add_lru_vma(folio, vma);
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - set_pmd_at(mm, address, pmd, _pmd);
> - update_mmu_cache_pmd(vma, address, pmd);
> - deferred_split_folio(folio, false);
> - spin_unlock(pmd_ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + pgtable = pmd_pgtable(_pmd);
> + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> + spin_lock(pmd_ptl);
> + BUG_ON(!pmd_none(*pmd));
I know you're refactoring this, but be good to change this to a
WARN_ON_ONCE(), BUG_ON() is verboten unless it's absolutely definitely
going to be a kernel nuclear event, so worth changing things up as we go.
> + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + set_pmd_at(mm, address, pmd, _pmd);
> + update_mmu_cache_pmd(vma, address, pmd);
> + deferred_split_folio(folio, false);
> + spin_unlock(pmd_ptl);
> + } else { /* mTHP collapse */
> + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
I guess it's a rule that each THP or mTHP range spanned must span one and
only one folio.
Not sure &folio->page has a future though.
Maybe better to use folio_page(folio, 0)?
> + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> + spin_lock(pmd_ptl);
> + BUG_ON(!pmd_none(*pmd));
Having said the above, this is strictly introducing a new BUG_ON() which is
a no-no, please make it a WARN_ON_ONCE().
> + folio_ref_add(folio, (1 << order) - 1);
Again no 1 << x please.
Do we do something similar somewhere else for mthp ref counting? Can we
share code somehow?
> + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
Please avoid 1 << order, and I think at this point since you reference it a
bunch of times, just store a local var like nr_pages or sth?
> + update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +
> + smp_wmb(); /* make pte visible before pmd */
Can you give some detail as to why this will work here and why it is
necessary?
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
If we're updating PTE entries, why do we need to assign the PMD entry?
> + spin_unlock(pmd_ptl);
> + }
This deeply, badly needs to be refactored into something that both shares
code and separates out these two operations.
This function is disgustingly long as it is, and that's not your fault, but
let's try to make things better as we go.
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> +out_unlock_anon_vma:
> + if (order != HPAGE_PMD_ORDER)
> + anon_vma_unlock_write(vma->anon_vma);
Obviously again as above, we need to simplify this and get rid of this
whole bit.
> out_up_write:
> + if (pte)
> + pte_unmap(pte);
OK I guess you moved this from above down here? Is this a valid place to do this?
> mmap_write_unlock(mm);
> out_nolock:
> + *mmap_locked = false;
This is kind of horrible, we now have pretty mad logic around who sets
mmap_locked and where.
Can we just do this at the call sites so we avoid that?
I mean anything we do with this is hideous, but that'd be less confusing I think.
> if (folio)
> folio_put(folio);
> trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> return result;
> }
>
> +/* Recursive function to consume the bitmap */
Err... please don't? Kernel stack is a seriously finite resource, we do not
want recursion at all.
But I'm not actually seeing any recursion here? Am I missing something?
> +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> + int referenced, int unmapped, struct collapse_control *cc,
> + bool *mmap_locked, unsigned long enabled_orders)
This is a complicated and confusing function, it requires a comment
describing how it works.
> +{
> + u8 order, next_order;
> + u16 offset, mid_offset;
> + int num_chunks;
> + int bits_set, threshold_bits;
> + int top = -1;
Err why do we start at -1 then immediately increment it?
> + int collapsed = 0;
> + int ret;
> + struct scan_bit_state state;
> + bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
Extraneous outer parens.
> +
> + cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> + { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
This is the same as
cc->mthp_bitmap_stack[0] = ...;
top = 1;
No?
This is really horrible. Can we just have a helper function for this
please?
Like:
static int mthp_push_stack(struct collapse_control *cc,
int index, u8 order, u16 offset)
{
struct scan_bit_state *state = &cc->mthp_bitmap_stack[index];
VM_WARN_ON(index >= MAX_MTHP_BITMAP_SIZE);
state->order = order;
state->offset = offset;
return index + 1;
}
And can invoke via:
top = mthp_push_stack(cc, top, order, offset);
Or pass index as a pointer possibly also.
> +
> + while (top >= 0) {
> + state = cc->mthp_bitmap_stack[top--];
OK so this is the recursive bit...
Oh man this function so needs a comment describing what it does, seriously.
I think honestly for the sake of my own sanity I'm going to hold off reviewing
the rest of this until there's something describing the algorithm, in
detail here, above the function.
> + order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> + offset = state.offset;
> + num_chunks = 1 << (state.order);
> + /* Skip mTHP orders that are not enabled */
> + if (!test_bit(order, &enabled_orders))
> + goto next_order;
> +
> + /* copy the relavant section to a new bitmap */
> + bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> + MTHP_BITMAP_SIZE);
> +
> + bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> + threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> + >> (HPAGE_PMD_ORDER - state.order);
> +
> + /* Check if the region is "almost full" based on the threshold */
> + if (bits_set > threshold_bits || is_pmd_only
> + || test_bit(order, &huge_anon_orders_always)) {
> + ret = collapse_huge_page(mm, address, referenced, unmapped,
> + cc, mmap_locked, order,
> + offset * KHUGEPAGED_MIN_MTHP_NR);
> + if (ret == SCAN_SUCCEED) {
> + collapsed += (1 << order);
> + continue;
> + }
> + }
> +
> +next_order:
> + if (state.order > 0) {
> + next_order = state.order - 1;
> + mid_offset = offset + (num_chunks / 2);
> + cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> + { next_order, mid_offset };
> + cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> + { next_order, offset };
> + }
> + }
> + return collapsed;
> +}
> +
> static int collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long address, bool *mmap_locked,
> @@ -1307,31 +1406,60 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> + int i;
> int result = SCAN_FAIL, referenced = 0;
> int none_or_zero = 0, shared = 0;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long _address;
> + unsigned long enabled_orders;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
> + bool is_pmd_only;
> bool writable = false;
> -
> + int chunk_none_count = 0;
> + int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> + unsigned long tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> result = find_pmd_or_thp_or_none(mm, address, &pmd);
> if (result != SCAN_SUCCEED)
> goto out;
>
> + bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> + bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
Having this 'temp' thing on the stack for everyone is just horrid.
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> + if (cc->is_khugepaged)
> + enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> + tva_flags, THP_ORDERS_ALL_ANON);
> + else
> + enabled_orders = BIT(HPAGE_PMD_ORDER);
> +
> + is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
This is horrid, can we have a function broken out to do this please?
In general if you keep open coding stuff, just write a static function for
it, the compiler is smart enough to inline.
> +
> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (!pte) {
> result = SCAN_PMD_NULL;
> goto out;
> }
>
> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> - _pte++, _address += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + /*
> + * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> + * there are pages in this chunk keep track of it in the bitmap
> + * for mTHP collapsing.
> + */
> + if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> + if (i > 0 && chunk_none_count <= scaled_none)
> + bitmap_set(cc->mthp_bitmap,
> + (i - 1) / KHUGEPAGED_MIN_MTHP_NR, 1);
> + chunk_none_count = 0;
> + }
This whole thing is really confusing and you are not explaining the
algorithm here at all.
This requires a comment, and really this bit should be separated out please.
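e.g. something like (rough sketch, helper name made up):

	/*
	 * Close out a KHUGEPAGED_MIN_MTHP_NR-sized chunk: mark it as utilized
	 * in the bitmap if it had few enough none/zero PTEs.
	 */
	static void khugepaged_mark_chunk(struct collapse_control *cc, int chunk,
					  int chunk_none_count, int scaled_none)
	{
		if (chunk_none_count <= scaled_none)
			bitmap_set(cc->mthp_bitmap, chunk, 1);
	}

with the loop just computing the chunk index and resetting chunk_none_count.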
> +
> + _pte = pte + i;
> + _address = address + i * PAGE_SIZE;
> pte_t pteval = ptep_get(_pte);
> if (is_swap_pte(pteval)) {
> ++unmapped;
> @@ -1354,10 +1482,11 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> }
> }
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> + ++chunk_none_count;
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> - (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> + (!cc->is_khugepaged || !is_pmd_only ||
> + none_or_zero <= khugepaged_max_ptes_none)) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1453,6 +1582,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> address)))
> referenced++;
> }
> +
> if (!writable) {
> result = SCAN_PAGE_RO;
> } else if (cc->is_khugepaged &&
> @@ -1465,10 +1595,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> - result = collapse_huge_page(mm, address, referenced,
> - unmapped, cc);
> - /* collapse_huge_page will return with the mmap_lock released */
> - *mmap_locked = false;
> + result = collapse_scan_bitmap(mm, address, referenced, unmapped, cc,
> + mmap_locked, enabled_orders);
> + if (result > 0)
> + result = SCAN_SUCCEED;
> + else
> + result = SCAN_FAIL;
We're reusing result both as an enum value and as storage for the number of
collapsed PTE entries?
Can we just use a new local variable? Thanks
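i.e. something like (variable name just illustrative):

	int nr_collapsed;

	nr_collapsed = collapse_scan_bitmap(mm, address, referenced, unmapped,
					    cc, mmap_locked, enabled_orders);
	result = nr_collapsed > 0 ? SCAN_SUCCEED : SCAN_FAIL;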
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
> --
> 2.50.1
>
I will review the bitmap/chunk stuff in more detail once the algorithm is
commented.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-20 15:40 ` Nico Pache
@ 2025-08-21 3:41 ` Wei Yang
2025-08-21 14:09 ` Zi Yan
0 siblings, 1 reply; 75+ messages in thread
From: Wei Yang @ 2025-08-21 3:41 UTC (permalink / raw)
To: Nico Pache
Cc: Lorenzo Stoakes, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On Wed, Aug 20, 2025 at 09:40:40AM -0600, Nico Pache wrote:
[...]
>>
>> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>> > return SCAN_ADDRESS_RANGE;
>> > - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>> > + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
>> > return SCAN_VMA_CHECK;
>> > /*
>> > * Anon VMA expected, the address may be unmapped then
>> > @@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> > goto out_nolock;
>> >
>> > mmap_read_lock(mm);
>> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> > + BIT(HPAGE_PMD_ORDER));
>>
>> Shouldn't this be PMD order? Seems equivalent.
>Yeah i'm actually not sure why we have both... they seem to be the
>same thing, but perhaps there is some reason for having two...
I am confused by these two: PMD_ORDER above and HPAGE_PMD_ORDER here.
Do we have a guide on when to use which?
>>
>> > if (result != SCAN_SUCCEED) {
>> > mmap_read_unlock(mm);
>> > goto out_nolock;
>> > @@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> > * mmap_lock.
>> > */
>> > mmap_write_lock(mm);
>> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> > + BIT(HPAGE_PMD_ORDER));
>> > if (result != SCAN_SUCCEED)
>> > goto out_up_write;
>> > /* check if the pmd is still valid */
>> > @@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>> > mmap_read_lock(mm);
>> > mmap_locked = true;
>> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
>> > - cc);
>> > + cc, BIT(HPAGE_PMD_ORDER));
>> > if (result != SCAN_SUCCEED) {
>> > last_fail = result;
>> > goto out_nolock;
>> > --
>> > 2.50.1
>> >
>>
>
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders
2025-08-19 13:41 ` [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-08-21 12:05 ` Lorenzo Stoakes
2025-08-21 12:33 ` Dev Jain
2025-08-21 16:54 ` Steven Rostedt
0 siblings, 2 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 12:05 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:41:59AM -0600, Nico Pache wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. Skip these cases until we have a way to check
> if its ok to collapse to a smaller mTHP size (like in the case of a
> partially mapped folio).
>
> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
>
> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
Other than comment below, LGTM so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/khugepaged.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1ad7e00d3fd6..6a4cf7e4a7cc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -611,6 +611,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + /*
> + * TODO: In some cases of partially-mapped folios, we'd actually
> + * want to collapse.
> + */
Not a fan of adding TODOs in code, they have a habit of being left forever. I'd
maybe put a more fully written-out comment, something similar to the commit message.
> + if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> + result = SCAN_PTE_MAPPED_HUGEPAGE;
> + goto out;
> + }
> +
> /* See collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders
2025-08-21 12:05 ` Lorenzo Stoakes
@ 2025-08-21 12:33 ` Dev Jain
2025-08-22 10:33 ` Lorenzo Stoakes
2025-08-21 16:54 ` Steven Rostedt
1 sibling, 1 reply; 75+ messages in thread
From: Dev Jain @ 2025-08-21 12:33 UTC (permalink / raw)
To: Lorenzo Stoakes, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On 21/08/25 5:35 pm, Lorenzo Stoakes wrote:
>
>
>> ---
>> mm/khugepaged.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 1ad7e00d3fd6..6a4cf7e4a7cc 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -611,6 +611,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> folio = page_folio(page);
>> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>
>> + /*
>> + * TODO: In some cases of partially-mapped folios, we'd actually
>> + * want to collapse.
>> + */
> Not a fan of adding todo's in code, they have a habit of being left forever. I'd
> maybe put a more written out comment something similar to the commit message.
I had suggested adding this in https://lore.kernel.org/all/20250211111326.14295-10-dev.jain@arm.com/
from the get-go, but then we decided to leave it for later. So rest assured this TODO won't
be left forever : )
>
>> + if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
>> + result = SCAN_PTE_MAPPED_HUGEPAGE;
>> + goto out;
>> + }
>> +
>> /* See collapse_scan_pmd(). */
>> if (folio_maybe_mapped_shared(folio)) {
>> ++shared;
>> --
>> 2.50.1
>>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 09/13] khugepaged: enable collapsing mTHPs even when PMD THPs are disabled
2025-08-19 13:42 ` [PATCH v10 09/13] khugepaged: enable collapsing mTHPs even when PMD THPs are disabled Nico Pache
@ 2025-08-21 13:35 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 13:35 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:42:01AM -0600, Nico Pache wrote:
> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> We have now allowed mTHP collapse, but thp_vma_allowable_order() still only
> checks if the PMD-sized mTHP is allowed to collapse. This prevents scanning
> and collapsing of 64K mTHP when only 64K mTHP is enabled. Thus, we should
> modify the checks to allow all large orders of anonymous mTHP.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7d9b5100bea1..2cadd07341de 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -491,7 +491,11 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> {
> if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> hugepage_pmd_enabled()) {
> - if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> + unsigned long orders = vma_is_anonymous(vma) ?
> + THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
We need some explanation here please, a comment explaining what's going on here
would go a long way.
> +
> + if (thp_vma_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED,
> + orders))
> __khugepaged_enter(vma->vm_mm);
> }
> }
> @@ -2671,6 +2675,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>
> vma_iter_init(&vmi, mm, khugepaged_scan.address);
> for_each_vma(vmi, vma) {
> + unsigned long orders = vma_is_anonymous(vma) ?
> + THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
Can we have this as a separate helper function please? As you're now open-coding
this in two places.
In fact, you can put the comment I mention above there and have that document
what's happening here.
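Something like the below, say (sketch only, name made up):

	static unsigned long collapse_candidate_orders(struct vm_area_struct *vma)
	{
		/*
		 * Anonymous VMAs may collapse to any enabled anon order;
		 * file/shmem collapse is still PMD order only.
		 */
		return vma_is_anonymous(vma) ? THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
	}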
> unsigned long hstart, hend;
>
> cond_resched();
> @@ -2678,7 +2684,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> progress++;
> break;
> }
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> + if (!thp_vma_allowable_orders(vma, vma->vm_flags,
> + TVA_KHUGEPAGED, orders)) {
> skip:
> progress++;
> continue;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-21 3:41 ` Wei Yang
@ 2025-08-21 14:09 ` Zi Yan
2025-08-22 10:25 ` Lorenzo Stoakes
0 siblings, 1 reply; 75+ messages in thread
From: Zi Yan @ 2025-08-21 14:09 UTC (permalink / raw)
To: Wei Yang
Cc: Nico Pache, Lorenzo Stoakes, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 20 Aug 2025, at 23:41, Wei Yang wrote:
> On Wed, Aug 20, 2025 at 09:40:40AM -0600, Nico Pache wrote:
> [...]
>>>
>>>> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>>>> return SCAN_ADDRESS_RANGE;
>>>> - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>>>> + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
>>>> return SCAN_VMA_CHECK;
>>>> /*
>>>> * Anon VMA expected, the address may be unmapped then
>>>> @@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>> goto out_nolock;
>>>>
>>>> mmap_read_lock(mm);
>>>> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>>>> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>>> + BIT(HPAGE_PMD_ORDER));
>>>
>>> Shouldn't this be PMD order? Seems equivalent.
>> Yeah i'm actually not sure why we have both... they seem to be the
>> same thing, but perhaps there is some reason for having two...
>
> I am confused with these two, PMD_ORDER above and HPAGE_PMD_ORDER from here.
>
> Do we have a guide on when to use which?
Looking at the definition of HPAGE_PMD_SHIFT in huge_mm.h, it will cause a
build bug when PMD-level huge pages are not supported. So I think
HPAGE_PMD_ORDER should be used for all huge pages (both THP and hugetlb).
>
>>>
>>>> if (result != SCAN_SUCCEED) {
>>>> mmap_read_unlock(mm);
>>>> goto out_nolock;
>>>> @@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>> * mmap_lock.
>>>> */
>>>> mmap_write_lock(mm);
>>>> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>>>> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>>> + BIT(HPAGE_PMD_ORDER));
>>>> if (result != SCAN_SUCCEED)
>>>> goto out_up_write;
>>>> /* check if the pmd is still valid */
>>>> @@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>> mmap_read_lock(mm);
>>>> mmap_locked = true;
>>>> result = hugepage_vma_revalidate(mm, addr, false, &vma,
>>>> - cc);
>>>> + cc, BIT(HPAGE_PMD_ORDER));
>>>> if (result != SCAN_SUCCEED) {
>>>> last_fail = result;
>>>> goto out_nolock;
>>>> --
>>>> 2.50.1
>>>>
>>>
>>
>
> --
> Wei Yang
> Help you, Help me
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
2025-08-19 13:42 ` [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
@ 2025-08-21 14:18 ` Lorenzo Stoakes
2025-08-21 14:26 ` Lorenzo Stoakes
2025-08-22 6:59 ` Baolin Wang
0 siblings, 2 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 14:18 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:42:02AM -0600, Nico Pache wrote:
> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> When only non-PMD-sized mTHP is enabled (such as only 64K mTHP enabled),
I don't think this example is very useful, probably just remove it.
Also 'non-PMD-sized mTHP' implies there is such a thing as PMD-sized mTHP :)
> we should also allow kicking khugepaged to attempt scanning and collapsing
What is kicking? I think this should be rephrased to something like 'we should
also allow khugepaged to attempt scanning...'
> 64K mTHP. Modify hugepage_pmd_enabled() to support mTHP collapse, and
64K mTHP -> "of mTHP ranges". Put the 'Modify...' bit in a new paragraph to
be clear.
> while we are at it, rename it to make the function name more clear.
To make this clearer let me suggest:
In order for khugepaged to operate when only mTHP sizes are
specified in sysfs, we must modify the predicate function that
determines whether it ought to run to do so.
This function is currently called hugepage_pmd_enabled(), this
patch renames it to hugepage_enabled() and updates the logic to
check to determine whether any valid orders may exist which would
justify khugepaged running.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2cadd07341de..81d2ffd56ab9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -430,7 +430,7 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
> mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
> }
>
> -static bool hugepage_pmd_enabled(void)
> +static bool hugepage_enabled(void)
> {
> /*
> * We cover the anon, shmem and the file-backed case here; file-backed
> @@ -442,11 +442,11 @@ static bool hugepage_pmd_enabled(void)
The comment above this still references PMD-sized, please make sure to update
comments when you change the described behaviour, as it is now incorrect:
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
* Anon pmd-sized hugepages are determined by the pmd-size control.
* Shmem pmd-sized hugepages are also determined by its pmd-size control,
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
Please correct this.
> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> hugepage_global_enabled())
> return true;
> - if (test_bit(PMD_ORDER, &huge_anon_orders_always))
> + if (READ_ONCE(huge_anon_orders_always))
> return true;
> - if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
> + if (READ_ONCE(huge_anon_orders_madvise))
> return true;
> - if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
> + if (READ_ONCE(huge_anon_orders_inherit) &&
> hugepage_global_enabled())
I guess READ_ONCE() is probably sufficient here, as memory ordering isn't
important, right?
> return true;
> if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
> @@ -490,7 +490,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> vm_flags_t vm_flags)
> {
> if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> - hugepage_pmd_enabled()) {
> + hugepage_enabled()) {
> unsigned long orders = vma_is_anonymous(vma) ?
> THP_ORDERS_ALL_ANON : BIT(PMD_ORDER);
>
> @@ -2762,7 +2762,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
>
> static int khugepaged_has_work(void)
> {
> - return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
> + return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
> }
>
> static int khugepaged_wait_event(void)
> @@ -2835,7 +2835,7 @@ static void khugepaged_wait_work(void)
> return;
> }
>
> - if (hugepage_pmd_enabled())
> + if (hugepage_enabled())
> wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
> }
>
> @@ -2866,7 +2866,7 @@ static void set_recommended_min_free_kbytes(void)
> int nr_zones = 0;
> unsigned long recommended_min;
>
> - if (!hugepage_pmd_enabled()) {
> + if (!hugepage_enabled()) {
> calculate_min_free_kbytes();
> goto update_wmarks;
> }
> @@ -2916,7 +2916,7 @@ int start_stop_khugepaged(void)
> int err = 0;
>
> mutex_lock(&khugepaged_mutex);
> - if (hugepage_pmd_enabled()) {
> + if (hugepage_enabled()) {
> if (!khugepaged_thread)
> khugepaged_thread = kthread_run(khugepaged, NULL,
> "khugepaged");
> @@ -2942,7 +2942,7 @@ int start_stop_khugepaged(void)
> void khugepaged_min_free_kbytes_update(void)
> {
> mutex_lock(&khugepaged_mutex);
> - if (hugepage_pmd_enabled() && khugepaged_thread)
> + if (hugepage_enabled() && khugepaged_thread)
> set_recommended_min_free_kbytes();
> mutex_unlock(&khugepaged_mutex);
> }
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 11/13] khugepaged: improve tracepoints for mTHP orders
2025-08-19 13:42 ` [PATCH v10 11/13] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2025-08-21 14:24 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 14:24 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 07:42:03AM -0600, Nico Pache wrote:
> Add the order to the tracepoints to give better insight into what order
> is being operated at for khugepaged.
NIT: Would be good to list the tracepoints you changed here.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
LGTM to me, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
> mm/khugepaged.c | 10 +++++----
> 2 files changed, 28 insertions(+), 16 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 2305df6cb485..56aa8c3b011b 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -92,34 +92,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
>
> TRACE_EVENT(mm_collapse_huge_page,
>
> - TP_PROTO(struct mm_struct *mm, int isolated, int status),
> + TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
>
> - TP_ARGS(mm, isolated, status),
> + TP_ARGS(mm, isolated, status, order),
>
> TP_STRUCT__entry(
> __field(struct mm_struct *, mm)
> __field(int, isolated)
> __field(int, status)
> + __field(unsigned int, order)
> ),
>
> TP_fast_assign(
> __entry->mm = mm;
> __entry->isolated = isolated;
> __entry->status = status;
> + __entry->order = order;
> ),
>
> - TP_printk("mm=%p, isolated=%d, status=%s",
> + TP_printk("mm=%p, isolated=%d, status=%s order=%u",
> __entry->mm,
> __entry->isolated,
> - __print_symbolic(__entry->status, SCAN_STATUS))
> + __print_symbolic(__entry->status, SCAN_STATUS),
> + __entry->order)
> );
>
> TRACE_EVENT(mm_collapse_huge_page_isolate,
>
> TP_PROTO(struct folio *folio, int none_or_zero,
> - int referenced, bool writable, int status),
> + int referenced, bool writable, int status, unsigned int order),
>
> - TP_ARGS(folio, none_or_zero, referenced, writable, status),
> + TP_ARGS(folio, none_or_zero, referenced, writable, status, order),
>
> TP_STRUCT__entry(
> __field(unsigned long, pfn)
> @@ -127,6 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
> __field(int, referenced)
> __field(bool, writable)
> __field(int, status)
> + __field(unsigned int, order)
> ),
>
> TP_fast_assign(
> @@ -135,27 +139,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
> __entry->referenced = referenced;
> __entry->writable = writable;
> __entry->status = status;
> + __entry->order = order;
> ),
>
> - TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
> + TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s order=%u",
> __entry->pfn,
> __entry->none_or_zero,
> __entry->referenced,
> __entry->writable,
> - __print_symbolic(__entry->status, SCAN_STATUS))
> + __print_symbolic(__entry->status, SCAN_STATUS),
> + __entry->order)
> );
>
> TRACE_EVENT(mm_collapse_huge_page_swapin,
>
> - TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
> + TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
> + unsigned int order),
>
> - TP_ARGS(mm, swapped_in, referenced, ret),
> + TP_ARGS(mm, swapped_in, referenced, ret, order),
>
> TP_STRUCT__entry(
> __field(struct mm_struct *, mm)
> __field(int, swapped_in)
> __field(int, referenced)
> __field(int, ret)
> + __field(unsigned int, order)
> ),
>
> TP_fast_assign(
> @@ -163,13 +171,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> __entry->swapped_in = swapped_in;
> __entry->referenced = referenced;
> __entry->ret = ret;
> + __entry->order = order;
> ),
>
> - TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
> + TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
> __entry->mm,
> __entry->swapped_in,
> __entry->referenced,
> - __entry->ret)
> + __entry->ret,
> + __entry->order)
> );
>
> TRACE_EVENT(mm_khugepaged_scan_file,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 81d2ffd56ab9..c13bc583a368 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -721,13 +721,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> } else {
> result = SCAN_SUCCEED;
> trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> - referenced, writable, result);
> + referenced, writable, result,
> + order);
> return result;
> }
> out:
> release_pte_pages(pte, _pte, compound_pagelist);
> trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> - referenced, writable, result);
> + referenced, writable, result, order);
> return result;
> }
>
> @@ -1123,7 +1124,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>
> result = SCAN_SUCCEED;
> out:
> - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
> + trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
> + order);
> return result;
> }
>
> @@ -1348,7 +1350,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> *mmap_locked = false;
> if (folio)
> folio_put(folio);
> - trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> + trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
> return result;
> }
>
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
2025-08-21 14:18 ` Lorenzo Stoakes
@ 2025-08-21 14:26 ` Lorenzo Stoakes
2025-08-22 6:59 ` Baolin Wang
1 sibling, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 14:26 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
Oh and on this one I forgot to say - this seems to be where we're effectively
turning the whole thing on, right?
I think it's really worth underlining that in the commit message :)
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats
2025-08-19 14:16 ` [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats Nico Pache
@ 2025-08-21 14:47 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 14:47 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Tue, Aug 19, 2025 at 08:16:10AM -0600, Nico Pache wrote:
> With mTHP support inplace, let add the per-order mTHP stats for
> exceeding NONE, SWAP, and SHARED.
>
This is really not enough of a commit message. Exceeding what, where, why,
how? What does 'exceeding' mean here, etc. etc. More words please :)
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
> include/linux/huge_mm.h | 3 +++
> mm/huge_memory.c | 7 +++++++
> mm/khugepaged.c | 16 +++++++++++++---
> 4 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 7ccb93e22852..b85547ac4fe9 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -705,6 +705,23 @@ nr_anon_partially_mapped
> an anonymous THP as "partially mapped" and count it here, even though it
> is not actually partially mapped anymore.
>
> +collapse_exceed_swap_pte
> + The number of anonymous THP which contain at least one swap PTE.
The number of anonymous THP what? Pages? Let's be specific.
> + Currently khugepaged does not support collapsing mTHP regions that
> + contain a swap PTE.
Wait what? So we have a counter for something that's unsupported? That
seems not so useful?
> +
> +collapse_exceed_none_pte
> + The number of anonymous THP which have exceeded the none PTE threshold.
THP pages. What's the 'none PTE threshold'? Do you mean
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none ?
Let's spell that out please, this is far too vague.
> + With mTHP collapse, a bitmap is used to gather the state of a PMD region
> + and is then recursively checked from largest to smallest order against
> + the scaled max_ptes_none count. This counter indicates that the next
> + enabled order will be checked.
I think you really need to expand upon this as this is confusing and vague.
I also don't think saying 'recursive' here really benefits anything. Just say
that we try to collapse the largest mTHP size we can in each instance, and then
give a more 'words-y' explanation of how max_ptes_none is (in effect) converted
to a ratio of a PMD and how that ratio is then applied to the mTHP sizes.
You can then go on to say that this counter measures the number of
occasions in which this occurred.
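(If I'm reading the series right, the scaling is effectively:

	scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

so with the default of 511 on 4K pages an order-4 (64K) collapse tolerates up
to 511 >> 5 = 15 empty PTEs out of 16. That's the kind of concrete detail this
documentation should spell out.)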
> +
> +collapse_exceed_shared_pte
> + The number of anonymous THP which contain at least one shared PTE.
anonymous THP pages right? :)
> + Currently khugepaged does not support collapsing mTHP regions that
> + contain a shared PTE.
Again I don't really understand the purpose of creating a counter for
something we don't support.
Let's add it when we support it.
In both this case and the exceed-swap case, I also don't understand what you
mean by 'exceed' here; you need to spell this out clearly.
Perhaps the context missing here is that you _also_ count THP events in
these counters.
But again, given we have THP_... counters for the stats mTHP doesn't do
yet, I'd say adding these is pointless.
> +
> As the system ages, allocating huge pages may be expensive as the
> system uses memory compaction to copy data around memory to free a
> huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4ada5d1f7297..6f1593d0b4b5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
> MTHP_STAT_SPLIT_DEFERRED,
> MTHP_STAT_NR_ANON,
> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> + MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> + MTHP_STAT_COLLAPSE_EXCEED_NONE,
> + MTHP_STAT_COLLAPSE_EXCEED_SHARED,
Why do we put 'collapse' here but not in the THP equivalents?
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 20d005c2c61f..9f0470c3e983 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -639,6 +639,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +
>
> static struct attribute *anon_stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> @@ -655,6 +659,9 @@ static struct attribute *anon_stats_attrs[] = {
> &split_deferred_attr.attr,
> &nr_anon_attr.attr,
> &nr_anon_partially_mapped_attr.attr,
> + &collapse_exceed_swap_pte_attr.attr,
> + &collapse_exceed_none_pte_attr.attr,
> + &collapse_exceed_shared_pte_attr.attr,
> NULL,
> };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c13bc583a368..5a3386043f39 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -594,7 +594,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
Hm, so wait, you were miscounting statistics in patch 10/13 when you turned
all this on? That's not good.
This should be in place _first_ before enabling the feature.
> + if (order == HPAGE_PMD_ORDER)
> + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> goto out;
> }
> }
> @@ -633,10 +635,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> * shared may cause a future higher order collapse on a
> * rescan of the same range.
> */
> - if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared)) {
> + if (order != HPAGE_PMD_ORDER) {
Hm, wait, what? I don't understand what's going on here. You're no longer
actually doing any check except order != HPAGE_PMD_ORDER?... am I missing
something?
Again, I don't know why we are bothering to maintain a counter that doesn't
mean anything. I may be misinterpreting somehow, however.
> + result = SCAN_EXCEED_SHARED_PTE;
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> + goto out;
> + }
> +
> + if (cc->is_khugepaged &&
> + shared > khugepaged_max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> goto out;
> }
> }
> @@ -1084,6 +1093,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> * range.
> */
> if (order != HPAGE_PMD_ORDER) {
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
This again surely seems not to be testing for what it claims to be
tracking? I may again be missing context here.
> pte_unmap(pte);
> mmap_read_unlock(mm);
> result = SCAN_EXCEED_SWAP_PTE;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (13 preceding siblings ...)
2025-08-19 21:55 ` [PATCH v10 00/13] khugepaged: mTHP support Andrew Morton
@ 2025-08-21 15:01 ` Lorenzo Stoakes
2025-08-21 15:13 ` Dev Jain
2025-09-01 16:21 ` David Hildenbrand
15 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 15:01 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
OK so I noticed in patch 13/13 (!) where you change the documentation that you
essentially state that the whole method used to determine the ratio of PTEs to
collapse to mTHP is broken:
khugepaged uses max_ptes_none scaled to the order of the enabled
mTHP size to determine collapses. When using mTHPs it's recommended
to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
on 4k page size). This will prevent undesired "creep" behavior that
leads to continuously collapsing to the largest mTHP size; when we
collapse, we are bringing in new non-zero pages that will, on a
subsequent scan, cause the max_ptes_none check of the +1 order to
always be satisfied. By limiting this to less than half the current
order, we make sure we don't cause this feedback
loop. max_ptes_shared and max_ptes_swap have no effect when
collapsing to a mTHP, and mTHP collapse will fail on shared or
swapped out pages.
This seems to me to suggest that using
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
of establishing a 'ratio' to do this calculation is fundamentally flawed.
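To spell out the arithmetic as I understand it (4K pages): the scaled threshold
at order N is max_ptes_none >> (HPAGE_PMD_ORDER - N), and with the default
max_ptes_none = 511 that is always 2^N - 1, i.e. 'all but one PTE may be empty'.
Once an order-N region has been collapsed it contributes 2^N populated PTEs, so
on the next scan the enclosing order-(N + 1) region has at most 2^N empty PTEs,
comfortably within its threshold of 2^(N + 1) - 1, and it gets collapsed again,
all the way up to PMD order. Only with max_ptes_none below HPAGE_PMD_NR / 2 does
a half-empty region fail the next order's check, stopping the creep.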
So surely we ought to introduce a new sysfs tunable for this? Perhaps
/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
Or something like this?
It's already questionable that we are taking a value that is expressed
essentially in terms of PTE entries per PMD and then using it implicitly to
determine the ratio for mTHP, but to then say 'oh, but the default value is
known-broken' is just a blocker for the series in my opinion.
This really has to be done a different way I think.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 13/13] Documentation: mm: update the admin guide for mTHP collapse
2025-08-19 14:17 ` [PATCH v10 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2025-08-21 15:03 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 15:03 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd, Bagas Sanjaya
On Tue, Aug 19, 2025 at 08:17:42AM -0600, Nico Pache wrote:
> Now that we can collapse to mTHPs lets update the admin guide to
> reflect these changes and provide proper guidence on how to utilize it.
>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index b85547ac4fe9..1f9e6a32052c 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,7 @@ often.
> THP can be enabled system wide or restricted to certain tasks or even
> memory ranges inside task's address space. Unless THP is completely
> disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages.
Maybe worth saying 'of either PMD size or mTHP sizes, if the system is
configured to do so.' to really spell it out.
>
> The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
> interface and using madvise(2) and prctl(2) system calls.
> @@ -149,6 +149,18 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
> sizes, the kernel will select the most appropriate enabled size for a
> given allocation.
>
> +khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size
> +to determine collapses. When using mTHPs it's recommended to set
> +max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 on 4k page
> +size). This will prevent undesired "creep" behavior that leads to
Woah wait what??
OK, sorry, but this is crazy - I've sent a reply to the 00/13 patch as this is
really concerning; it suggests to me that this sysctl is completely broken for
this purpose.
I think the series has to be altered to introduce an mTHP-specific ratio;
anyway, let's discuss that on the 00/13 reply thread please.
> +continuously collapsing to the largest mTHP size; when we collapse, we are
> +bringing in new non-zero pages that will, on a subsequent scan, cause the
> +max_ptes_none check of the +1 order to always be satisfied. By limiting
> +this to less than half the current order, we make sure we don't cause this
> +feedback loop. max_ptes_shared and max_ptes_swap have no effect when
> +collapsing to a mTHP, and mTHP collapse will fail on shared or swapped out
> +pages.
In general (though I actually want you to fundamentally change your approach so
you don't need to document this at all), this whole sentence is far too dense.
It needs paragraphs and some text around it so it's easier to read; right now
it's really quite difficult.
> +
> It's also possible to limit defrag efforts in the VM to generate
> anonymous hugepages in case they're not immediately free to madvise
> regions or to never try to defrag memory and simply fallback to regular
> @@ -264,11 +276,6 @@ support the following arguments::
> Khugepaged controls
> -------------------
>
> -.. note::
> - khugepaged currently only searches for opportunities to collapse to
> - PMD-sized THP and no attempt is made to collapse to other THP
> - sizes.
> -
> khugepaged runs usually at low frequency so while one may not want to
> invoke defrag algorithms synchronously during the page faults, it
> should be worth invoking defrag at least in khugepaged. However it's
> --
> 2.50.1
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:01 ` Lorenzo Stoakes
@ 2025-08-21 15:13 ` Dev Jain
2025-08-21 15:19 ` Lorenzo Stoakes
2025-08-21 16:38 ` Liam R. Howlett
0 siblings, 2 replies; 75+ messages in thread
From: Dev Jain @ 2025-08-21 15:13 UTC (permalink / raw)
To: Lorenzo Stoakes, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> OK so I noticed in patch 13/13 (!) where you change the documentation that you
> essentially state that the whole method used to determine the ratio of PTEs to
> collapse to mTHP is broken:
>
> khugepaged uses max_ptes_none scaled to the order of the enabled
> mTHP size to determine collapses. When using mTHPs it's recommended
> to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> on 4k page size). This will prevent undesired "creep" behavior that
> leads to continuously collapsing to the largest mTHP size; when we
> collapse, we are bringing in new non-zero pages that will, on a
> subsequent scan, cause the max_ptes_none check of the +1 order to
> always be satisfied. By limiting this to less than half the current
> order, we make sure we don't cause this feedback
> loop. max_ptes_shared and max_ptes_swap have no effect when
> collapsing to a mTHP, and mTHP collapse will fail on shared or
> swapped out pages.
>
> This seems to me to suggest that using
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> of establishing a 'ratio' to do this calculation is fundamentally flawed.
>
> So surely we ought to introduce a new sysfs tunable for this? Perhaps
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
>
> Or something like this?
>
> It's already questionable that we are taking a value that is expressed
> essentially in terms of PTE entries per PMD and then use it implicitly to
> determine the ratio for mTHP, but to then say 'oh but the default value is
> known-broken' is just a blocker for the series in my opinion.
>
> This really has to be done a different way I think.
>
> Cheers, Lorenzo
FWIW this was my version of the documentation patch:
https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
The discussion about the creep problem started here:
https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
and the discussion continuing here:
https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
ending with a summary I gave here:
https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
This should help you with the context.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:13 ` Dev Jain
@ 2025-08-21 15:19 ` Lorenzo Stoakes
2025-08-21 15:25 ` Nico Pache
2025-08-21 16:38 ` Liam R. Howlett
1 sibling, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 15:19 UTC (permalink / raw)
To: Dev Jain
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote:
>
> On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > OK so I noticed in patch 13/13 (!) where you change the documentation that you
> > essentially state that the whole method used to determine the ratio of PTEs to
> > collapse to mTHP is broken:
> >
> > khugepaged uses max_ptes_none scaled to the order of the enabled
> > mTHP size to determine collapses. When using mTHPs it's recommended
> > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> > on 4k page size). This will prevent undesired "creep" behavior that
> > leads to continuously collapsing to the largest mTHP size; when we
> > collapse, we are bringing in new non-zero pages that will, on a
> > subsequent scan, cause the max_ptes_none check of the +1 order to
> > always be satisfied. By limiting this to less than half the current
> > order, we make sure we don't cause this feedback
> > loop. max_ptes_shared and max_ptes_swap have no effect when
> > collapsing to a mTHP, and mTHP collapse will fail on shared or
> > swapped out pages.
> >
> > This seems to me to suggest that using
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> > of establishing a 'ratio' to do this calculation is fundamentally flawed.
> >
> > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> >
> > Or something like this?
> >
> > It's already questionable that we are taking a value that is expressed
> > essentially in terms of PTE entries per PMD and then use it implicitly to
> > determine the ratio for mTHP, but to then say 'oh but the default value is
> > known-broken' is just a blocker for the series in my opinion.
> >
> > This really has to be done a different way I think.
> >
> > Cheers, Lorenzo
>
> FWIW this was my version of the documentation patch:
> https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
>
> The discussion about the creep problem started here:
> https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
>
> and the discussion continuing here:
> https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
>
> ending with a summary I gave here:
> https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
>
> This should help you with the context.
>
>
Thanks and I"ll have a look, but this series is unmergeable with a broken
default in
/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
sorry.
We need to have a new tunable as far as I can tell. I also find the use of
this PMD-specific value as an arbitrary way of expressing a ratio pretty
gross.
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:19 ` Lorenzo Stoakes
@ 2025-08-21 15:25 ` Nico Pache
2025-08-21 15:27 ` Nico Pache
0 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-21 15:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote:
> >
> > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > > OK so I noticed in patch 13/13 (!) where you change the documentation that you
> > > essentially state that the whole method used to determine the ratio of PTEs to
> > > collapse to mTHP is broken:
> > >
> > > khugepaged uses max_ptes_none scaled to the order of the enabled
> > > mTHP size to determine collapses. When using mTHPs it's recommended
> > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> > > on 4k page size). This will prevent undesired "creep" behavior that
> > > leads to continuously collapsing to the largest mTHP size; when we
> > > collapse, we are bringing in new non-zero pages that will, on a
> > > subsequent scan, cause the max_ptes_none check of the +1 order to
> > > always be satisfied. By limiting this to less than half the current
> > > order, we make sure we don't cause this feedback
> > > loop. max_ptes_shared and max_ptes_swap have no effect when
> > > collapsing to a mTHP, and mTHP collapse will fail on shared or
> > > swapped out pages.
> > >
> > > This seems to me to suggest that using
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> > > of establishing a 'ratio' to do this calculation is fundamentally flawed.
> > >
> > > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> > >
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > >
> > > Or something like this?
> > >
> > > It's already questionable that we are taking a value that is expressed
> > > essentially in terms of PTE entries per PMD and then use it implicitly to
> > > determine the ratio for mTHP, but to then say 'oh but the default value is
> > > known-broken' is just a blocker for the series in my opinion.
> > >
> > > This really has to be done a different way I think.
> > >
> > > Cheers, Lorenzo
> >
> > FWIW this was my version of the documentation patch:
> > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
> >
> > The discussion about the creep problem started here:
> > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
> >
> > and the discussion continuing here:
> > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
> >
> > ending with a summary I gave here:
> > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
> >
> > This should help you with the context.
> >
> >
>
> Thanks and I"ll have a look, but this series is unmergeable with a broken
> default in
> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> sorry.
>
> We need to have a new tunable as far as I can tell. I also find the use of
> this PMD-specific value as an arbitrary way of expressing a ratio pretty
> gross.
The first thing that comes to mind is that we can pin max_ptes_none to
255 if it exceeds 255. It's worth noting that the issue occurs only
for adjacently enabled mTHP sizes.
i.e. something like:

	if (order != HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255)
		temp_max_ptes_none = 255;
>
> Thanks, Lorenzo
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:25 ` Nico Pache
@ 2025-08-21 15:27 ` Nico Pache
2025-08-21 15:32 ` Lorenzo Stoakes
0 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-21 15:27 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote:
>
> On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote:
> > >
> > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you
> > > > essentially state that the whole method used to determine the ratio of PTEs to
> > > > collapse to mTHP is broken:
> > > >
> > > > khugepaged uses max_ptes_none scaled to the order of the enabled
> > > > mTHP size to determine collapses. When using mTHPs it's recommended
> > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> > > > on 4k page size). This will prevent undesired "creep" behavior that
> > > > leads to continuously collapsing to the largest mTHP size; when we
> > > > collapse, we are bringing in new non-zero pages that will, on a
> > > > subsequent scan, cause the max_ptes_none check of the +1 order to
> > > > always be satisfied. By limiting this to less than half the current
> > > > order, we make sure we don't cause this feedback
> > > > loop. max_ptes_shared and max_ptes_swap have no effect when
> > > > collapsing to a mTHP, and mTHP collapse will fail on shared or
> > > > swapped out pages.
> > > >
> > > > This seems to me to suggest that using
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> > > > of establishing a 'ratio' to do this calculation is fundamentally flawed.
> > > >
> > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > >
> > > > Or something like this?
> > > >
> > > > It's already questionable that we are taking a value that is expressed
> > > > essentially in terms of PTE entries per PMD and then use it implicitly to
> > > > determine the ratio for mTHP, but to then say 'oh but the default value is
> > > > known-broken' is just a blocker for the series in my opinion.
> > > >
> > > > This really has to be done a different way I think.
> > > >
> > > > Cheers, Lorenzo
> > >
> > > FWIW this was my version of the documentation patch:
> > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
> > >
> > > The discussion about the creep problem started here:
> > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
> > >
> > > and the discussion continuing here:
> > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
> > >
> > > ending with a summary I gave here:
> > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
> > >
> > > This should help you with the context.
> > >
> > >
> >
> > Thanks and I"ll have a look, but this series is unmergeable with a broken
> > default in
> > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > sorry.
> >
> > We need to have a new tunable as far as I can tell. I also find the use of
> > this PMD-specific value as an arbitrary way of expressing a ratio pretty
> > gross.
> The first thing that comes to mind is that we can pin max_ptes_none to
> 255 if it exceeds 255. It's worth noting that the issue occurs only
> for adjacently enabled mTHP sizes.
>
> ie)
> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
> temp_max_ptes_none = 255;
Oh, and my second point: introducing a new tunable to control mTHP
collapse may become exceedingly complex from a tuning and code-management
standpoint.
> >
> > Thanks, Lorenzo
> >
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:27 ` Nico Pache
@ 2025-08-21 15:32 ` Lorenzo Stoakes
2025-08-21 16:46 ` Nico Pache
0 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 15:32 UTC (permalink / raw)
To: Nico Pache
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 09:27:19AM -0600, Nico Pache wrote:
> On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote:
> >
> > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote:
> > > >
> > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you
> > > > > essentially state that the whole method used to determine the ratio of PTEs to
> > > > > collapse to mTHP is broken:
> > > > >
> > > > > khugepaged uses max_ptes_none scaled to the order of the enabled
> > > > > mTHP size to determine collapses. When using mTHPs it's recommended
> > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> > > > > on 4k page size). This will prevent undesired "creep" behavior that
> > > > > leads to continuously collapsing to the largest mTHP size; when we
> > > > > collapse, we are bringing in new non-zero pages that will, on a
> > > > > subsequent scan, cause the max_ptes_none check of the +1 order to
> > > > > always be satisfied. By limiting this to less than half the current
> > > > > order, we make sure we don't cause this feedback
> > > > > loop. max_ptes_shared and max_ptes_swap have no effect when
> > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or
> > > > > swapped out pages.
> > > > >
> > > > > This seems to me to suggest that using
> > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed.
> > > > >
> > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> > > > >
> > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > > >
> > > > > Or something like this?
> > > > >
> > > > > It's already questionable that we are taking a value that is expressed
> > > > > essentially in terms of PTE entries per PMD and then use it implicitly to
> > > > > determine the ratio for mTHP, but to then say 'oh but the default value is
> > > > > known-broken' is just a blocker for the series in my opinion.
> > > > >
> > > > > This really has to be done a different way I think.
> > > > >
> > > > > Cheers, Lorenzo
> > > >
> > > > FWIW this was my version of the documentation patch:
> > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
> > > >
> > > > The discussion about the creep problem started here:
> > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
> > > >
> > > > and the discussion continuing here:
> > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
> > > >
> > > > ending with a summary I gave here:
> > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
> > > >
> > > > This should help you with the context.
> > > >
> > > >
> > >
> > > Thanks and I'll have a look, but this series is unmergeable with a broken
> > > default in
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > sorry.
> > >
> > > We need to have a new tunable as far as I can tell. I also find the use of
> > > this PMD-specific value as an arbitrary way of expressing a ratio pretty
> > > gross.
> > The first thing that comes to mind is that we can pin max_ptes_none to
> > 255 if it exceeds 255. It's worth noting that the issue occurs only
> > for adjacently enabled mTHP sizes.
No! Presumably the default of 511 (for PMDs with 512 entries) is set for a
reason; arbitrarily changing it to suit a specific case seems crazy, no?
> >
> > ie)
> > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
> > temp_max_ptes_none = 255;
> Oh and my second point, introducing a new tunable to control mTHP
> collapse may become exceedingly complex from a tuning and code
> management standpoint.
Umm, right now you have a ratio expressed as PTEs per mTHP * ((PTEs per PMD) /
PMD), 'except please don't set it to the usual default when using mTHP', and
it's currently default-broken.
I'm really not sure how that is simpler than a separate tunable that can be
expressed as a ratio (e.g. a percentage) that actually makes some kind of sense?
And we can make anything workable from a code management point of view by
refactoring/developing appropriately.
And given you're now proposing changing the default even for THP pages with a
cap, or perhaps having mTHP usage silently change the cap - that is clearly
_far_ worse from a tuning standpoint.
With a new tunable you can just set a sensible default and people don't even
necessarily have to think about it.
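For illustration only, here is a minimal sketch of what a separate
percentage-based tunable could compute per order. The tunable name just
follows the suggestion above and the helper is hypothetical; none of
this is the series' actual code:

	/*
	 * Hypothetical: a ratio tunable (0-100, percent) applied per
	 * collapse order, instead of reusing the PMD-centric
	 * max_ptes_none value.
	 */
	static unsigned int mthp_max_ptes_none_ratio = 50;

	static unsigned int max_ptes_none_for_order(unsigned int order)
	{
		unsigned int nr_ptes = 1U << order;

		/* Allow at most ratio% of the PTEs in the range to be none. */
		return nr_ptes * mthp_max_ptes_none_ratio / 100;
	}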
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:13 ` Dev Jain
2025-08-21 15:19 ` Lorenzo Stoakes
@ 2025-08-21 16:38 ` Liam R. Howlett
1 sibling, 0 replies; 75+ messages in thread
From: Liam R. Howlett @ 2025-08-21 16:38 UTC (permalink / raw)
To: Dev Jain
Cc: Lorenzo Stoakes, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, ziy, baolin.wang, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
* Dev Jain <dev.jain@arm.com> [250821 11:14]:
>
> On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > OK so I noticed in patch 13/13 (!) where you change the documentation that you
> > essentially state that the whole method used to determine the ratio of PTEs to
> > collapse to mTHP is broken:
> >
> > khugepaged uses max_ptes_none scaled to the order of the enabled
> > mTHP size to determine collapses. When using mTHPs it's recommended
> > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> > on 4k page size). This will prevent undesired "creep" behavior that
> > leads to continuously collapsing to the largest mTHP size; when we
> > collapse, we are bringing in new non-zero pages that will, on a
> > subsequent scan, cause the max_ptes_none check of the +1 order to
> > always be satisfied. By limiting this to less than half the current
> > order, we make sure we don't cause this feedback
> > loop. max_ptes_shared and max_ptes_swap have no effect when
> > collapsing to a mTHP, and mTHP collapse will fail on shared or
> > swapped out pages.
> >
> > This seems to me to suggest that using
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> > of establishing a 'ratio' to do this calculation is fundamentally flawed.
> >
> > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> >
> > Or something like this?
> >
> > It's already questionable that we are taking a value that is expressed
> > essentially in terms of PTE entries per PMD and then use it implicitly to
> > determine the ratio for mTHP, but to then say 'oh but the default value is
> > known-broken' is just a blocker for the series in my opinion.
> >
> > This really has to be done a different way I think.
> >
> > Cheers, Lorenzo
>
> FWIW this was my version of the documentation patch:
> https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
>
> The discussion about the creep problem started here:
> https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
>
> and the discussion continuing here:
> https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
>
> ending with a summary I gave here:
> https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
>
> This should help you with the context.
Thanks for hunting this down; the context should be referenced in the
change log so we can find it more easily in the future (and now). Or at
least in the cover letter.
The way the change log in the cover letter is written makes it
exceedingly long. Could you switch to listing the changes from v9 and
links to v1-8 (+RFCs if there are any)? Well, I guess include the v10
changes and the v1-9 URLs.
At the length it is now, it's most likely a tl;dr for most. If you're
starting to review this at v10, then you'd probably appreciate not
rehashing discussions, and if you're going from v9 then you already
have an idea of what v10 should have changed.
Said another way, the changelog is more useful with context, and
context is difficult to find without a lore link.
I am having trouble tracking down the context of many of the items
raised here, and it'll only get worse as time moves on. We do our best
to keep change logs with the necessary details, but having breadcrumbs
to follow is extremely helpful for review and in the long run.
Thanks,
Liam
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 15:32 ` Lorenzo Stoakes
@ 2025-08-21 16:46 ` Nico Pache
2025-08-21 16:54 ` Lorenzo Stoakes
0 siblings, 1 reply; 75+ messages in thread
From: Nico Pache @ 2025-08-21 16:46 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 9:40 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Aug 21, 2025 at 09:27:19AM -0600, Nico Pache wrote:
> > On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote:
> > > > >
> > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you
> > > > > > essentially state that the whole method used to determine the ratio of PTEs to
> > > > > > collapse to mTHP is broken:
> > > > > >
> > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled
> > > > > > mTHP size to determine collapses. When using mTHPs it's recommended
> > > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
> > > > > > on 4k page size). This will prevent undesired "creep" behavior that
> > > > > > leads to continuously collapsing to the largest mTHP size; when we
> > > > > > collapse, we are bringing in new non-zero pages that will, on a
> > > > > > subsequent scan, cause the max_ptes_none check of the +1 order to
> > > > > > always be satisfied. By limiting this to less than half the current
> > > > > > order, we make sure we don't cause this feedback
> > > > > > loop. max_ptes_shared and max_ptes_swap have no effect when
> > > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or
> > > > > > swapped out pages.
> > > > > >
> > > > > > This seems to me to suggest that using
> > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
> > > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed.
> > > > > >
> > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> > > > > >
> > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > > > >
> > > > > > Or something like this?
> > > > > >
> > > > > > It's already questionable that we are taking a value that is expressed
> > > > > > essentially in terms of PTE entries per PMD and then use it implicitly to
> > > > > > determine the ratio for mTHP, but to then say 'oh but the default value is
> > > > > > known-broken' is just a blocker for the series in my opinion.
> > > > > >
> > > > > > This really has to be done a different way I think.
> > > > > >
> > > > > > Cheers, Lorenzo
> > > > >
> > > > > FWIW this was my version of the documentation patch:
> > > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
> > > > >
> > > > > The discussion about the creep problem started here:
> > > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
> > > > >
> > > > > and the discussion continuing here:
> > > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
> > > > >
> > > > > ending with a summary I gave here:
> > > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
> > > > >
> > > > > This should help you with the context.
> > > > >
> > > > >
> > > >
> > > > Thanks and I'll have a look, but this series is unmergeable with a broken
> > > > default in
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > > sorry.
> > > >
> > > > We need to have a new tunable as far as I can tell. I also find the use of
> > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty
> > > > gross.
> > > The first thing that comes to mind is that we can pin max_ptes_none to
> > > 255 if it exceeds 255. It's worth noting that the issue occurs only
> > > for adjacently enabled mTHP sizes.
>
> No! Presumably the default of 511 (for PMDs with 512 entries) is set for a
> reason, arbitrarily changing this to suit a specific case seems crazy no?
We wouldn't be changing it for PMD collapse, just for the new
behavior. At 511, no mTHP collapses would ever occur anyway, unless
you have 2MB disabled and other mTHP sizes enabled. Technically, at
511 it's only ever the highest enabled order that gets collapsed.
I've also argued in the past that 511 is a terrible default for
anything other than thp.enabled=always, but that's a whole other can
of worms we don't need to discuss now.
With this cap of 255, the PMD scan/collapse would work as intended,
and mTHP collapses would never introduce this undesired behavior.
We've discussed before that this would be a hard problem to solve
without introducing some expensive way of tracking what has already
been through a collapse, and that doesn't even consider what happens
if things change or are unmapped and rescanning that section would be
helpful. So having a strictly enforced limit of 255 actually seems
like a good idea to me, as it completely avoids the undesired
behavior and does not require admins to be aware of such an issue.
Another thought, similar to what (IIRC) Dev has mentioned before: if
max_ptes_none > 255 then we only consider collapses to the largest
enabled order; that way no creep to the largest enabled order would
occur in the first place, and we would get there straight away.
To me, one of these two solutions seems sane in the context of what
we are dealing with.
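For clarity, a minimal sketch of the 255 cap being proposed here,
assuming the series scales the PMD-level max_ptes_none down to the
attempted order with a simple shift; the helper name is made up and
this is not the patch's actual code:

	/*
	 * Sketch only: clamp the effective max_ptes_none for non-PMD
	 * (mTHP) collapse so the scaled threshold stays below half the
	 * range, which is what enables the "creep" feedback loop.
	 */
	static unsigned int effective_max_ptes_none(unsigned int order)
	{
		unsigned int max_ptes_none = khugepaged_max_ptes_none;

		if (order != HPAGE_PMD_ORDER &&
		    max_ptes_none > HPAGE_PMD_NR / 2 - 1)
			max_ptes_none = HPAGE_PMD_NR / 2 - 1; /* 255 on 4K */

		/* Scale the PMD-level value down to the attempted order. */
		return max_ptes_none >> (HPAGE_PMD_ORDER - order);
	}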
>
> > >
> > > ie)
> > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
> > > temp_max_ptes_none = 255;
> > Oh and my second point, introducing a new tunable to control mTHP
> > collapse may become exceedingly complex from a tuning and code
> > management standpoint.
>
> Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) /
> PMD) 'except please don't set to the usual default when using mTHP' and it's
> currently default-broken.
>
> I'm really not sure how that is simpler than a seprate tunable that can be
> expressed as a ratio (e.g. percentage) that actually makes some kind of sense?
I agree that the current tunable wasn't designed for this, but we
tried to come up with something that leverages the tunable we have to
avoid new tunables and added complexity.
>
> And we can make anything workable from a code management point of view by
> refactoring/developing appropriately.
What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte
(ideally the max number)? It seems like we would be saying we want no
new none pages, but also that we allow new none pages. To me that
seems equally broken, and more confusing than just scaling the current
number (now with a cap).
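To make the conflict concrete, a small illustration; both helpers are
hypothetical, and the shift is only an assumption about how the series
scales max_ptes_none. With max_ptes_none = 0 the scaled threshold is 0
at every order (no new none pages allowed), while a ~50% ratio tunable
would simultaneously allow up to half the PTEs in the same range to be
none:

	static unsigned int scaled_max_ptes_none(unsigned int order)
	{
		/* presumed scaling of the existing PMD-level value */
		return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
	}

	static unsigned int ratio_max_ptes_none(unsigned int order)
	{
		/* hypothetical separate ratio tunable set to ~50% */
		return (1U << order) / 2;
	}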
-- Nico
>
> And given you're now proposing changing the default for even THP pages with a
> cap or perhaps having mTHP being used silently change the cap - that is clearly
> _far_ worse from a tuning standpoint.
>
> With a new tunable you can just set a sensible default and people don't even
> necessarily have to think about it.
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders
2025-08-21 12:05 ` Lorenzo Stoakes
2025-08-21 12:33 ` Dev Jain
@ 2025-08-21 16:54 ` Steven Rostedt
2025-08-21 16:56 ` Lorenzo Stoakes
1 sibling, 1 reply; 75+ messages in thread
From: Steven Rostedt @ 2025-08-21 16:54 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
corbet, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, 21 Aug 2025 13:05:42 +0100
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > + /*
> > + * TODO: In some cases of partially-mapped folios, we'd actually
> > + * want to collapse.
> > + */
>
> Not a fan of adding todo's in code, they have a habit of being left forever.
It's a way to make the developer more depressed by reminding them that they
will never be able to complete their TODO list :-p
Personally, I enjoy the torture of these comments in the code ;-)
-- Steve
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 16:46 ` Nico Pache
@ 2025-08-21 16:54 ` Lorenzo Stoakes
2025-08-21 17:26 ` David Hildenbrand
2025-08-21 20:43 ` David Hildenbrand
0 siblings, 2 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 16:54 UTC (permalink / raw)
To: Nico Pache
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote:
> > > > > Thanks and I'll have a look, but this series is unmergeable with a broken
> > > > > default in
> > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > > > sorry.
> > > > >
> > > > > We need to have a new tunable as far as I can tell. I also find the use of
> > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty
> > > > > gross.
> > > > The first thing that comes to mind is that we can pin max_ptes_none to
> > > > 255 if it exceeds 255. It's worth noting that the issue occurs only
> > > > for adjacently enabled mTHP sizes.
> >
> > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a
> > reason, arbitrarily changing this to suit a specific case seems crazy no?
> We wouldn't be changing it for PMD collapse, just for the new
> behavior. At 511, no mTHP collapses would ever occur anyways, unless
> you have 2MB disabled and other mTHP sizes enabled. Technically at 511
> only the highest enabled order always gets collapsed.
>
> Ive also argued in the past that 511 is a terrible default for
> anything other than thp.enabled=always, but that's a whole other can
> of worms we dont need to discuss now.
>
> with this cap of 255, the PMD scan/collapse would work as intended,
> then in mTHP collapses we would never introduce this undesired
> behavior. We've discussed before that this would be a hard problem to
> solve without introducing some expensive way of tracking what has
> already been through a collapse, and that doesnt even consider what
> happens if things change or are unmapped, and rescanning that section
> would be helpful. So having a strictly enforced limit of 255 actually
> seems like a good idea to me, as it completely avoids the undesired
> behavior and does not require the admins to be aware of such an issue.
>
> Another thought similar to what (IIRC) Dev has mentioned before, if we
> have max_ptes_none > 255 then we only consider collapses to the
> largest enabled order, that way no creep to the largest enabled order
> would occur in the first place, and we would get there straight away.
>
> To me one of these two solutions seem sane in the context of what we
> are dealing with.
> >
> > > >
> > > > ie)
> > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
> > > > temp_max_ptes_none = 255;
> > > Oh and my second point, introducing a new tunable to control mTHP
> > > collapse may become exceedingly complex from a tuning and code
> > > management standpoint.
> >
> > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) /
> > PMD) 'except please don't set to the usual default when using mTHP' and it's
> > currently default-broken.
> >
> > I'm really not sure how that is simpler than a seprate tunable that can be
> > expressed as a ratio (e.g. percentage) that actually makes some kind of sense?
> I agree that the current tunable wasn't designed for this, but we
> tried to come up with something that leverages the tunable we have to
> avoid new tunables and added complexity.
> >
> > And we can make anything workable from a code management point of view by
> > refactoring/developing appropriately.
> What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte
> (ideally the max number)? seems like we would be saying we want no new
> none pages, but also to allow new none pages. To me that seems equally
> broken and more confusing than just taking a scale of the current
> number (now with a cap).
>
>
The one thing we absolutely cannot have is a default that causes this
'creeping' behaviour. This feels like shipping something that is broken and
alluding to it in the documentation.
I spoke to David off-list and he gave some insight into this and perhaps
some reasonable means of avoiding an additional tunable.
I don't want to rehash what he said as I think it's more productive for him
to reply when he has time but broadly I think how we handle this needs
careful consideration.
To me it's clear that some sense of ratio is just immediately very very
confusing, but then again this interface is already confusing, as with much
of THP.
Anyway I'll let David respond here so we don't loop around before he has a
chance to add his input.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders
2025-08-21 16:54 ` Steven Rostedt
@ 2025-08-21 16:56 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21 16:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
corbet, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 12:54:45PM -0400, Steven Rostedt wrote:
> On Thu, 21 Aug 2025 13:05:42 +0100
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > > + /*
> > > + * TODO: In some cases of partially-mapped folios, we'd actually
> > > + * want to collapse.
> > > + */
> >
> > Not a fan of adding todo's in code, they have a habit of being left forever.
>
> It's a way to make the developer more depressed by reminding them that they
> will never be able to complete their TODO list :-p
>
> Personally, I enjoy the torture of these comments in the code ;-)
Well a person must at least _somewhat_ enjoy torture if they decide to do kernel
development (let alone maintainership :P) so this very much figures ;)
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 16:54 ` Lorenzo Stoakes
@ 2025-08-21 17:26 ` David Hildenbrand
2025-08-21 20:43 ` David Hildenbrand
1 sibling, 0 replies; 75+ messages in thread
From: David Hildenbrand @ 2025-08-21 17:26 UTC (permalink / raw)
To: Lorenzo Stoakes, Nico Pache
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On 21.08.25 18:54, Lorenzo Stoakes wrote:
> On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote:
>>>>>> Thanks and I'll have a look, but this series is unmergeable with a broken
>>>>>> default in
>>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
>>>>>> sorry.
>>>>>>
>>>>>> We need to have a new tunable as far as I can tell. I also find the use of
>>>>>> this PMD-specific value as an arbitrary way of expressing a ratio pretty
>>>>>> gross.
>>>>> The first thing that comes to mind is that we can pin max_ptes_none to
>>>>> 255 if it exceeds 255. It's worth noting that the issue occurs only
>>>>> for adjacently enabled mTHP sizes.
>>>
>>> No! Presumably the default of 511 (for PMDs with 512 entries) is set for a
>>> reason, arbitrarily changing this to suit a specific case seems crazy no?
>> We wouldn't be changing it for PMD collapse, just for the new
>> behavior. At 511, no mTHP collapses would ever occur anyways, unless
>> you have 2MB disabled and other mTHP sizes enabled. Technically at 511
>> only the highest enabled order always gets collapsed.
>>
>> Ive also argued in the past that 511 is a terrible default for
>> anything other than thp.enabled=always, but that's a whole other can
>> of worms we dont need to discuss now.
>>
>> with this cap of 255, the PMD scan/collapse would work as intended,
>> then in mTHP collapses we would never introduce this undesired
>> behavior. We've discussed before that this would be a hard problem to
>> solve without introducing some expensive way of tracking what has
>> already been through a collapse, and that doesnt even consider what
>> happens if things change or are unmapped, and rescanning that section
>> would be helpful. So having a strictly enforced limit of 255 actually
>> seems like a good idea to me, as it completely avoids the undesired
>> behavior and does not require the admins to be aware of such an issue.
>>
>> Another thought similar to what (IIRC) Dev has mentioned before, if we
>> have max_ptes_none > 255 then we only consider collapses to the
>> largest enabled order, that way no creep to the largest enabled order
>> would occur in the first place, and we would get there straight away.
>>
>> To me one of these two solutions seem sane in the context of what we
>> are dealing with.
>>>
>>>>>
>>>>> ie)
>>>>> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
>>>>> temp_max_ptes_none = 255;
>>>> Oh and my second point, introducing a new tunable to control mTHP
>>>> collapse may become exceedingly complex from a tuning and code
>>>> management standpoint.
>>>
>>> Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) /
>>> PMD) 'except please don't set to the usual default when using mTHP' and it's
>>> currently default-broken.
>>>
>>> I'm really not sure how that is simpler than a seprate tunable that can be
>>> expressed as a ratio (e.g. percentage) that actually makes some kind of sense?
>> I agree that the current tunable wasn't designed for this, but we
>> tried to come up with something that leverages the tunable we have to
>> avoid new tunables and added complexity.
>>>
>>> And we can make anything workable from a code management point of view by
>>> refactoring/developing appropriately.
>> What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte
>> (ideally the max number)? seems like we would be saying we want no new
>> none pages, but also to allow new none pages. To me that seems equally
>> broken and more confusing than just taking a scale of the current
>> number (now with a cap).
>>
>>
>
> The one thing we absolutely cannot have is a default that causes this
> 'creeping' behaviour. This feels like shipping something that is broken and
> alluding to it in the documentation.
>
> I spoke to David off-list and he gave some insight into this and perhaps
> some reasonable means of avoiding an additional tunable.
>
> I don't want to rehash what he said as I think it's more productive for him
> to reply when he has time but broadly I think how we handle this needs
> careful consideration.
>
> To me it's clear that some sense of ratio is just immediately very very
> confusing, but then again this interface is already confusing, as with much
> of THP.
>
> Anyway I'll let David respond here so we don't loop around before he has a
> chance to add his input.
I've been summoned.
As raised in the past, I would initially only support specific values here like
0 : Never collapse with any pte_none/zeropage
511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage
One could also easily support the value 255 ((HPAGE_PMD_NR / 2) - 1), but I'm
not sure if we have to add that for now.
Because, as raised in the past, I'm afraid nobody on this earth has a clue how
to set this parameter to values other than 0 (don't waste memory with khugepaged)
and 511 (page fault behavior).
If any other value is set, essentially
pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
for now and just disable it.
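A rough sketch of that gate, purely illustrative (the helper name is
made up; pr_warn_once is used instead of pr_warn only to avoid spamming
the log on every scan):

	static bool max_ptes_none_supported_for_mthp(void)
	{
		/* Only the two well-understood values are honoured. */
		if (khugepaged_max_ptes_none == 0 ||
		    khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
			return true;

		pr_warn_once("Unsupported 'max_ptes_none' value for mTHP collapse\n");
		return false;
	}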
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 16:54 ` Lorenzo Stoakes
2025-08-21 17:26 ` David Hildenbrand
@ 2025-08-21 20:43 ` David Hildenbrand
2025-08-22 10:41 ` Lorenzo Stoakes
1 sibling, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:43 UTC (permalink / raw)
To: Lorenzo Stoakes, Nico Pache
Cc: Dev Jain, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
>
> The one thing we absolutely cannot have is a default that causes this
> 'creeping' behaviour. This feels like shipping something that is broken and
> alluding to it in the documentation.
>
> I spoke to David off-list and he gave some insight into this and perhaps
> some reasonable means of avoiding an additional tunable.
>
> I don't want to rehash what he said as I think it's more productive for him
> to reply when he has time but broadly I think how we handle this needs
> careful consideration.
>
> To me it's clear that some sense of ratio is just immediately very very
> confusing, but then again this interface is already confusing, as with much
> of THP.
>
> Anyway I'll let David respond here so we don't loop around before he has a
> chance to add his input.
>
> Cheers, Lorenzo
>
[Resending because Thunderbird decided to use the wrong smtp server]
I've been summoned.
As raised in the past, I would initially only support specific values here like
0 : Never collapse with any pte_none/zeropage
511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage
One could also easily support the value 255 ((HPAGE_PMD_NR / 2) - 1), but I'm
not sure if we have to add that for now.
Because, as raised in the past, I'm afraid nobody on this earth has a clue how
to set this parameter to values other than 0 (don't waste memory with khugepaged)
and 511 (page fault behavior).
If any other value is set, essentially
pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
for now and just disable it.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
2025-08-21 14:18 ` Lorenzo Stoakes
2025-08-21 14:26 ` Lorenzo Stoakes
@ 2025-08-22 6:59 ` Baolin Wang
2025-08-22 7:36 ` Dev Jain
1 sibling, 1 reply; 75+ messages in thread
From: Baolin Wang @ 2025-08-22 6:59 UTC (permalink / raw)
To: Lorenzo Stoakes, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
Liam.Howlett, ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 2025/8/21 22:18, Lorenzo Stoakes wrote:
> On Tue, Aug 19, 2025 at 07:42:02AM -0600, Nico Pache wrote:
>> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> When only non-PMD-sized mTHP is enabled (such as only 64K mTHP enabled),
>
> I don't think this example is very useful, probably just remove it.
>
> Also 'non-PMD-sized mTHP' implies there is such a thing as PMD-sized mTHP :)
>
>> we should also allow kicking khugepaged to attempt scanning and collapsing
>
> What is kicking? I think this should be rephrased to something like 'we should
> also allow khugepaged to attempt scanning...'
>
>> 64K mTHP. Modify hugepage_pmd_enabled() to support mTHP collapse, and
>
> 64K mTHP -> "of mTHP ranges". Put the 'Modify...' bit in a new paragraph to
> be clear.
>
>> while we are at it, rename it to make the function name more clear.
>
> To make this clearer let me suggest:
>
> In order for khugepaged to operate when only mTHP sizes are
> specified in sysfs, we must modify the predicate function that
> determines whether it ought to run to do so.
>
> This function is currently called hugepage_pmd_enabled(), this
> patch renames it to hugepage_enabled() and updates the logic to
> check to determine whether any valid orders may exist which would
> justify khugepaged running.
Thanks. This looks good to me.
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>
>> ---
>> mm/khugepaged.c | 20 ++++++++++----------
>> 1 file changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2cadd07341de..81d2ffd56ab9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -430,7 +430,7 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>> mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>> }
>>
>> -static bool hugepage_pmd_enabled(void)
>> +static bool hugepage_enabled(void)
>> {
>> /*
>> * We cover the anon, shmem and the file-backed case here; file-backed
>> @@ -442,11 +442,11 @@ static bool hugepage_pmd_enabled(void)
>
> The comment above this still references PMD-sized, please make sure to update
> comments when you change the described behaviour, as it is now incorrect:
>
> /*
> * We cover the anon, shmem and the file-backed case here; file-backed
> * hugepages, when configured in, are determined by the global control.
> * Anon pmd-sized hugepages are determined by the pmd-size control.
> * Shmem pmd-sized hugepages are also determined by its pmd-size control,
> * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> */
>
> Please correct this.
Sure. How about:
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
* Anon hugepages are determined by its per-size mTHP control.
* Shmem pmd-sized hugepages are also determined by its pmd-size control,
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
>> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>> hugepage_global_enabled())
>> return true;
>> - if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>> + if (READ_ONCE(huge_anon_orders_always))
>> return true;
>> - if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
>> + if (READ_ONCE(huge_anon_orders_madvise))
>> return true;
>> - if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>> + if (READ_ONCE(huge_anon_orders_inherit) &&
>> hugepage_global_enabled())
>
> I guess READ_ONCE() is probably sufficient here as memory ordering isn't
> important here, right?
Yes, I think so.
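Putting the hunks above together, the renamed predicate would look
roughly like this (reconstructed from the diff; lines the diff does not
show, such as the shmem check mentioned in the comment, are elided):

	static bool hugepage_enabled(void)
	{
		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
		    hugepage_global_enabled())
			return true;
		if (READ_ONCE(huge_anon_orders_always))
			return true;
		if (READ_ONCE(huge_anon_orders_madvise))
			return true;
		if (READ_ONCE(huge_anon_orders_inherit) &&
		    hugepage_global_enabled())
			return true;
		/* ... remaining checks (e.g. shmem) unchanged ... */
		return false;
	}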
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs
2025-08-22 6:59 ` Baolin Wang
@ 2025-08-22 7:36 ` Dev Jain
0 siblings, 0 replies; 75+ messages in thread
From: Dev Jain @ 2025-08-22 7:36 UTC (permalink / raw)
To: Baolin Wang, Lorenzo Stoakes, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 22/08/25 12:29 pm, Baolin Wang wrote:
>
>
> On 2025/8/21 22:18, Lorenzo Stoakes wrote:
>> On Tue, Aug 19, 2025 at 07:42:02AM -0600, Nico Pache wrote:
>>> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>
>>> When only non-PMD-sized mTHP is enabled (such as only 64K mTHP
>>> enabled),
>>
>> I don't think this example is very useful, probably just remove it.
>>
>> Also 'non-PMD-sized mTHP' implies there is such a thing as PMD-sized
>> mTHP :)
>>
>>> we should also allow kicking khugepaged to attempt scanning and
>>> collapsing
>>
>> What is kicking? I think this should be rephrased to something like
>> 'we should
>> also allow khugepaged to attempt scanning...'
>>
>>> 64K mTHP. Modify hugepage_pmd_enabled() to support mTHP collapse, and
>>
>> 64K mTHP -> "of mTHP ranges". Put the 'Modify...' bit in a new
>> paragraph to
>> be clear.
>>
>>> while we are at it, rename it to make the function name more clear.
>>
>> To make this clearer let me suggest:
>>
>> In order for khugepaged to operate when only mTHP sizes are
>> specified in sysfs, we must modify the predicate function that
>> determines whether it ought to run to do so.
>>
>> This function is currently called hugepage_pmd_enabled(), this
>> patch renames it to hugepage_enabled() and updates the logic to
>> check to determine whether any valid orders may exist which would
>> justify khugepaged running.
>
> Thanks. This looks good to me.
>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>
>>> ---
>>> mm/khugepaged.c | 20 ++++++++++----------
>>> 1 file changed, 10 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 2cadd07341de..81d2ffd56ab9 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -430,7 +430,7 @@ static inline int
>>> collapse_test_exit_or_disable(struct mm_struct *mm)
>>> mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>>> }
>>>
>>> -static bool hugepage_pmd_enabled(void)
>>> +static bool hugepage_enabled(void)
>>> {
>>> /*
>>> * We cover the anon, shmem and the file-backed case here;
>>> file-backed
>>> @@ -442,11 +442,11 @@ static bool hugepage_pmd_enabled(void)
>>
>> The comment above this still references PMD-sized, please make sure
>> to update
>> comments when you change the described behaviour, as it is now
>> incorrect:
>>
>> /*
>> * We cover the anon, shmem and the file-backed case here;
>> file-backed
>> * hugepages, when configured in, are determined by the global
>> control.
>> * Anon pmd-sized hugepages are determined by the pmd-size control.
>> * Shmem pmd-sized hugepages are also determined by its pmd-size
>> control,
>> * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
>> */
>>
>> Please correct this.
>
> Sure. How about:
>
> /*
> * We cover the anon, shmem and the file-backed case here; file-backed
> * hugepages, when configured in, are determined by the global control.
> * Anon hugepages are determined by its per-size mTHP control.
> * Shmem pmd-sized hugepages are also determined by its pmd-size control,
> * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> */
Looks good, had done something similar in my version.
>
>>> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>>> hugepage_global_enabled())
>>> return true;
>>> - if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>>> + if (READ_ONCE(huge_anon_orders_always))
>>> return true;
>>> - if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
>>> + if (READ_ONCE(huge_anon_orders_madvise))
>>> return true;
>>> - if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>>> + if (READ_ONCE(huge_anon_orders_inherit) &&
>>> hugepage_global_enabled())
>>
>> I guess READ_ONCE() is probably sufficient here as memory ordering isn't
>> important here, right?
>
> Yes, I think so.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2025-08-20 16:35 ` Nico Pache
@ 2025-08-22 10:21 ` Lorenzo Stoakes
2025-08-26 13:30 ` Nico Pache
0 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-22 10:21 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Wed, Aug 20, 2025 at 10:35:57AM -0600, Nico Pache wrote:
> On Wed, Aug 20, 2025 at 5:22 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Aug 19, 2025 at 07:41:54AM -0600, Nico Pache wrote:
> > > The khugepaged daemon and madvise_collapse have two different
> > > implementations that do almost the same thing.
> > >
> > > Create collapse_single_pmd to increase code reuse and create an entry
> > > point to these two users.
> > >
> > > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > > collapse_single_pmd function. This introduces a minor behavioral change
> > > that is most likely an undiscovered bug. The current implementation of
> > > khugepaged tests collapse_test_exit_or_disable before calling
> > > collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> > > case. By unifying these two callers madvise_collapse now also performs
> > > this check.
> > >
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > > mm/khugepaged.c | 94 ++++++++++++++++++++++++++-----------------------
> > > 1 file changed, 49 insertions(+), 45 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 0e7bbadf03ee..b7b98aebb670 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2382,6 +2382,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > return result;
> > > }
> > >
> > > +/*
> > > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > > + * the results.
> > > + */
> > > +static int collapse_single_pmd(unsigned long addr,
> > > + struct vm_area_struct *vma, bool *mmap_locked,
> > > + struct collapse_control *cc)
> > > +{
> > > + int result = SCAN_FAIL;
> >
> > You assign result in all branches, so this can be uninitialised.
> ack, thanks.
> >
> > > + struct mm_struct *mm = vma->vm_mm;
> > > +
> > > + if (!vma_is_anonymous(vma)) {
> > > + struct file *file = get_file(vma->vm_file);
> > > + pgoff_t pgoff = linear_page_index(vma, addr);
> > > +
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > > + fput(file);
> > > + if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > > + mmap_read_lock(mm);
> > > + *mmap_locked = true;
> > > + if (collapse_test_exit_or_disable(mm)) {
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > + result = SCAN_ANY_PROCESS;
> > > + goto end;
> >
> > Don't love that in e.g. collapse_scan_mm_slot() we are using the mmap lock being
> > disabled as in effect an error code.
> >
> > Is SCAN_ANY_PROCESS correct here? Because in collapse_scan_mm_slot() you'd
> > previously:
> https://lore.kernel.org/lkml/a881ed65-351a-469f-b625-a3066d0f1d5c@linux.alibaba.com/
> Baolin brought up a good point a while back that if
> collapse_test_exit_or_disable returns true we will be breaking out of
> the loop and should change the return value to indicate this. So to
> combine the madvise breakout and the scan_slot breakout we drop the
> lock and return SCAN_ANY_PROCESS.
Let's document this in the commit message; as Liam has pointed out, it's really
important to track things, and part of that is detailing in the commit message
what you're doing and why.
With the THP code being as 'organically grown' as it is, shall we say :), it's
even more important to be specific.
> >
> > if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > mmap_read_lock(mm);
> > if (collapse_test_exit_or_disable(mm))
> > goto breakouterloop;
> > ...
> > }
> >
> > But now you're setting result = SCAN_ANY_PROCESS rather than
> > SCAN_PTE_MAPPED_HUGEPAGE in this instance?
> >
> > You don't mention that you're changing this, or at least explicitly enough,
> > the commit message should state that you're changing this and explain why
> > it's ok.
> I do state it but perhaps I need to be more verbose! I will update the
> message to state we are also changing the result value too.
Thanks!
> >
> > This whole file is horrid, and it's kinda an aside, but I really wish we
> > had some comment going through each of the scan_result cases and explaining
> > what each one meant.
> Yeah it's been a huge pain to have to investigate what everything is
> supposed to mean, and I often have to go searching to confirm things.
> include/trace/events/huge_memory.h has a "good" summary of them
> >
> > Also I think:
> >
> > return SCAN_ANY_PROCESS;
> >
> > Is better than:
> >
> > result = SCAN_ANY_PROCESS;
> > goto end;
> I agree! I will change that :)
> > ...
> > end:
> > return result;
> >
> > > + }
> > > + result = collapse_pte_mapped_thp(mm, addr,
> > > + !cc->is_khugepaged);
> >
> > Hm another change here, in the original code in collapse_scan_mm_slot()
> > this is:
> >
> > *result = collapse_pte_mapped_thp(mm,
> > khugepaged_scan.address, false);
> >
> > Presumably collapse_scan_mm_slot() is only ever invoked with
> > cc->is_khugepaged?
> Correct, but the madvise_collapse calls this with true, hence why it
> now depends on the is_khugepaged variable. No functional change here.
> >
> > Maybe worth adding a VM_WARN_ON_ONCE(!cc->is_khugepaged) at the top of
> > collapse_scan_mm_slot() to assert this (and other places where your change
> > assumes this to be the case).
> Ok I will investigate doing that but it would take a huge mistake to
> hit that assertion.
> >
> >
> > > + if (result == SCAN_PMD_MAPPED)
> > > + result = SCAN_SUCCEED;
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > + }
> > > + } else {
> > > + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > > + }
> > > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > > + ++khugepaged_pages_collapsed;
> >
> > Similarly, presumably because collapse_scan_mm_slot() only ever invoked
> > khugepaged case this didn't have the cc->is_khugepaged check?
> Correct, we only increment this when it's khugepaged, so we need to
> guard it so madvise_collapse won't increment it.
You know what I'm going to say :) commit message please!
> >
> > > +end:
> > > + return result;
> > > +}
> >
> > There's a LOT of nesting going on here, I think we can simplify this a
> > lot. If we make the change I noted above re: returning SCAN_ANY_PROCESS< we
> > can move the end label up a bit and avoid a ton of nesting, e.g.:
> Ah I like this much more, I will try to implement/test it.
> >
> > static int collapse_single_pmd(unsigned long addr,
> > struct vm_area_struct *vma, bool *mmap_locked,
> > struct collapse_control *cc)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct file *file;
> > pgoff_t pgoff;
> > int result;
> >
> > if (vma_is_anonymous(vma)) {
> > result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > goto end;
> > }
> >
> > file = get_file(vma->vm_file);
> > pgoff = linear_page_index(vma, addr);
> >
> > mmap_read_unlock(mm);
> > *mmap_locked = false;
> > result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > fput(file);
> > if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> > goto end;
> >
> > mmap_read_lock(mm);
> > *mmap_locked = true;
> > if (collapse_test_exit_or_disable(mm)) {
> > mmap_read_unlock(mm);
> > *mmap_locked = false;
> > return SCAN_ANY_PROCESS;
> > }
> > result = collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> > if (result == SCAN_PMD_MAPPED)
> > result = SCAN_SUCCEED;
> > mmap_read_unlock(mm);
> > *mmap_locked = false;
> >
> > end:
> > if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > ++khugepaged_pages_collapsed;
> >
> > return result;
> > }
> >
> > (untested, thrown together so do double check!)
Does this suggested refactoring work for you?
> >
> > > +
> > > static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> > > struct collapse_control *cc)
> > > __releases(&khugepaged_mm_lock)
> > > @@ -2455,34 +2499,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> > > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > > hend);
> > > - if (!vma_is_anonymous(vma)) {
> > > - struct file *file = get_file(vma->vm_file);
> > > - pgoff_t pgoff = linear_page_index(vma,
> > > - khugepaged_scan.address);
> > > -
> > > - mmap_read_unlock(mm);
> > > - mmap_locked = false;
> > > - *result = collapse_scan_file(mm,
> > > - khugepaged_scan.address, file, pgoff, cc);
> > > - fput(file);
> > > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > > - mmap_read_lock(mm);
> > > - if (collapse_test_exit_or_disable(mm))
> > > - goto breakouterloop;
> > > - *result = collapse_pte_mapped_thp(mm,
> > > - khugepaged_scan.address, false);
> > > - if (*result == SCAN_PMD_MAPPED)
> > > - *result = SCAN_SUCCEED;
> > > - mmap_read_unlock(mm);
> > > - }
> > > - } else {
> > > - *result = collapse_scan_pmd(mm, vma,
> > > - khugepaged_scan.address, &mmap_locked, cc);
> > > - }
> > > -
> > > - if (*result == SCAN_SUCCEED)
> > > - ++khugepaged_pages_collapsed;
> > >
> > > + *result = collapse_single_pmd(khugepaged_scan.address,
> > > + vma, &mmap_locked, cc);
> > > /* move to next address */
> > > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > > progress += HPAGE_PMD_NR;
> > > @@ -2799,34 +2818,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > > mmap_assert_locked(mm);
> > > memset(cc->node_load, 0, sizeof(cc->node_load));
> > > nodes_clear(cc->alloc_nmask);
> > > - if (!vma_is_anonymous(vma)) {
> > > - struct file *file = get_file(vma->vm_file);
> > > - pgoff_t pgoff = linear_page_index(vma, addr);
> > >
> > > - mmap_read_unlock(mm);
> > > - mmap_locked = false;
> > > - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > > - fput(file);
> > > - } else {
> > > - result = collapse_scan_pmd(mm, vma, addr,
> > > - &mmap_locked, cc);
> > > - }
> > > + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > > +
> >
> > Ack the fact you noted the behaviour change re:
> > collapse_test_exit_or_disable() that seems fine.
> >
> > > if (!mmap_locked)
> > > *lock_dropped = true;
> > >
> > > -handle_result:
> > > switch (result) {
> > > case SCAN_SUCCEED:
> > > case SCAN_PMD_MAPPED:
> > > ++thps;
> > > break;
> > > - case SCAN_PTE_MAPPED_HUGEPAGE:
> > > - BUG_ON(mmap_locked);
> > > - mmap_read_lock(mm);
> > > - result = collapse_pte_mapped_thp(mm, addr, true);
> > > - mmap_read_unlock(mm);
> > > - goto handle_result;
> >
> > One thing that differs with new code her is we filter SCAN_PMD_MAPPED to
> > SCAN_SUCCEED.
> >
> > I was about to say 'but ++thps - is this correct' but now I realise this
> > was looping back on itself with a goto to do just that... ugh ye gads.
> >
> > Anwyay that's fine because it doesn't change anything.
> >
> > Re: switch statement in general, again would be good to always have each
> > scan possibility in switch statements, but perhaps given so many not
> > practical :)
>
> Yeah it may be worth investigating for future changes I have for
> khugepaged (including the new switch statement I implement later and
> you commented on)
Ack yeah this can be one for the future!
> >
> > (that way the compiler warns on missing a newly added enum val)
> >
> > > /* Whitelisted set of results where continuing OK */
> > > + case SCAN_PTE_MAPPED_HUGEPAGE:
> > > case SCAN_PMD_NULL:
> > > case SCAN_PTE_NON_PRESENT:
> > > case SCAN_PTE_UFFD_WP:
> > > --
>
> Thanks for the review :)
No probs, to underline as well - the critique is to make sure we get this right,
my aim here is to get your series landed in as good a form as possible :)
>
> -- Nico
> > > 2.50.1
> > >
> >
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-21 14:09 ` Zi Yan
@ 2025-08-22 10:25 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-22 10:25 UTC (permalink / raw)
To: Zi Yan
Cc: Wei Yang, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, david, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On Thu, Aug 21, 2025 at 10:09:10AM -0400, Zi Yan wrote:
> On 20 Aug 2025, at 23:41, Wei Yang wrote:
>
> > On Wed, Aug 20, 2025 at 09:40:40AM -0600, Nico Pache wrote:
> > [...]
> >>>
> >>>> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> >>>> return SCAN_ADDRESS_RANGE;
> >>>> - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> >>>> + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
> >>>> return SCAN_VMA_CHECK;
> >>>> /*
> >>>> * Anon VMA expected, the address may be unmapped then
> >>>> @@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>> goto out_nolock;
> >>>>
> >>>> mmap_read_lock(mm);
> >>>> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> >>>> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> >>>> + BIT(HPAGE_PMD_ORDER));
> >>>
> >>> Shouldn't this be PMD order? Seems equivalent.
> >> Yeah i'm actually not sure why we have both... they seem to be the
> >> same thing, but perhaps there is some reason for having two...
> >
> > I am confused with these two, PMD_ORDER above and HPAGE_PMD_ORDER from here.
> >
> > Do we have a guide on when to use which?
>
> Looking at the definition of HPAGE_PMD_SHIFT in huge_mm.h, using it will
> cause a build bug when PMD-level huge pages are not supported. So I think
> HPAGE_PMD_ORDER should be used for all huge pages (both THP and hugetlb).
Thanks yeah that makes sense, so let's keep it all as HPAGE_PMD_ORDER then!
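For reference, the !CONFIG_TRANSPARENT_HUGEPAGE stub being referred to
looks roughly like this (paraphrased from include/linux/huge_mm.h; the
exact form varies by kernel version), which is why HPAGE_PMD_ORDER
fails at compile time in non-THP builds while the generic PMD_ORDER
does not:

	#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
	#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT - PAGE_SHIFT)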
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders
2025-08-21 12:33 ` Dev Jain
@ 2025-08-22 10:33 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-22 10:33 UTC (permalink / raw)
To: Dev Jain
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 06:03:52PM +0530, Dev Jain wrote:
>
> On 21/08/25 5:35 pm, Lorenzo Stoakes wrote:
> >
> >
> > > ---
> > > mm/khugepaged.c | 9 +++++++++
> > > 1 file changed, 9 insertions(+)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 1ad7e00d3fd6..6a4cf7e4a7cc 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -611,6 +611,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > folio = page_folio(page);
> > > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> > >
> > > + /*
> > > + * TODO: In some cases of partially-mapped folios, we'd actually
> > > + * want to collapse.
> > > + */
> > Not a fan of adding todo's in code, they have a habit of being left forever. I'd
> > maybe put a more written out comment something similar to the commit message.
>
> I had suggested to add in https://lore.kernel.org/all/20250211111326.14295-10-dev.jain@arm.com/
> from the get go, but then we decided to leave it for later. So rest assured this TODO won't
> be left forever : )
:)
I think it's better, despite Steven's sage words :P, to put something a little
more meaningful here like 'currently we don't blah blah because blah blah'
without the TODO ;)
But obviously this isn't a substantially pressing issue... :>)
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-21 20:43 ` David Hildenbrand
@ 2025-08-22 10:41 ` Lorenzo Stoakes
2025-08-22 14:10 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-22 10:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: Nico Pache, Dev Jain, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Thu, Aug 21, 2025 at 10:43:35PM +0200, David Hildenbrand wrote:
> >
> > The one thing we absolutely cannot have is a default that causes this
> > 'creeping' behaviour. This feels like shipping something that is broken and
> > alluding to it in the documentation.
> >
> > I spoke to David off-list and he gave some insight into this and perhaps
> > some reasonable means of avoiding an additional tunable.
> >
> > I don't want to rehash what he said as I think it's more productive for him
> > to reply when he has time but broadly I think how we handle this needs
> > careful consideration.
> >
> > To me it's clear that some sense of ratio is just immediately very very
> > confusing, but then again this interface is already confusing, as with much
> > of THP.
> >
> > Anyway I'll let David respond here so we don't loop around before he has a
> > chance to add his input.
> >
> > Cheers, Lorenzo
> >
>
> [Resending because Thunderbird decided to use the wrong smtp server]
>
> I've been summoned.
Welcome :)
>
> As raised in the past, I would initially only support specific values here like
>
> 0 : Never collapse with any pte_none/zeropage
> 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage
>
OK so if we had effectively an off/on (I guess we have to keep this as it is for
legacy purposes) and it is forced to one or the other of these values then fine
(as long as we don't have uAPI worries).
> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure
> if we have to add that for now.
Yeah not so sure about this, this is a 'just have to know' too, and yes you
might add it to the docs, but people are going to be mightily confused, esp if
it's a calculated value.
I don't see any other way around having a separate tunable if we don't just have
something VERY simple like on/off.
Also the mentioned issue honestly sounds like something that needs to be fixed
elsewhere, in the algorithm used to figure out mTHP ranges (I may be wrong - and
happy to stand corrected if this is somehow inherent, but it really feels that
way).
>
> Because, as raised in the past, I'm afraid nobody on this earth has a clue how
> to set this parameter to values different to 0 (don't waste memory with khugepaged)
> and 511 (page fault behavior).
Yup
>
>
> If any other value is set, essentially
> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
>
> for now and just disable it.
Hmm but under what circumstances? I would just say 'unsupported value' and not
mention mTHP, or people who don't use mTHP might find that confusing.
>
> --
> Cheers
>
> David / dhildenb
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-22 10:41 ` Lorenzo Stoakes
@ 2025-08-22 14:10 ` David Hildenbrand
2025-08-22 14:49 ` Lorenzo Stoakes
2025-08-28 9:46 ` Baolin Wang
0 siblings, 2 replies; 75+ messages in thread
From: David Hildenbrand @ 2025-08-22 14:10 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Nico Pache, Dev Jain, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure
>> if we have to add that for now.
>
> Yeah not so sure about this, this is a 'just have to know' too, and yes you
> might add it to the docs, but people are going to be mightily confused, esp if
> it's a calculated value.
>
> I don't see any other way around having a separate tunable if we don't just have
> something VERY simple like on/off.
Yeah, not advocating that we add support for other values than 0/511,
really.
>
> Also the mentioned issue sounds like something that needs to be fixed elsewhere
> honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and
> happy to stand corrected if this is somehow inherent, but reallly feels that
> way).
I think the creep is unavoidable for certain values.
If you have the first two pages of a PMD area populated, and you allow
for at least half of the #PTEs to be non/zero, you'd collapse first an
order-2 folio, then an order-3 ... until you reached PMD order.
So for now we really should just support 0 / 511 to say "don't collapse
if there are holes" vs. "always collapse if there is at least one pte used".
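To make the arithmetic concrete, here is a tiny userspace sketch (not the
kernel code; it assumes the series' scaling rule of
max_ptes_none >> (HPAGE_PMD_ORDER - order) and compresses the successive
khugepaged passes into one loop):

	#include <stdio.h>

	#define HPAGE_PMD_ORDER	9

	int main(void)
	{
		unsigned int max_ptes_none = 300;	/* > HPAGE_PMD_NR / 2 */
		unsigned int present = 2;		/* PTEs populated up front */
		int order;

		for (order = 2; order <= HPAGE_PMD_ORDER; order++) {
			unsigned int nr = 1u << order;
			unsigned int scaled = max_ptes_none >> (HPAGE_PMD_ORDER - order);
			unsigned int none = nr - present;

			if (none > scaled)
				break;		/* too many holes, collapse refused */
			printf("pass %d: collapse order-%d (%u present, %u none <= %u)\n",
			       order - 1, order, present, none, scaled);
			present = nr;		/* the collapse fills the whole range */
		}
		return 0;
	}

With max_ptes_none = 300 every pass succeeds and the region is promoted one
order at a time until it reaches PMD order; with max_ptes_none = 0 the very
first check refuses, which is the "don't collapse if there are holes"
behaviour.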
>
>>
>> Because, as raised in the past, I'm afraid nobody on this earth has a clue how
>> to set this parameter to values different to 0 (don't waste memory with khugepaged)
>> and 511 (page fault behavior).
>
> Yup
>
>>
>>
>> If any other value is set, essentially
>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
>>
>> for now and just disable it.
>
> Hmm but under what circumstances? I would just say unsupported value not mention
> mTHP or people who don't use mTHP might find that confusing.
Well, we can check whether any mTHP size is enabled while the value is
set to something unexpected. We can then even print the problematic
sizes if we have to.
We could also just say that if the value is set to something other than
511 (which is the default), it will be treated as being "0" when
collapsing mTHP, instead of doing any scaling.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-22 14:10 ` David Hildenbrand
@ 2025-08-22 14:49 ` Lorenzo Stoakes
2025-08-22 15:33 ` Dev Jain
2025-08-28 9:46 ` Baolin Wang
1 sibling, 1 reply; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-22 14:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Nico Pache, Dev Jain, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote:
> > > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure
> > > if we have to add that for now.
> >
> > Yeah not so sure about this, this is a 'just have to know' too, and yes you
> > might add it to the docs, but people are going to be mightily confused, esp if
> > it's a calculated value.
> >
> > I don't see any other way around having a separate tunable if we don't just have
> > something VERY simple like on/off.
>
> Yeah, not advocating that we add support for other values than 0/511,
> really.
Yeah I'm fine with 0/511.
>
> >
> > Also the mentioned issue sounds like something that needs to be fixed elsewhere
> > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and
> > happy to stand corrected if this is somehow inherent, but reallly feels that
> > way).
>
> I think the creep is unavoidable for certain values.
>
> If you have the first two pages of a PMD area populated, and you allow for
> at least half of the #PTEs to be non/zero, you'd collapse first a
> order-2 folio, then and order-3 ... until you reached PMD order.
Feels like we should be looking at this in reverse? What's the largest, then
next largest, then etc.?
Surely this is the sensible way of doing it?
>
> So for now we really should just support 0 / 511 to say "don't collapse if
> there are holes" vs. "always collapse if there is at least one pte used".
Yes.
>
> >
> > >
> > > Because, as raised in the past, I'm afraid nobody on this earth has a clue how
> > > to set this parameter to values different to 0 (don't waste memory with khugepaged)
> > > and 511 (page fault behavior).
> >
> > Yup
> >
> > >
> > >
> > > If any other value is set, essentially
> > > pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
> > >
> > > for now and just disable it.
> >
> > Hmm but under what circumstances? I would just say unsupported value not mention
> > mTHP or people who don't use mTHP might find that confusing.
>
> Well, we can check whether any mTHP size is enabled while the value is set
> to something unexpected. We can then even print the problematic sizes if we
> have to.
Ack
>
> We could also just just say that if the value is set to something else than
> 511 (which is the default), it will be treated as being "0" when collapsing
> mthp, instead of doing any scaling.
Or we could make it an error to set anything but 0 or 511, but on the other hand
that's likely to break userspace so yeah probably not.
Maybe have a warning saying 'this is no longer supported and will be ignored'
then set the value to 0 for anything but 511 or 0.
Then we can remove the warning later.
By having 0/511 we can really simplify the 'scaling' logic too which would be
fantastic! :)
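Purely as a sketch of the warn-and-treat-as-0 idea (the helper name is
invented here, not something from the series):

	/*
	 * Only 0 ("no holes allowed") and HPAGE_PMD_NR - 1 ("always
	 * collapse") are meaningful for mTHP collapse; warn on anything
	 * else and fall back to 0 instead of scaling it.
	 */
	static unsigned int collapse_scaled_max_ptes_none(unsigned int order)
	{
		unsigned int max_ptes_none = khugepaged_max_ptes_none;

		if (order != HPAGE_PMD_ORDER && max_ptes_none &&
		    max_ptes_none != HPAGE_PMD_NR - 1) {
			pr_warn_once("khugepaged: max_ptes_none=%u is not supported for mTHP collapse, treating it as 0\n",
				     max_ptes_none);
			max_ptes_none = 0;
		}
		return max_ptes_none >> (HPAGE_PMD_ORDER - order);
	}

Then the callers just compare none_or_zero against the returned value and
the scaling special cases disappear.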
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-22 14:49 ` Lorenzo Stoakes
@ 2025-08-22 15:33 ` Dev Jain
2025-08-26 10:43 ` Lorenzo Stoakes
0 siblings, 1 reply; 75+ messages in thread
From: Dev Jain @ 2025-08-22 15:33 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, baolin.wang, Liam.Howlett, ryan.roberts, corbet, rostedt,
mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On 22/08/25 8:19 pm, Lorenzo Stoakes wrote:
> On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote:
>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure
>>>> if we have to add that for now.
>>> Yeah not so sure about this, this is a 'just have to know' too, and yes you
>>> might add it to the docs, but people are going to be mightily confused, esp if
>>> it's a calculated value.
>>>
>>> I don't see any other way around having a separate tunable if we don't just have
>>> something VERY simple like on/off.
>> Yeah, not advocating that we add support for other values than 0/511,
>> really.
> Yeah I'm fine with 0/511.
>
>>> Also the mentioned issue sounds like something that needs to be fixed elsewhere
>>> honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and
>>> happy to stand corrected if this is somehow inherent, but reallly feels that
>>> way).
>> I think the creep is unavoidable for certain values.
>>
>> If you have the first two pages of a PMD area populated, and you allow for
>> at least half of the #PTEs to be non/zero, you'd collapse first a
>> order-2 folio, then and order-3 ... until you reached PMD order.
> Feels like we should be looking at this in reverse? What's the largest, then
> next largest, then etc.?
>
> Surely this is the sensible way of doing it?
What David means is, for example: suppose all orders are enabled, and we fail to
collapse for order-9, then order-8, then order-7, and so on, *only* because the
distribution of ptes did not obey the scaled max_ptes_none.
Suppose the order-4 collapse then succeeds.
Next time khugepaged comes around it tries order-9, fails, then order-8, fails, and
so on. Then it checks order-5, and that now satisfies the scaled max_ptes_none
constraint only because the previous cycle's order-4 collapse changed the ptes'
distribution.
>
>> So for now we really should just support 0 / 511 to say "don't collapse if
>> there are holes" vs. "always collapse if there is at least one pte used".
> Yes.
>
>>>> Because, as raised in the past, I'm afraid nobody on this earth has a clue how
>>>> to set this parameter to values different to 0 (don't waste memory with khugepaged)
>>>> and 511 (page fault behavior).
>>> Yup
>>>
>>>>
>>>> If any other value is set, essentially
>>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
>>>>
>>>> for now and just disable it.
>>> Hmm but under what circumstances? I would just say unsupported value not mention
>>> mTHP or people who don't use mTHP might find that confusing.
>> Well, we can check whether any mTHP size is enabled while the value is set
>> to something unexpected. We can then even print the problematic sizes if we
>> have to.
> Ack
>
>> We could also just just say that if the value is set to something else than
>> 511 (which is the default), it will be treated as being "0" when collapsing
>> mthp, instead of doing any scaling.
> Or we could make it an error to set anything but 0, 511, but on the other hand
> that's likely to break userspace so yeah probably not.
>
> Maybe have a warning saying 'this is no longer supported and will be ignored'
> then set the value to 0 for anything but 511 or 0.
>
> Then can remove the warning later.
>
> By having 0/511 we can really simplify the 'scaling' logic too which would be
> fantastic! :)
FWIW here was my implementation of this thing, for ease of everyone:
https://lore.kernel.org/all/20250211111326.14295-17-dev.jain@arm.com/
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-19 13:41 ` [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-08-20 13:23 ` Lorenzo Stoakes
@ 2025-08-24 1:37 ` Wei Yang
2025-08-26 13:46 ` Nico Pache
1 sibling, 1 reply; 75+ messages in thread
From: Wei Yang @ 2025-08-24 1:37 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap, hughd
Hi, Nico
Some nits below.
On Tue, Aug 19, 2025 at 07:41:55AM -0600, Nico Pache wrote:
>For khugepaged to support different mTHP orders, we must generalize this
>to check if the PMD is not shared by another VMA and the order is enabled.
>
>To ensure madvise_collapse can support working on mTHP orders without the
>PMD order enabled, we need to convert hugepage_vma_revalidate to take a
>bitmap of orders.
>
>No functional change in this patch.
>
>Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Acked-by: David Hildenbrand <david@redhat.com>
>Co-developed-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index b7b98aebb670..2d192ec961d2 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
There is a comment above this function which says "revalidate vma before
taking mmap_lock".
I am afraid it should be "after taking mmap_lock", or "after taking mmap_lock again"?
>@@ -917,7 +917,7 @@ static int collapse_find_target_node(struct collapse_control *cc)
> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> bool expect_anon,
> struct vm_area_struct **vmap,
>- struct collapse_control *cc)
>+ struct collapse_control *cc, unsigned long orders)
> {
> struct vm_area_struct *vma;
> enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
>@@ -930,9 +930,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> if (!vma)
> return SCAN_VMA_NULL;
>
>+ /* Always check the PMD order to insure its not shared by another VMA */
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> return SCAN_ADDRESS_RANGE;
>- if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
Below is a comment, "thp_vma_allowable_order may return".
Since you use thp_vma_allowable_orders, maybe we need to change the comment
too.
>@@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> goto out_nolock;
>
> mmap_read_lock(mm);
>- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>+ BIT(HPAGE_PMD_ORDER));
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
>@@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
>- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
>+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>+ BIT(HPAGE_PMD_ORDER));
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
>@@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> mmap_read_lock(mm);
> mmap_locked = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
>- cc);
>+ cc, BIT(HPAGE_PMD_ORDER));
> if (result != SCAN_SUCCEED) {
> last_fail = result;
> goto out_nolock;
>--
>2.50.1
>
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-22 15:33 ` Dev Jain
@ 2025-08-26 10:43 ` Lorenzo Stoakes
0 siblings, 0 replies; 75+ messages in thread
From: Lorenzo Stoakes @ 2025-08-26 10:43 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand, Nico Pache, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, ziy, baolin.wang, Liam.Howlett, ryan.roberts,
corbet, rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Fri, Aug 22, 2025 at 09:03:41PM +0530, Dev Jain wrote:
>
> On 22/08/25 8:19 pm, Lorenzo Stoakes wrote:
> > On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote:
> > > > > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure
> > > > > if we have to add that for now.
> > > > Yeah not so sure about this, this is a 'just have to know' too, and yes you
> > > > might add it to the docs, but people are going to be mightily confused, esp if
> > > > it's a calculated value.
> > > >
> > > > I don't see any other way around having a separate tunable if we don't just have
> > > > something VERY simple like on/off.
> > > Yeah, not advocating that we add support for other values than 0/511,
> > > really.
> > Yeah I'm fine with 0/511.
> >
> > > > Also the mentioned issue sounds like something that needs to be fixed elsewhere
> > > > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and
> > > > happy to stand corrected if this is somehow inherent, but reallly feels that
> > > > way).
> > > I think the creep is unavoidable for certain values.
> > >
> > > If you have the first two pages of a PMD area populated, and you allow for
> > > at least half of the #PTEs to be non/zero, you'd collapse first a
> > > order-2 folio, then and order-3 ... until you reached PMD order.
> > Feels like we should be looking at this in reverse? What's the largest, then
> > next largest, then etc.?
> >
> > Surely this is the sensible way of doing it?
>
> What David means to say is, for example, suppose all orders are enabled,
> and we fail to collapse for order-9, then order-8, then order-7, and so on,
> *only* because the distribution of ptes did not obey the scaled max_ptes_none.
> Let order-4 collapse succeed.
Ah so it is the overhead of this that's the problem?
All roads lead to David's suggestion imo.
> > By having 0/511 we can really simplify the 'scaling' logic too which would be
> > fantastic! :)
>
> FWIW here was my implementation of this thing, for ease of everyone:
> https://lore.kernel.org/all/20250211111326.14295-17-dev.jain@arm.com/
That's fine, but I really think we should just replace all this stuff with a
boolean, and change the max_ptes_none interface to set the boolean if 511, or
clear it if 0.
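Roughly (a sketch only; the variable name and the exact coercion are made up
here, not something from the series):

	/* hypothetical: true == old 511 behaviour, false == old 0 behaviour */
	static bool khugepaged_collapse_with_holes = true;

	static ssize_t max_ptes_none_store(struct kobject *kobj,
					   struct kobj_attribute *attr,
					   const char *buf, size_t count)
	{
		unsigned int val;
		int err = kstrtouint(buf, 10, &val);

		if (err)
			return err;

		if (val != 0 && val != HPAGE_PMD_NR - 1)
			pr_warn_once("khugepaged: max_ptes_none=%u is no longer supported, treating it as 0\n",
				     val);
		khugepaged_collapse_with_holes = (val == HPAGE_PMD_NR - 1);
		return count;
	}

Reads could keep reporting 0 or 511 so existing tooling doesn't notice the
difference.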
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2025-08-22 10:21 ` Lorenzo Stoakes
@ 2025-08-26 13:30 ` Nico Pache
0 siblings, 0 replies; 75+ messages in thread
From: Nico Pache @ 2025-08-26 13:30 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Fri, Aug 22, 2025 at 4:23 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Aug 20, 2025 at 10:35:57AM -0600, Nico Pache wrote:
> > On Wed, Aug 20, 2025 at 5:22 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, Aug 19, 2025 at 07:41:54AM -0600, Nico Pache wrote:
> > > > The khugepaged daemon and madvise_collapse have two different
> > > > implementations that do almost the same thing.
> > > >
> > > > Create collapse_single_pmd to increase code reuse and create an entry
> > > > point to these two users.
> > > >
> > > > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > > > collapse_single_pmd function. This introduces a minor behavioral change
> > > > that is most likely an undiscovered bug. The current implementation of
> > > > khugepaged tests collapse_test_exit_or_disable before calling
> > > > collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> > > > case. By unifying these two callers madvise_collapse now also performs
> > > > this check.
> > > >
> > > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > Acked-by: David Hildenbrand <david@redhat.com>
> > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > > ---
> > > > mm/khugepaged.c | 94 ++++++++++++++++++++++++++-----------------------
> > > > 1 file changed, 49 insertions(+), 45 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 0e7bbadf03ee..b7b98aebb670 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -2382,6 +2382,50 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > > return result;
> > > > }
> > > >
> > > > +/*
> > > > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > > > + * the results.
> > > > + */
> > > > +static int collapse_single_pmd(unsigned long addr,
> > > > + struct vm_area_struct *vma, bool *mmap_locked,
> > > > + struct collapse_control *cc)
> > > > +{
> > > > + int result = SCAN_FAIL;
> > >
> > > You assign result in all branches, so this can be uninitialised.
> > ack, thanks.
> > >
> > > > + struct mm_struct *mm = vma->vm_mm;
> > > > +
> > > > + if (!vma_is_anonymous(vma)) {
> > > > + struct file *file = get_file(vma->vm_file);
> > > > + pgoff_t pgoff = linear_page_index(vma, addr);
> > > > +
> > > > + mmap_read_unlock(mm);
> > > > + *mmap_locked = false;
> > > > + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > > > + fput(file);
> > > > + if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > > > + mmap_read_lock(mm);
> > > > + *mmap_locked = true;
> > > > + if (collapse_test_exit_or_disable(mm)) {
> > > > + mmap_read_unlock(mm);
> > > > + *mmap_locked = false;
> > > > + result = SCAN_ANY_PROCESS;
> > > > + goto end;
> > >
> > > Don't love that in e.g. collapse_scan_mm_slot() we are using the mmap lock being
> > > disabled as in effect an error code.
> > >
> > > Is SCAN_ANY_PROCESS correct here? Because in collapse_scan_mm_slot() you'd
> > > previously:
> > https://lore.kernel.org/lkml/a881ed65-351a-469f-b625-a3066d0f1d5c@linux.alibaba.com/
> > Baolin brought up a good point a while back that if
> > collapse_test_exit_or_disable returns true we will be breaking out of
> > the loop and should change the return value to indicate this. So to
> > combine the madvise breakout and the scan_slot breakout we drop the
> > lock and return SCAN_ANY_PROCESS.
>
> Let's document in commit msg, as Liam's pointed out it's really important to
> track things, and part of that as well is detailing in the commit message what
> you're doing + why.
ack! thanks
>
> With the THP code being as 'organically grown' as it is shall we say :) it's
> even more mportant to be specific.
>
> > >
> > > if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > > mmap_read_lock(mm);
> > > if (collapse_test_exit_or_disable(mm))
> > > goto breakouterloop;
> > > ...
> > > }
> > >
> > > But now you're setting result = SCAN_ANY_PROCESS rather than
> > > SCAN_PTE_MAPPED_HUGEPAGE in this instance?
> > >
> > > You don't mention that you're changing this, or at least explicitly enough,
> > > the commit message should state that you're changing this and explain why
> > > it's ok.
> > I do state it but perhaps I need to be more verbose! I will update the
> > message to state we are also changing the result value too.
>
> Thanks!
>
> > >
> > > This whole file is horrid, and it's kinda an aside, but I really wish we
> > > had some comment going through each of the scan_result cases and explaining
> > > what each one meant.
> > Yeah its been a huge pain to have to investigate what everything is
> > supposed to mean, and I often have to go searching to confirm things.
> > include/trace/events/huge_memory.h has a "good" summary of them
> > >
> > > Also I think:
> > >
> > > return SCAN_ANY_PROCESS;
> > >
> > > Is better than:
> > >
> > > result = SCAN_ANY_PROCESS;
> > > goto end;
> > I agree! I will change that :)
> > > ...
> > > end:
> > > return result;
> > >
> > > > + }
> > > > + result = collapse_pte_mapped_thp(mm, addr,
> > > > + !cc->is_khugepaged);
> > >
> > > Hm another change here, in the original code in collapse_scan_mm_slot()
> > > this is:
> > >
> > > *result = collapse_pte_mapped_thp(mm,
> > > khugepaged_scan.address, false);
> > >
> > > Presumably collapse_scan_mm_slot() is only ever invoked with
> > > cc->is_khugepaged?
> > Correct, but the madvise_collapse calls this with true, hence why it
> > now depends on the is_khugepaged variable. No functional change here.
> > >
> > > Maybe worth adding a VM_WARN_ON_ONCE(!cc->is_khugepaged) at the top of
> > > collapse_scan_mm_slot() to assert this (and other places where your change
> > > assumes this to be the case).
> > Ok I will investigate doing that but it would take a huge mistake to
> > hit that assertion.
> > >
> > >
> > > > + if (result == SCAN_PMD_MAPPED)
> > > > + result = SCAN_SUCCEED;
> > > > + mmap_read_unlock(mm);
> > > > + *mmap_locked = false;
> > > > + }
> > > > + } else {
> > > > + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > > > + }
> > > > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > > > + ++khugepaged_pages_collapsed;
> > >
> > > Similarly, presumably because collapse_scan_mm_slot() only ever invoked
> > > khugepaged case this didn't have the cc->is_khugepaged check?
> > Correct, we only increment this when its khugepaged, so we need to
> > guard it so madvise collapse wont increment this.
>
> You know what I'm going to say :) commit message please!
ack, although this isn't anything new. I just needed to add the guard because
madvise_collapse doesn't increment this. Still I'll add a blurb.
>
> > >
> > > > +end:
> > > > + return result;
> > > > +}
> > >
> > > There's a LOT of nesting going on here, I think we can simplify this a
> > > lot. If we make the change I noted above re: returning SCAN_ANY_PROCESS< we
> > > can move the end label up a bit and avoid a ton of nesting, e.g.:
> > Ah I like this much more, I will try to implement/test it.
> > >
> > > static int collapse_single_pmd(unsigned long addr,
> > > struct vm_area_struct *vma, bool *mmap_locked,
> > > struct collapse_control *cc)
> > > {
> > > struct mm_struct *mm = vma->vm_mm;
> > > struct file *file;
> > > pgoff_t pgoff;
> > > int result;
> > >
> > > if (vma_is_anonymous(vma)) {
> > > result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > > goto end;
> > > }
> > >
> > > file = get_file(vma->vm_file);
> > > pgoff = linear_page_index(vma, addr);
> > >
> > > mmap_read_unlock(mm);
> > > *mmap_locked = false;
> > > result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > > fput(file);
> > > if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> > > goto end;
> > >
> > > mmap_read_lock(mm);
> > > *mmap_locked = true;
> > > if (collapse_test_exit_or_disable(mm)) {
> > > mmap_read_unlock(mm);
> > > *mmap_locked = false;
> > > return SCAN_ANY_PROCESS;
> > > }
> > > result = collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> > > if (result == SCAN_PMD_MAPPED)
> > > result = SCAN_SUCCEED;
> > > mmap_read_unlock(mm);
> > > *mmap_locked = false;
> > >
> > > end:
> > > if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > > ++khugepaged_pages_collapsed;
> > >
> > > return result;
> > > }
> > >
> > > (untested, thrown together so do double check!)
>
> This suggested refactoring work for you?
Looks correct, I'm going to implement all the changes then test to
make sure it works as intended.
>
> > >
> > > > +
> > > > static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> > > > struct collapse_control *cc)
> > > > __releases(&khugepaged_mm_lock)
> > > > @@ -2455,34 +2499,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
> > > > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > > > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > > > hend);
> > > > - if (!vma_is_anonymous(vma)) {
> > > > - struct file *file = get_file(vma->vm_file);
> > > > - pgoff_t pgoff = linear_page_index(vma,
> > > > - khugepaged_scan.address);
> > > > -
> > > > - mmap_read_unlock(mm);
> > > > - mmap_locked = false;
> > > > - *result = collapse_scan_file(mm,
> > > > - khugepaged_scan.address, file, pgoff, cc);
> > > > - fput(file);
> > > > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > > > - mmap_read_lock(mm);
> > > > - if (collapse_test_exit_or_disable(mm))
> > > > - goto breakouterloop;
> > > > - *result = collapse_pte_mapped_thp(mm,
> > > > - khugepaged_scan.address, false);
> > > > - if (*result == SCAN_PMD_MAPPED)
> > > > - *result = SCAN_SUCCEED;
> > > > - mmap_read_unlock(mm);
> > > > - }
> > > > - } else {
> > > > - *result = collapse_scan_pmd(mm, vma,
> > > > - khugepaged_scan.address, &mmap_locked, cc);
> > > > - }
> > > > -
> > > > - if (*result == SCAN_SUCCEED)
> > > > - ++khugepaged_pages_collapsed;
> > > >
> > > > + *result = collapse_single_pmd(khugepaged_scan.address,
> > > > + vma, &mmap_locked, cc);
> > > > /* move to next address */
> > > > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > > > progress += HPAGE_PMD_NR;
> > > > @@ -2799,34 +2818,19 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > > > mmap_assert_locked(mm);
> > > > memset(cc->node_load, 0, sizeof(cc->node_load));
> > > > nodes_clear(cc->alloc_nmask);
> > > > - if (!vma_is_anonymous(vma)) {
> > > > - struct file *file = get_file(vma->vm_file);
> > > > - pgoff_t pgoff = linear_page_index(vma, addr);
> > > >
> > > > - mmap_read_unlock(mm);
> > > > - mmap_locked = false;
> > > > - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > > > - fput(file);
> > > > - } else {
> > > > - result = collapse_scan_pmd(mm, vma, addr,
> > > > - &mmap_locked, cc);
> > > > - }
> > > > + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > > > +
> > >
> > > Ack the fact you noted the behaviour change re:
> > > collapse_test_exit_or_disable() that seems fine.
> > >
> > > > if (!mmap_locked)
> > > > *lock_dropped = true;
> > > >
> > > > -handle_result:
> > > > switch (result) {
> > > > case SCAN_SUCCEED:
> > > > case SCAN_PMD_MAPPED:
> > > > ++thps;
> > > > break;
> > > > - case SCAN_PTE_MAPPED_HUGEPAGE:
> > > > - BUG_ON(mmap_locked);
> > > > - mmap_read_lock(mm);
> > > > - result = collapse_pte_mapped_thp(mm, addr, true);
> > > > - mmap_read_unlock(mm);
> > > > - goto handle_result;
> > >
> > > One thing that differs with new code her is we filter SCAN_PMD_MAPPED to
> > > SCAN_SUCCEED.
> > >
> > > I was about to say 'but ++thps - is this correct' but now I realise this
> > > was looping back on itself with a goto to do just that... ugh ye gads.
> > >
> > > Anwyay that's fine because it doesn't change anything.
> > >
> > > Re: switch statement in general, again would be good to always have each
> > > scan possibility in switch statements, but perhaps given so many not
> > > practical :)
> >
> > Yeah it may be worth investigating for future changes I have for
> > khugepaged (including the new switch statement I implement later and
> > you commented on)
>
> Ack yeah this can be one for the future!
>
> > >
> > > (that way the compiler warns on missing a newly added enum val)
> > >
> > > > /* Whitelisted set of results where continuing OK */
> > > > + case SCAN_PTE_MAPPED_HUGEPAGE:
> > > > case SCAN_PMD_NULL:
> > > > case SCAN_PTE_NON_PRESENT:
> > > > case SCAN_PTE_UFFD_WP:
> > > > --
> >
> > Thanks for the review :)
>
> No probs, to underline as well - the critique is to make sure we get this right,
> my aim here is to get your series landed in as good a form as possible :)
All critiquing is welcome and appreciated :) The refactoring looks
much better now too!
Cheers,
-- Nico
>
> >
> > -- Nico
> > > > 2.50.1
> > > >
> > >
> >
>
> Cheers, Lorenzo
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2025-08-24 1:37 ` Wei Yang
@ 2025-08-26 13:46 ` Nico Pache
0 siblings, 0 replies; 75+ messages in thread
From: Nico Pache @ 2025-08-26 13:46 UTC (permalink / raw)
To: Wei Yang
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap, hughd
On Sat, Aug 23, 2025 at 7:37 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> Hi, Nico
>
> Some nit below.
>
> On Tue, Aug 19, 2025 at 07:41:55AM -0600, Nico Pache wrote:
> >For khugepaged to support different mTHP orders, we must generalize this
> >to check if the PMD is not shared by another VMA and the order is enabled.
> >
> >To ensure madvise_collapse can support working on mTHP orders without the
> >PMD order enabled, we need to convert hugepage_vma_revalidate to take a
> >bitmap of orders.
> >
> >No functional change in this patch.
> >
> >Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >Acked-by: David Hildenbrand <david@redhat.com>
> >Co-developed-by: Dev Jain <dev.jain@arm.com>
> >Signed-off-by: Dev Jain <dev.jain@arm.com>
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >---
> > mm/khugepaged.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index b7b98aebb670..2d192ec961d2 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
>
> There is a comment above this function, which says "revalidate vma before
> taking mmap_lock".
>
> I am afraid it is "after taking mmap_lock"? or "after taking mmap_lock again"?
Good catch, never noticed that. I updated the comment!
>
> >@@ -917,7 +917,7 @@ static int collapse_find_target_node(struct collapse_control *cc)
> > static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > bool expect_anon,
> > struct vm_area_struct **vmap,
> >- struct collapse_control *cc)
> >+ struct collapse_control *cc, unsigned long orders)
> > {
> > struct vm_area_struct *vma;
> > enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> >@@ -930,9 +930,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > if (!vma)
> > return SCAN_VMA_NULL;
> >
> >+ /* Always check the PMD order to insure its not shared by another VMA */
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> > return SCAN_ADDRESS_RANGE;
> >- if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> >+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, orders))
> > return SCAN_VMA_CHECK;
> > /*
> > * Anon VMA expected, the address may be unmapped then
>
> Below is a comment, "thp_vma_allowable_order may return".
>
> Since you use thp_vma_allowable_orders, maybe we need to change the comment
> too.
Ack! Thanks for the review!
>
> >@@ -1134,7 +1135,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> >- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> >+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> >+ BIT(HPAGE_PMD_ORDER));
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> >@@ -1168,7 +1170,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> >- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> >+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> >+ BIT(HPAGE_PMD_ORDER));
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> >@@ -2807,7 +2810,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > mmap_read_lock(mm);
> > mmap_locked = true;
> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> >- cc);
> >+ cc, BIT(HPAGE_PMD_ORDER));
> > if (result != SCAN_SUCCEED) {
> > last_fail = result;
> > goto out_nolock;
> >--
> >2.50.1
> >
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-22 14:10 ` David Hildenbrand
2025-08-22 14:49 ` Lorenzo Stoakes
@ 2025-08-28 9:46 ` Baolin Wang
2025-08-28 10:48 ` Dev Jain
1 sibling, 1 reply; 75+ messages in thread
From: Baolin Wang @ 2025-08-28 9:46 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Nico Pache, Dev Jain, linux-mm, linux-doc, linux-kernel,
linux-trace-kernel, ziy, Liam.Howlett, ryan.roberts, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
(Sorry for chiming in late)
On 2025/8/22 22:10, David Hildenbrand wrote:
>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1),
>>> but not sure
>>> if we have to add that for now.
>>
>> Yeah not so sure about this, this is a 'just have to know' too, and
>> yes you
>> might add it to the docs, but people are going to be mightily
>> confused, esp if
>> it's a calculated value.
>>
>> I don't see any other way around having a separate tunable if we don't
>> just have
>> something VERY simple like on/off.
>
> Yeah, not advocating that we add support for other values than 0/511,
> really.
>
>>
>> Also the mentioned issue sounds like something that needs to be fixed
>> elsewhere
>> honestly in the algorithm used to figure out mTHP ranges (I may be
>> wrong - and
>> happy to stand corrected if this is somehow inherent, but reallly
>> feels that
>> way).
>
> I think the creep is unavoidable for certain values.
>
> If you have the first two pages of a PMD area populated, and you allow
> for at least half of the #PTEs to be non/zero, you'd collapse first a
> order-2 folio, then and order-3 ... until you reached PMD order.
>
> So for now we really should just support 0 / 511 to say "don't collapse
> if there are holes" vs. "always collapse if there is at least one pte
> used".
If we only allow setting 0 or 511, as Nico mentioned before, "At 511, no
mTHP collapses would ever occur anyway, unless you have 2MB disabled and
other mTHP sizes enabled. Technically, at 511, only the highest enabled
order would ever be collapsed."
In other words, for the scenario you described, although there are only
2 PTEs present in a PMD, it would still get collapsed into a PMD-sized
THP. In reality, what we probably need is just an order-2 mTHP collapse.
If 'khugepaged_max_ptes_none' is set to 255, I think this would achieve
the desired result: when there are only 2 PTEs present in a PMD, an
order-2 mTHP collapse would succeed, but it wouldn't creep up to an
order-3 mTHP collapse. That's because:
When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while
'bits_set' = 1 (meaning only 1 chunk is present), so 'bits_set >
threshold_bits' is false, and an order-3 mTHP collapse wouldn't be
attempted. No?
So I have some concerns that if we only allow setting 0 or 511, it may
not meet the goal we have for mTHP collapsing.
>>> Because, as raised in the past, I'm afraid nobody on this earth has a
>>> clue how
>>> to set this parameter to values different to 0 (don't waste memory
>>> with khugepaged)
>>> and 511 (page fault behavior).
>>
>> Yup
>>
>>>
>>>
>>> If any other value is set, essentially
>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
>>>
>>> for now and just disable it.
>>
>> Hmm but under what circumstances? I would just say unsupported value
>> not mention
>> mTHP or people who don't use mTHP might find that confusing.
>
> Well, we can check whether any mTHP size is enabled while the value is
> set to something unexpected. We can then even print the problematic
> sizes if we have to.
>
> We could also just just say that if the value is set to something else
> than 511 (which is the default), it will be treated as being "0" when
> collapsing mthp, instead of doing any scaling.
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-28 9:46 ` Baolin Wang
@ 2025-08-28 10:48 ` Dev Jain
2025-08-29 1:55 ` Baolin Wang
0 siblings, 1 reply; 75+ messages in thread
From: Dev Jain @ 2025-08-28 10:48 UTC (permalink / raw)
To: Baolin Wang, David Hildenbrand, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 28/08/25 3:16 pm, Baolin Wang wrote:
> (Sorry for chiming in late)
>
> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1),
>>>> but not sure
>>>> if we have to add that for now.
>>>
>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>> yes you
>>> might add it to the docs, but people are going to be mightily
>>> confused, esp if
>>> it's a calculated value.
>>>
>>> I don't see any other way around having a separate tunable if we
>>> don't just have
>>> something VERY simple like on/off.
>>
>> Yeah, not advocating that we add support for other values than 0/511,
>> really.
>>
>>>
>>> Also the mentioned issue sounds like something that needs to be
>>> fixed elsewhere
>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>> wrong - and
>>> happy to stand corrected if this is somehow inherent, but reallly
>>> feels that
>>> way).
>>
>> I think the creep is unavoidable for certain values.
>>
>> If you have the first two pages of a PMD area populated, and you
>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>> first a
>> order-2 folio, then and order-3 ... until you reached PMD order.
>>
>> So for now we really should just support 0 / 511 to say "don't
>> collapse if there are holes" vs. "always collapse if there is at
>> least one pte used".
>
> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
> no mTHP collapses would ever occur anyway, unless you have 2MB
> disabled and other mTHP sizes enabled. Technically, at 511, only the
> highest enabled order would ever be collapsed."
I didn't understand this statement. At 511, mTHP collapses will occur if
khugepaged cannot get a PMD folio. Our goal is to collapse to the
highest order folio.
>
> In other words, for the scenario you described, although there are
> only 2 PTEs present in a PMD, it would still get collapsed into a
> PMD-sized THP. In reality, what we probably need is just an order-2
> mTHP collapse.
>
> If 'khugepaged_max_ptes_none' is set to 255, I think this would
> achieve the desired result: when there are only 2 PTEs present in a
> PMD, an order-2 mTHP collapse would be successed, but it wouldn’t
> creep up to an order-3 mTHP collapse. That’s because:
> When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while
> 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set >
> threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be
> attempted. No?
>
> So I have some concerns that if we only allow setting 0 or 511, it may
> not meet the goal we have for mTHP collapsing.
>
>>>> Because, as raised in the past, I'm afraid nobody on this earth has
>>>> a clue how
>>>> to set this parameter to values different to 0 (don't waste memory
>>>> with khugepaged)
>>>> and 511 (page fault behavior).
>>>
>>> Yup
>>>
>>>>
>>>>
>>>> If any other value is set, essentially
>>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
>>>>
>>>> for now and just disable it.
>>>
>>> Hmm but under what circumstances? I would just say unsupported value
>>> not mention
>>> mTHP or people who don't use mTHP might find that confusing.
>>
>> Well, we can check whether any mTHP size is enabled while the value
>> is set to something unexpected. We can then even print the
>> problematic sizes if we have to.
>>
>> We could also just just say that if the value is set to something
>> else than 511 (which is the default), it will be treated as being "0"
>> when collapsing mthp, instead of doing any scaling.
>>
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-28 10:48 ` Dev Jain
@ 2025-08-29 1:55 ` Baolin Wang
2025-09-01 16:46 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Baolin Wang @ 2025-08-29 1:55 UTC (permalink / raw)
To: Dev Jain, David Hildenbrand, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 2025/8/28 18:48, Dev Jain wrote:
>
> On 28/08/25 3:16 pm, Baolin Wang wrote:
>> (Sorry for chiming in late)
>>
>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1),
>>>>> but not sure
>>>>> if we have to add that for now.
>>>>
>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>> yes you
>>>> might add it to the docs, but people are going to be mightily
>>>> confused, esp if
>>>> it's a calculated value.
>>>>
>>>> I don't see any other way around having a separate tunable if we
>>>> don't just have
>>>> something VERY simple like on/off.
>>>
>>> Yeah, not advocating that we add support for other values than 0/511,
>>> really.
>>>
>>>>
>>>> Also the mentioned issue sounds like something that needs to be
>>>> fixed elsewhere
>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>> wrong - and
>>>> happy to stand corrected if this is somehow inherent, but reallly
>>>> feels that
>>>> way).
>>>
>>> I think the creep is unavoidable for certain values.
>>>
>>> If you have the first two pages of a PMD area populated, and you
>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>> first a
>>> order-2 folio, then and order-3 ... until you reached PMD order.
>>>
>>> So for now we really should just support 0 / 511 to say "don't
>>> collapse if there are holes" vs. "always collapse if there is at
>>> least one pte used".
>>
>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>> no mTHP collapses would ever occur anyway, unless you have 2MB
>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>> highest enabled order would ever be collapsed."
> I didn't understand this statement. At 511, mTHP collapses will occur if
> khugepaged cannot get a PMD folio. Our goal is to collapse to the
> highest order folio.
Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
mean is, as in the example I gave below, users may only want to allow a
large order collapse when the number of present PTEs reaches half of the
large folio, in order to avoid RSS bloat.
So we might also need to consider whether 255 is a reasonable
configuration for mTHP collapse.
>> In other words, for the scenario you described, although there are
>> only 2 PTEs present in a PMD, it would still get collapsed into a PMD-
>> sized THP. In reality, what we probably need is just an order-2 mTHP
>> collapse.
>>
>> If 'khugepaged_max_ptes_none' is set to 255, I think this would
>> achieve the desired result: when there are only 2 PTEs present in a
>> PMD, an order-2 mTHP collapse would be successed, but it wouldn’t
>> creep up to an order-3 mTHP collapse. That’s because:
>> When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while
>> 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set >
>> threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be
>> attempted. No?
>>
>> So I have some concerns that if we only allow setting 0 or 511, it may
>> not meet the goal we have for mTHP collapsing.
>>
>>>>> Because, as raised in the past, I'm afraid nobody on this earth has
>>>>> a clue how
>>>>> to set this parameter to values different to 0 (don't waste memory
>>>>> with khugepaged)
>>>>> and 511 (page fault behavior).
>>>>
>>>> Yup
>>>>
>>>>>
>>>>>
>>>>> If any other value is set, essentially
>>>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse");
>>>>>
>>>>> for now and just disable it.
>>>>
>>>> Hmm but under what circumstances? I would just say unsupported value
>>>> not mention
>>>> mTHP or people who don't use mTHP might find that confusing.
>>>
>>> Well, we can check whether any mTHP size is enabled while the value
>>> is set to something unexpected. We can then even print the
>>> problematic sizes if we have to.
>>>
>>> We could also just just say that if the value is set to something
>>> else than 511 (which is the default), it will be treated as being "0"
>>> when collapsing mthp, instead of doing any scaling.
>>>
>>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support
2025-08-20 14:22 ` Lorenzo Stoakes
@ 2025-09-01 16:15 ` David Hildenbrand
0 siblings, 0 replies; 75+ messages in thread
From: David Hildenbrand @ 2025-09-01 16:15 UTC (permalink / raw)
To: Lorenzo Stoakes, Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On 20.08.25 16:22, Lorenzo Stoakes wrote:
> On Tue, Aug 19, 2025 at 07:41:57AM -0600, Nico Pache wrote:
>> generalize the order of the __collapse_huge_page_* functions
>> to support future mTHP collapse.
>>
>> mTHP collapse can suffer from incosistant behavior, and memory waste
>> "creep". disable swapin and shared support for mTHP collapse.
I'd just note that mTHP collapse will initially not honor these two
parameters (spell them out), failing if anything is swapped out or shared.
>>
>> No functional changes in this patch.
>>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 62 ++++++++++++++++++++++++++++++++++---------------
>> 1 file changed, 43 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 77e0d8ee59a0..074101d03c9d 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -551,15 +551,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> unsigned long address,
>> pte_t *pte,
>> struct collapse_control *cc,
>> - struct list_head *compound_pagelist)
>> + struct list_head *compound_pagelist,
>> + unsigned int order)
>
> I think it's better if we keep the output var as the last in the order. It's a
> bit weird to have the order specified here.
Also, while at it, just double-tab indent.
>
>> {
>> struct page *page = NULL;
>> struct folio *folio = NULL;
>> pte_t *_pte;
>> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>> bool writable = false;
>> + int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
> This is a weird formulation, I guess we have to go with it to keep things
> consistent-ish, but it's like we have a value for this that is reliant on the
> order always being PMD and then sort of awkwardly adjusting for MTHP.
>
> I guess we're stuck with it though since we have:
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>
> I guess a more sane version of this would be a ratio or something...
Yeah, ratios would have made much more sense ... but people didn't plan
for having something that is not a PMD size.
>
> Anyway probably out of scope here.
>
>>
>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_pte = pte; _pte < pte + (1 << order);
>
> Hmm is this correct? I think shifting an int is probably a bad idea even if we
> can get away with it for even PUD order atm (though... 64KB ARM hm), wouldn't
> _BITUL(order) be better?
Just for completeness: we discussed that recently in another context. It
would be better to use BIT() instead of _BITUL().
But I am not a fan of using BIT() when not working with bitmaps etc.
"1ul << order" etc. is used throughout the code base.
What makes this easier to read is:
const unsigned long nr_pages = 1ul << order;
Maybe in the future we want an ORDER_PAGES(), maybe not. Have not made
up my mind yet :)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
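For reference, a tiny self-contained sketch (userspace, not the kernel code itself) of how the scaled max_ptes_none and the nr_pages idiom above behave, assuming 4K base pages and the default value of 511:

	#include <stdio.h>

	#define HPAGE_PMD_ORDER 9		/* assumption: 4K base pages */
	#define KHUGEPAGED_MAX_PTES_NONE 511	/* the default */

	int main(void)
	{
		for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++) {
			/* "const unsigned long nr_pages = 1ul << order;" as suggested above */
			const unsigned long nr_pages = 1UL << order;
			int scaled = KHUGEPAGED_MAX_PTES_NONE >> (HPAGE_PMD_ORDER - order);

			printf("order %u: %lu PTEs, up to %d may be none\n",
			       order, nr_pages, scaled);
		}
		return 0;
	}

So the default 511 scales to 255 at order 8, 127 at order 7, and so on down to 7 at order 3 and 3 at order 2.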
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
` (14 preceding siblings ...)
2025-08-21 15:01 ` Lorenzo Stoakes
@ 2025-09-01 16:21 ` David Hildenbrand
2025-09-01 17:06 ` David Hildenbrand
15 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-09-01 16:21 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap, hughd
On 19.08.25 15:41, Nico Pache wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
>
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of
> pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the
> PMD scan is done, we do binary recursion on the bitmap to find the optimal
> mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
> during the scan, to make sure we account for the whole PMD range. When no
> mTHP size is enabled, the legacy behavior of khugepaged is maintained.
> max_ptes_none will be scaled by the attempted collapse order to determine
> how full a mTHP must be to be eligible for the collapse to occur. If a
> mTHP collapse is attempted, but contains swapped out, or shared pages, we
> don't perform the collapse. It is now also possible to collapse to mTHPs
> without requiring the PMD THP size to be enabled.
>
> With the default max_ptes_none=511, the code should keep its most of its
> original behavior. When enabling multiple adjacent (m)THP sizes we need to
> set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
> experience collapse "creep" and constantly promote mTHPs to the next
> available size. This is due the fact that a collapse will introduce at
> least 2x the number of pages, and on a future scan will satisfy the
> promotion condition once again.
>
> Patch 1: Refactor/rename hpage_collapse
> Patch 2: Some refactoring to combine madvise_collapse and khugepaged
> Patch 3-5: Generalize khugepaged functions for arbitrary orders
> Patch 6-8: The mTHP patches
> Patch 9-10: Allow khugepaged to operate without PMD enabled
> Patch 11-12: Tracing/stats
> Patch 13: Documentation
Would it be feasible to start with simply not supporting the
max_ptes_none parameter in the first version, just like we won't support
max_ptes_swap/max_ptes_shared in the first version?
That gives us more time to think about how to use/modify the old interface.
For example, I could envision a ratio-based interface, or as discussed
with Lorenzo a simple boolean. We could make the existing max_ptes*
interface backwards compatible then.
That also gives us the opportunity to think about the creep problem
separately.
I'm sure initial mTHP collapse will be valuable even without support for
that weird set of parameters.
Would there be a problem implementation-wise?
But let me think further about the creep problem ... :/
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-08-29 1:55 ` Baolin Wang
@ 2025-09-01 16:46 ` David Hildenbrand
2025-09-02 2:28 ` Baolin Wang
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-09-01 16:46 UTC (permalink / raw)
To: Baolin Wang, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 29.08.25 03:55, Baolin Wang wrote:
>
>
> On 2025/8/28 18:48, Dev Jain wrote:
>>
>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>> (Sorry for chiming in late)
>>>
>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>> but not sure
>>>>>> if we have to add that for now.
>>>>>
>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>> yes you
>>>>> might add it to the docs, but people are going to be mightily
>>>>> confused, esp if
>>>>> it's a calculated value.
>>>>>
>>>>> I don't see any other way around having a separate tunable if we
>>>>> don't just have
>>>>> something VERY simple like on/off.
>>>>
>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>> really.
>>>>
>>>>>
>>>>> Also the mentioned issue sounds like something that needs to be
>>>>> fixed elsewhere
>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>> wrong - and
>>>>> happy to stand corrected if this is somehow inherent, but really
>>>>> feels that
>>>>> way).
>>>>
>>>> I think the creep is unavoidable for certain values.
>>>>
>>>> If you have the first two pages of a PMD area populated, and you
>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>> first an
>>>> order-2 folio, then an order-3 ... until you reach PMD order.
>>>>
>>>> So for now we really should just support 0 / 511 to say "don't
>>>> collapse if there are holes" vs. "always collapse if there is at
>>>> least one pte used".
>>>
>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>> highest enabled order would ever be collapsed."
>> I didn't understand this statement. At 511, mTHP collapses will occur if
>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>> highest order folio.
>
> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
> mean is, as in the example I gave below, users may only want to allow a
> large order collapse when the number of present PTEs reaches half of the
> large folio, in order to avoid RSS bloat.
How do these users control allocation at fault time where this parameter
is completely ignored?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
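To put rough numbers on the creep described above: assume 4K pages and max_ptes_none = 256, so after scaling, half of every candidate region may be none. Start with only the first 2 PTEs of a PMD range populated:

	order-2:   2 of   4 PTEs present (2 none allowed)   -> collapse,   4 present afterwards
	order-3:   4 of   8 PTEs present (4 none allowed)   -> collapse,   8 present afterwards
	order-4:   8 of  16 PTEs present (8 none allowed)   -> collapse,  16 present afterwards
	...
	order-9: 256 of 512 PTEs present (256 none allowed) -> collapse to a PMD THP

Each collapse doubles the number of present PTEs, which is exactly enough to pass the scaled threshold at the next order on a later scan, so two touched pages eventually ratchet the whole range up to a PMD THP.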
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-01 16:21 ` David Hildenbrand
@ 2025-09-01 17:06 ` David Hildenbrand
0 siblings, 0 replies; 75+ messages in thread
From: David Hildenbrand @ 2025-09-01 17:06 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
dev.jain, corbet, rostedt, mhiramat, mathieu.desnoyers, akpm,
baohua, willy, peterx, wangkefeng.wang, usamaarif642, sunnanyong,
vishal.moola, thomas.hellstrom, yang, kirill.shutemov, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes, rientjes,
mhocko, rdunlap, hughd
On 01.09.25 18:21, David Hildenbrand wrote:
> On 19.08.25 15:41, Nico Pache wrote:
>> The following series provides khugepaged with the capability to collapse
>> anonymous memory regions to mTHPs.
>>
>> To achieve this we generalize the khugepaged functions to no longer depend
>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of
>> pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the
>> PMD scan is done, we do binary recursion on the bitmap to find the optimal
>> mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
>> during the scan, to make sure we account for the whole PMD range. When no
>> mTHP size is enabled, the legacy behavior of khugepaged is maintained.
>> max_ptes_none will be scaled by the attempted collapse order to determine
>> how full a mTHP must be to be eligible for the collapse to occur. If a
>> mTHP collapse is attempted, but contains swapped out, or shared pages, we
>> don't perform the collapse. It is now also possible to collapse to mTHPs
>> without requiring the PMD THP size to be enabled.
>>
>> With the default max_ptes_none=511, the code should keep its most of its
>> original behavior. When enabling multiple adjacent (m)THP sizes we need to
>> set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
>> experience collapse "creep" and constantly promote mTHPs to the next
>> available size. This is due the fact that a collapse will introduce at
>> least 2x the number of pages, and on a future scan will satisfy the
>> promotion condition once again.
>>
>> Patch 1: Refactor/rename hpage_collapse
>> Patch 2: Some refactoring to combine madvise_collapse and khugepaged
>> Patch 3-5: Generalize khugepaged functions for arbitrary orders
>> Patch 6-8: The mTHP patches
>> Patch 9-10: Allow khugepaged to operate without PMD enabled
>> Patch 11-12: Tracing/stats
>> Patch 13: Documentation
>
> Would it be feasible to start with simply not supporting the
> max_ptes_none parameter in the first version, just like we won't support
> max_ptes_swap/max_ptes_shared in the first version?
>
> That gives us more time to think about how to use/modify the old interface.
>
> For example, I could envision a ratio-based interface, or as discussed
> with Lorenzo a simple boolean. We could make the existing max_ptes*
> interface backwards compatible then.
>
> That also gives us the opportunity to think about the creep problem
> separately.
>
> I'm sure initial mTHP collapse will be valuable even without support for
> that weird set of parameters.
>
> Would there be a problem implementation-wise?
>
> But let me think further about the creep problem ... :/
FWIW, I just looked around and there is documented usage of setting
max_ptes_none to 0 [1, 2, 3].
In essence, I think it can make sense to set it to 0 when an application
wants to manage THP on its own (MADV_COLLAPSE), and avoid khugepaged
interfering. Now, using a system-wide toggle for such a use case is
rather questionable, but it's all we have.
I did not find anything only recommending to set values different to 0
or 511 -- so far.
So *likely* focusing on 0 vs. 511 initially would cover most use cases
out there. Ignoring the parameter initially (require all to be !none)
could of course also work.
[1] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
[2] https://google.github.io/tcmalloc/tuning.html
[3]
https://support.yugabyte.com/hc/en-us/articles/36558155921165-Mitigating-Excessive-RSS-Memory-Usage-Due-to-THP-Transparent-Huge-Pages
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-01 16:46 ` David Hildenbrand
@ 2025-09-02 2:28 ` Baolin Wang
2025-09-02 9:03 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Baolin Wang @ 2025-09-02 2:28 UTC (permalink / raw)
To: David Hildenbrand, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 2025/9/2 00:46, David Hildenbrand wrote:
> On 29.08.25 03:55, Baolin Wang wrote:
>>
>>
>> On 2025/8/28 18:48, Dev Jain wrote:
>>>
>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>> (Sorry for chiming in late)
>>>>
>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>> but not sure
>>>>>>> if we have to add that for now.
>>>>>>
>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>> yes you
>>>>>> might add it to the docs, but people are going to be mightily
>>>>>> confused, esp if
>>>>>> it's a calculated value.
>>>>>>
>>>>>> I don't see any other way around having a separate tunable if we
>>>>>> don't just have
>>>>>> something VERY simple like on/off.
>>>>>
>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>> really.
>>>>>
>>>>>>
>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>> fixed elsewhere
>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>> wrong - and
>>>>>> happy to stand corrected if this is somehow inherent, but really
>>>>>> feels that
>>>>>> way).
>>>>>
>>>>> I think the creep is unavoidable for certain values.
>>>>>
>>>>> If you have the first two pages of a PMD area populated, and you
>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>>> first an
>>>>> order-2 folio, then an order-3 ... until you reach PMD order.
>>>>>
>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>> least one pte used".
>>>>
>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>> highest enabled order would ever be collapsed."
>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>> highest order folio.
>>
>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>> mean is, as in the example I gave below, users may only want to allow a
>> large order collapse when the number of present PTEs reaches half of the
>> large folio, in order to avoid RSS bloat.
>
> How do these users control allocation at fault time where this parameter
> is completely ignored?
Sorry, I did not get your point. Why does the 'max_pte_none' need to
control allocation at fault time? Could you be more specific? Thanks.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-02 2:28 ` Baolin Wang
@ 2025-09-02 9:03 ` David Hildenbrand
2025-09-02 10:34 ` Usama Arif
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-09-02 9:03 UTC (permalink / raw)
To: Baolin Wang, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap, hughd
On 02.09.25 04:28, Baolin Wang wrote:
>
>
> On 2025/9/2 00:46, David Hildenbrand wrote:
>> On 29.08.25 03:55, Baolin Wang wrote:
>>>
>>>
>>> On 2025/8/28 18:48, Dev Jain wrote:
>>>>
>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>>> (Sorry for chiming in late)
>>>>>
>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>>> but not sure
>>>>>>>> if we have to add that for now.
>>>>>>>
>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>>> yes you
>>>>>>> might add it to the docs, but people are going to be mightily
>>>>>>> confused, esp if
>>>>>>> it's a calculated value.
>>>>>>>
>>>>>>> I don't see any other way around having a separate tunable if we
>>>>>>> don't just have
>>>>>>> something VERY simple like on/off.
>>>>>>
>>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>>> really.
>>>>>>
>>>>>>>
>>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>>> fixed elsewhere
>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>>> wrong - and
>>>>>>> happy to stand corrected if this is somehow inherent, but really
>>>>>>> feels that
>>>>>>> way).
>>>>>>
>>>>>> I think the creep is unavoidable for certain values.
>>>>>>
>>>>>> If you have the first two pages of a PMD area populated, and you
>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>>>> first an
>>>>>> order-2 folio, then an order-3 ... until you reach PMD order.
>>>>>>
>>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>>> least one pte used".
>>>>>
>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>>> highest enabled order would ever be collapsed."
>>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>>> highest order folio.
>>>
>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>>> mean is, as in the example I gave below, users may only want to allow a
>>> large order collapse when the number of present PTEs reaches half of the
>>> large folio, in order to avoid RSS bloat.
>>
>> How do these users control allocation at fault time where this parameter
>> is completely ignored?
>
> Sorry, I did not get your point. Why does the 'max_pte_none' need to
> control allocation at fault time? Could you be more specific? Thanks.
The comment over khugepaged_max_ptes_none gives a hint:
/*
* default collapse hugepages if there is at least one pte mapped like
* it would have happened if the vma was large enough during page
* fault.
*
* Note that these are only respected if collapse was initiated by khugepaged.
*/
In the common case (for anything that really cares about RSS bloat) you will just
get a THP during page fault and consequently RSS bloat.
As raised in my other reply, the only documented reason to set max_ptes_none=0 seems
to be when an application later (after once possibly getting a THP already during
page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using
MADV_COLLAPSE.
It's a questionable use case, that already got more problematic with mTHP and page
table reclaim.
Let me explain:
Before mTHP, if someone would MADV_DONTNEED (resulting in
a page table with at least one pte_none entry), there would have been no way we would
get memory over-allocated afterwards with max_ptes_none=0.
(1) Page faults would spot "there is a page table" and just fallback to order-0 pages.
(2) khugepaged was told to not collapse through max_ptes_none=0.
But now:
(A) With mTHP during page-faults, we can just end up over-allocating memory in such
an area again: page faults will simply spot a bunch of pte_nones around the fault area
and install an mTHP.
(B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the
page table. The next page fault will just try installing a PMD THP again, because there is
no PTE table anymore.
So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
corner cases, but I have yet to learn why max_ptes_none=123 would make any sense.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-02 9:03 ` David Hildenbrand
@ 2025-09-02 10:34 ` Usama Arif
2025-09-02 11:03 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Usama Arif @ 2025-09-02 10:34 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap, hughd
On 02/09/2025 10:03, David Hildenbrand wrote:
> On 02.09.25 04:28, Baolin Wang wrote:
>>
>>
>> On 2025/9/2 00:46, David Hildenbrand wrote:
>>> On 29.08.25 03:55, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/8/28 18:48, Dev Jain wrote:
>>>>>
>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>>>> (Sorry for chiming in late)
>>>>>>
>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>>>> but not sure
>>>>>>>>> if we have to add that for now.
>>>>>>>>
>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>>>> yes you
>>>>>>>> might add it to the docs, but people are going to be mightily
>>>>>>>> confused, esp if
>>>>>>>> it's a calculated value.
>>>>>>>>
>>>>>>>> I don't see any other way around having a separate tunable if we
>>>>>>>> don't just have
>>>>>>>> something VERY simple like on/off.
>>>>>>>
>>>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>>>> really.
>>>>>>>
>>>>>>>>
>>>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>>>> fixed elsewhere
>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>>>> wrong - and
>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly
>>>>>>>> feels that
>>>>>>>> way).
>>>>>>>
>>>>>>> I think the creep is unavoidable for certain values.
>>>>>>>
>>>>>>> If you have the first two pages of a PMD area populated, and you
>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>>>>> first an
>>>>>>> order-2 folio, then an order-3 ... until you reach PMD order.
>>>>>>>
>>>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>>>> least one pte used".
>>>>>>
>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>>>> highest enabled order would ever be collapsed."
>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>>>> highest order folio.
>>>>
>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>>>> mean is, as in the example I gave below, users may only want to allow a
>>>> large order collapse when the number of present PTEs reaches half of the
>>>> large folio, in order to avoid RSS bloat.
>>>
>>> How do these users control allocation at fault time where this parameter
>>> is completely ignored?
>>
>> Sorry, I did not get your point. Why does the 'max_pte_none' need to
>> control allocation at fault time? Could you be more specific? Thanks.
>
> The comment over khugepaged_max_ptes_none gives a hint:
>
> /*
> * default collapse hugepages if there is at least one pte mapped like
> * it would have happened if the vma was large enough during page
> * fault.
> *
> * Note that these are only respected if collapse was initiated by khugepaged.
> */
>
> In the common case (for anything that really cares about RSS bloat) you will just
> get a THP during page fault and consequently RSS bloat.
>
> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems
> to be when an application later (after once possibly getting a THP already during
> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using
> MADV_COLLAPSE.
>
> It's a questionable use case, that already got more problematic with mTHP and page
> table reclaim.
>
> Let me explain:
>
> Before mTHP, if someone would MADV_DONTNEED (resulting in
> a page table with at least one pte_none entry), there would have been no way we would
> get memory over-allocated afterwards with max_ptes_none=0.
>
> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages.
> (2) khugepaged was told to not collapse through max_ptes_none=0.
>
> But now:
>
> (A) With mTHP during page-faults, we can just end up over-allocating memory in such
> an area again: page faults will simply spot a bunch of pte_nones around the fault area
> and install an mTHP.
>
> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the
> page table. The next page fault will just try installing a PMD THP again, because there is
> no PTE table anymore.
>
> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
> limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
> corner cases, but I have yet to learn why max_ptes_none=123 would make any sense.
>
>
For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter
memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and
will break down those hugepages and free up zero-filled memory. I have seen in our prod workloads where
the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure,
the memory usage is lower compared to with max_ptes_none = 511, while still keeping the benefits
of THPs like lower TLB misses.
I do agree that the value of max_ptes_none is magical and different workloads can react very differently
to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean
that the memory regression of using THP=always vs THP=madvise is halved.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-02 10:34 ` Usama Arif
@ 2025-09-02 11:03 ` David Hildenbrand
2025-09-02 20:23 ` Usama Arif
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-09-02 11:03 UTC (permalink / raw)
To: Usama Arif, Baolin Wang, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap, hughd
On 02.09.25 12:34, Usama Arif wrote:
>
>
> On 02/09/2025 10:03, David Hildenbrand wrote:
>> On 02.09.25 04:28, Baolin Wang wrote:
>>>
>>>
>>> On 2025/9/2 00:46, David Hildenbrand wrote:
>>>> On 29.08.25 03:55, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/8/28 18:48, Dev Jain wrote:
>>>>>>
>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>>>>> (Sorry for chiming in late)
>>>>>>>
>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>>>>> but not sure
>>>>>>>>>> if we have to add that for now.
>>>>>>>>>
>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>>>>> yes you
>>>>>>>>> might add it to the docs, but people are going to be mightily
>>>>>>>>> confused, esp if
>>>>>>>>> it's a calculated value.
>>>>>>>>>
>>>>>>>>> I don't see any other way around having a separate tunable if we
>>>>>>>>> don't just have
>>>>>>>>> something VERY simple like on/off.
>>>>>>>>
>>>>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>>>>> really.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>>>>> fixed elsewhere
>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>>>>> wrong - and
>>>>>>>>> happy to stand corrected if this is somehow inherent, but really
>>>>>>>>> feels that
>>>>>>>>> way).
>>>>>>>>
>>>>>>>> I think the creep is unavoidable for certain values.
>>>>>>>>
>>>>>>>> If you have the first two pages of a PMD area populated, and you
>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>>>>>> first an
>>>>>>>> order-2 folio, then an order-3 ... until you reach PMD order.
>>>>>>>>
>>>>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>>>>> least one pte used".
>>>>>>>
>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>>>>> highest enabled order would ever be collapsed."
>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>>>>> highest order folio.
>>>>>
>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>>>>> mean is, as in the example I gave below, users may only want to allow a
>>>>> large order collapse when the number of present PTEs reaches half of the
>>>>> large folio, in order to avoid RSS bloat.
>>>>
>>>> How do these users control allocation at fault time where this parameter
>>>> is completely ignored?
>>>
>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to
>>> control allocation at fault time? Could you be more specific? Thanks.
>>
>> The comment over khugepaged_max_ptes_none gives a hint:
>>
>> /*
>> * default collapse hugepages if there is at least one pte mapped like
>> * it would have happened if the vma was large enough during page
>> * fault.
>> *
>> * Note that these are only respected if collapse was initiated by khugepaged.
>> */
>>
>> In the common case (for anything that really cares about RSS bloat) you will just
>> get a THP during page fault and consequently RSS bloat.
>>
>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems
>> to be when an application later (after once possibly getting a THP already during
>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using
>> MADV_COLLAPSE.
>>
>> It's a questionable use case, that already got more problematic with mTHP and page
>> table reclaim.
>>
>> Let me explain:
>>
>> Before mTHP, if someone would MADV_DONTNEED (resulting in
>> a page table with at least one pte_none entry), there would have been no way we would
>> get memory over-allocated afterwards with max_ptes_none=0.
>>
>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages.
>> (2) khugepaged was told to not collapse through max_ptes_none=0.
>>
>> But now:
>>
>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such
>> an area again: page faults will simply spot a bunch of pte_nones around the fault area
>> and install an mTHP.
>>
>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the
>> page table. The next page fault will just try installing a PMD THP again, because there is
>> no PTE table anymore.
>>
>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
>> limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
>> corner cases, but I have yet to learn why max_ptes_none=123 would make any sense.
>>
>>
>
> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter
> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and
> will break down those hugepages and free up zero-filled memory.
You are not really taming page faults, though, you are undoing what page
faults might have messed up :)
> I have seen in our prod workloads where
> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure,
> the memory usage is lower compared to with max_ptes_none = 511, while still keeping the benefits
> of THPs like lower TLB misses.
Thanks for raising that: I think the current behavior is in place such
that you don't bounce back-and-forth between khugepaged collapse and
shrinker-split.
There are likely other ways to achieve that, when we have in mind that
the thp shrinker will install zero pages and max_ptes_none includes
zero pages.
>
> I do agree that the value of max_ptes_none is magical and different workloads can react very differently
> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean
> that the memory regression of using THP=always vs THP=madvise is halved.
To which value would you set it? Just 510? 0?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v10 06/13] khugepaged: add mTHP support
2025-08-20 18:29 ` Lorenzo Stoakes
@ 2025-09-02 20:12 ` Nico Pache
0 siblings, 0 replies; 75+ messages in thread
From: Nico Pache @ 2025-09-02 20:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, corbet,
rostedt, mhiramat, mathieu.desnoyers, akpm, baohua, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, surenb, zokeefe, hannes, rientjes, mhocko,
rdunlap, hughd
On Wed, Aug 20, 2025 at 12:30 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 19, 2025 at 07:41:58AM -0600, Nico Pache wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we don't bailout early and miss potential mTHP candidates.
> >
> > A new function collapse_scan_bitmap is used to perform binary recursion on
> > the bitmap and determine the best eligible order for the collapse.
> > A stack struct is used instead of traditional recursion. max_ptes_none
> > will be scaled by the attempted collapse order to determine how "full" an
> > order must be before being considered for collapse.
> >
> > Once we determine what mTHP sizes fits best in that PMD range a collapse
> > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > order supported by anon memory.
> >
> > For orders configured with "always", we perform greedy collapsing
> > to that order without considering bit density.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we don't perform the collapse. This is because adding new entries
> > can lead to new none pages, and these may lead to constant promotion into
> > a higher order (m)THP. A similar issue can occur with "max_ptes_none >
> > HPAGE_PMD_NR/2" due to the fact that a collapse will introduce at least 2x
> > the number of pages, and on a future scan will satisfy the promotion
> > condition once again.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the non-PMD case this is not true, and we must keep the lock to prevent
> > changes to the VMA from occurring.
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
>
> Yes I think this has to remain the case unfortunately as we override
> sysfs-specified orders for MADV_COLLAPSE and there's no sensible way to
> determine what order we ought to be using.
>
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> You've gone from small incremental changes to a huge one here... for the
> sake of reviewer sanity at least, any chance of breaking this up?
I had this as two patches (one for the bitmap and one for implementing
it), but I was asked to squash them :/
>
> > ---
> > include/linux/khugepaged.h | 4 +
> > mm/khugepaged.c | 236 +++++++++++++++++++++++++++++--------
> > 2 files changed, 188 insertions(+), 52 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index eb1946a70cff..d12cdb9ef3ba 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> > /* SPDX-License-Identifier: GPL-2.0 */
> > #ifndef _LINUX_KHUGEPAGED_H
> > #define _LINUX_KHUGEPAGED_H
> > +#define KHUGEPAGED_MIN_MTHP_ORDER 2
>
> I guess this makes sense as by definition 2 pages is least it could
> possibly be.
Order, so 4 pages, 16kB mTHP
>
> > +#define KHUGEPAGED_MIN_MTHP_NR (1 << KHUGEPAGED_MIN_MTHP_ORDER)
>
> Surely KHUGEPAGED_MIN_NR_MTHP_PTES would be more meaningful?
Sure!
>
> > +#define MAX_MTHP_BITMAP_SIZE (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
>
> This is confusing - size of what?
We need it like this due to ppc64 (and maybe others?); it used to be
based on PMD_ORDER, but some arches fail to compile because the PMD
size is only known at boot time.
ilog2(MAX_PTRS_PER_PTE) compiles to 9 on arches that have 512 PTEs,
so 1 << (9 - 2) == 128.
>
> If it's number of bits surely this should be ilog2(MAX_PTRS_PER_PTE) -
> KHUGEPAGED_MIN_MTHP_ORDER?
This would only be 7? We need a 128-bit bitmap.
>
> This seems to be more so 'the maximum value that could contain the bits right?
>
> I think this is just wrong though, see below at DECLARE_BITMAP() stuff.
>
> > +#define MTHP_BITMAP_SIZE (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>
> Hard to know how this relates to MAX_MTHP_BITMAP_SIZE?
>
> I guess this is the current bitmap size indicating all that is possible,
> but if these are all #define's what is this accomplishing?
One for compile time, one for runtime. Kind of annoying, but it was the
easiest solution given the architecture limitations.
>
> For all - please do not do (1 << xxx)! This can lead to sign-extension bugs at least
> in theory, use _BITUL(...), it's neater too.
ack, thanks!
>
> NIT but the whitespace is all screwed up here.
>
> KHUGEPAGED_MIN_MTHP_ORDER and KHUGEPAGED_MIN_MTHP_NR
>
> >
> > #include <linux/mm.h>
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 074101d03c9d..1ad7e00d3fd6 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> > static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > + u8 order;
> > + u16 offset;
> > +};
> > +
> > struct collapse_control {
> > bool is_khugepaged;
> >
> > @@ -102,6 +107,18 @@ struct collapse_control {
> >
> > /* nodemask for allocation fallback */
> > nodemask_t alloc_nmask;
> > +
> > + /*
> > + * bitmap used to collapse mTHP sizes.
> > + * 1bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
>
> I'm not sure what this '1bit = xxx' comment means?
A single bit represents 1 << MIN_MTHP_ORDER (4) pages. I'll express that better.
>
> > + */
> > + DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>
> Hmm this seems wrong.
Should be a bitmap with 128 bits (for 4k page size). Not sure what's wrong here.
>
> DECLARE_BITMAP(..., val) is expressed as:
>
> #define DECLARE_BITMAP(name,bits) \
> unsigned long name[BITS_TO_LONGS(bits)]
>
> So the 2nd param should be number of bits.
>
> But MAX_MTHP_BITMAP_SIZE is:
>
> (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
>
> So typically:
>
> (1 << (9 - 2)) = 128
>
> And BITS_TO_LONGS is defined as:
>
> __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
>
> So essentially this will be 128 / 8 on a 64-bit system so 16 bytes to
> store... 7 bits?
I think you mean 64. 8 would be BYTES_PER_TYPE
>
> Unless I'm missing something here?
Hmm, unless the DECLARE_BITMAP is being used incorrectly in multiple
places, DECLARE_BITMAP(..., # of bits) is how this is intended to be
used.
I think it's an array of unsigned longs, so each part of the name[] is
already 64 bits, hence the divide.
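Spelling the sizing out for the 4K-page case (MAX_PTRS_PER_PTE = 512, KHUGEPAGED_MIN_MTHP_ORDER = 2), based on the definitions quoted above:

	MAX_MTHP_BITMAP_SIZE = 1 << (ilog2(512) - 2) = 1 << 7 = 128	/* bits */

	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
	/* expands to: */
	unsigned long mthp_bitmap[BITS_TO_LONGS(128)];	/* == unsigned long mthp_bitmap[2] */

i.e. the 16 bytes computed above store 128 bits, one per order-2 chunk of the PMD range.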
>
> > + DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>
> Same comment as above obviously. But also this is kind of horrible, why are
> we putting a copy of this entire bitmap on the stack every time we declare
> a cc?
The temp one is used as a scratch pad; Baolin also finds it useful
in his file mTHP collapse work for another purpose as well.
In general khugepaged always uses the same cc, so it avoids
having to constantly allocate this.
>
> > + struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > + .is_khugepaged = true,
> > };
>
> Why are we moving this here?
Because if not it doesn't compile.
>
> >
> > /**
> > @@ -854,10 +871,6 @@ static void khugepaged_alloc_sleep(void)
> > remove_wait_queue(&khugepaged_wait, &wait);
> > }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > - .is_khugepaged = true,
> > -};
> > -
> > static bool collapse_scan_abort(int nid, struct collapse_control *cc)
> > {
> > int i;
> > @@ -1136,17 +1149,19 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> > static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > int referenced, int unmapped,
> > - struct collapse_control *cc)
> > + struct collapse_control *cc, bool *mmap_locked,
> > + unsigned int order, unsigned long offset)
> > {
> > LIST_HEAD(compound_pagelist);
> > pmd_t *pmd, _pmd;
> > - pte_t *pte;
> > + pte_t *pte = NULL, mthp_pte;
> > pgtable_t pgtable;
> > struct folio *folio;
> > spinlock_t *pmd_ptl, *pte_ptl;
> > int result = SCAN_FAIL;
> > struct vm_area_struct *vma;
> > struct mmu_notifier_range range;
> > + unsigned long _address = address + offset * PAGE_SIZE;
>
> This name is really horrible. please name it sensibly.
>
> It feels like address ought to be consistently the base of the THP or mTHP
> we wish to collapse, and if we need something PMD aligned for some reason
> we should rename _that_ to e.g. pmd_address.
>
> Orrr it could be mthp_address...
>
> Perhaps we could just figure that out here and pass only the
> address... aligning to PMD boundary shouldn't be hard/costly.
>
> But it may indicate we need further refactorisation so we don't need to
> paper over cracks + pass around a PMD address to do things when that may
> not be where the (m)THP range begins.
Ok, I'll rename them, but we still need to know the PMD address as we
rely on it for a few key operations.
Can we leave _address and rename address to pmd_address?
>
> >
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1155,16 +1170,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * The allocation can take potentially a long time if it involves
> > * sync compaction, and we do not need to hold the mmap_lock during
> > * that. We will recheck the vma after taking it again in write mode.
> > + * If collapsing mTHPs we may have already released the read_lock.
> > */
> > - mmap_read_unlock(mm);
> > + if (*mmap_locked) {
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + }
> >
> > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > + result = alloc_charge_folio(&folio, mm, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > - BIT(HPAGE_PMD_ORDER));
> > + *mmap_locked = true;
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
>
> I mean this is kind of going back to previous commits, but it's really ugly
> to pass a BIT(xxx) here, is that really necessary? Can't we just pass in
> the order?
Yes and no... currently we only ever pass the bit of the current order
so we could get away with it, but to generalize it we want the ability
to pass a bitmap of the available orders. Like in the case of future
madvise_collapse support, we would need to pass a bitmap of possible
orders.
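As a hypothetical illustration of the difference (not code from this series): khugepaged today effectively passes a single-bit mask, while a future MADV_COLLAPSE path could pass several candidate orders at once:

	unsigned long orders;

	/* khugepaged: only the current collapse order is a candidate */
	orders = BIT(order);
	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, orders);

	/* hypothetical future MADV_COLLAPSE: several enabled orders at once */
	orders = BIT(HPAGE_PMD_ORDER) | BIT(6) | BIT(4);
	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, orders);

which is why the parameter is a bitmap rather than a plain order.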
>
> It's also inconsistent with other calls like
> e.g. __collapse_huge_page_swapin() below which passes the order.
>
> Same goes obv. for all such invocations.
>
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > @@ -1182,13 +1201,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * released when it fails. So we jump out_nolock directly in
> > * that case. Continuing to collapse causes inconsistency.
> > */
> > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > - referenced, HPAGE_PMD_ORDER);
> > + result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > + referenced, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> > }
> >
> > mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > /*
> > * Prevent all access to pagetables with the exception of
> > * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1198,8 +1218,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > - BIT(HPAGE_PMD_ORDER));
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, BIT(order));
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > @@ -1210,11 +1229,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >
> > anon_vma_lock_write(vma->anon_vma);
> >
> > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > - address + HPAGE_PMD_SIZE);
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > + _address + (PAGE_SIZE << order));
>
> This _address is horrible. That really does have to change.
>
> > mmu_notifier_invalidate_range_start(&range);
> >
> > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
>
> Odd whitespace...
>
> > /*
> > * This removes any huge TLB entry from the CPU so we won't allow
> > * huge and small TLB entries for the same virtual address to
> > @@ -1228,19 +1248,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > mmu_notifier_invalidate_range_end(&range);
> > tlb_remove_table_sync_one();
> >
> > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > + pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>
> I see we already have a 'convention' of _ prefix on the pmd param, but two
> wrongs don't make a right...
>
> > if (pte) {
> > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > - &compound_pagelist,
> > - HPAGE_PMD_ORDER);
> > + result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > + &compound_pagelist, order);
> > spin_unlock(pte_ptl);
> > } else {
> > result = SCAN_PMD_NULL;
> > }
> >
> > if (unlikely(result != SCAN_SUCCEED)) {
> > - if (pte)
> > - pte_unmap(pte);
>
> Why are we removing this?
>
> > spin_lock(pmd_ptl);
> > BUG_ON(!pmd_none(*pmd));
> > /*
> > @@ -1255,17 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > }
> >
> > /*
> > - * All pages are isolated and locked so anon_vma rmap
> > - * can't run anymore.
> > + * For PMD collapse all pages are isolated and locked so anon_vma
> > + * rmap can't run anymore
> > */
> > - anon_vma_unlock_write(vma->anon_vma);
> > + if (order == HPAGE_PMD_ORDER)
> > + anon_vma_unlock_write(vma->anon_vma);
>
> Hmm this is introducing a horrible new way for things to go wrong. And
> there's now a whole host of terrible error paths that can go wrong very
> easily around rmap locks and yeah, no way we cannot do it this way.
>
> rmap locks are VERY sensitive and the ordering of the locking matters a
> great deal (see top of mm/rmap.c). So we have to be SO careful here.
>
> I suggest you simply have a boolean 'anon_vma_locked' or something like
> this, and get rid of these horrible additional code paths and the second
> order == HPAGE_PMD_ORDER check.
>
> We'll track whether or not the lock is held and thereby needs releasing
> that way instead.
>
> Also, and very importantly - are you 100% sure you can't possibly have a
> deadlock or issue beyond this point if you don't release the rmap lock?
I double checked, this was added as a fix to an issue Hugh reported.
The gap between these callers is rather small, and I see no way that
it could skip the lock/unlock cycle.
>
> This is veeeery important, as there can be implicit assumptions around
> whether or not one can acquire these locks and you basically have to audit
> ALL code over which this lock is held.
>
> I'm speaking from hard experience here having bumped into this in various
> attempts at work relating to this stuff...
>
> >
> > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > - vma, address, pte_ptl,
> > - &compound_pagelist, HPAGE_PMD_ORDER);
> > - pte_unmap(pte);
> > + vma, _address, pte_ptl,
> > + &compound_pagelist, order);
> > if (unlikely(result != SCAN_SUCCEED))
> > - goto out_up_write;
> > + goto out_unlock_anon_vma;
>
> See above...
>
> >
> > /*
> > * The smp_wmb() inside __folio_mark_uptodate() ensures the
> > @@ -1273,33 +1290,115 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * write.
> > */
> > __folio_mark_uptodate(folio);
> > - pgtable = pmd_pgtable(_pmd);
> > -
> > - _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > - _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > - spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > - folio_add_lru_vma(folio, vma);
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > - set_pmd_at(mm, address, pmd, _pmd);
> > - update_mmu_cache_pmd(vma, address, pmd);
> > - deferred_split_folio(folio, false);
> > - spin_unlock(pmd_ptl);
> > + if (order == HPAGE_PMD_ORDER) {
> > + pgtable = pmd_pgtable(_pmd);
> > + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > + spin_lock(pmd_ptl);
> > + BUG_ON(!pmd_none(*pmd));
>
> I know you're refactoring this, but be good to change this to a
> WARN_ON_ONCE(), BUG_ON() is verboten unless it's absolutely definitely
> going to be a kernel nuclear event, so worth changing things up as we go.
Yeah, I keep seeing those warnings in checkpatch, so I'll go ahead and edit it.
>
> > + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + set_pmd_at(mm, address, pmd, _pmd);
> > + update_mmu_cache_pmd(vma, address, pmd);
> > + deferred_split_folio(folio, false);
> > + spin_unlock(pmd_ptl);
> > + } else { /* mTHP collapse */
> > + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
>
> I guess it's a rule that each THP or mTHP range spanned must span one and
> only one folio.
>
> Not sure &folio->page has a future though.
>
> Maybe better to use folio_page(folio, 0)?
Ok sounds good I'll use that.
>
> > + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > + spin_lock(pmd_ptl);
> > + BUG_ON(!pmd_none(*pmd));
>
> having said the above, this is strictly introducing a new BUG_ON() which is
> a no-no, please make it a WARN_ON_ONCE().
>
> > + folio_ref_add(folio, (1 << order) - 1);
>
> Again no 1 << x please.
>
> Do we do something similar somewhere else for mthp ref counting? Can we
> share code somehow?
Yeah, but IIRC it's only like 2 or 3 places that do something like
this... most callers to folio_add_* do things in slightly different
manners. Maybe something to look into for the future, but I think it
will be difficult to generalize it.
>
> > + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
>
> Please avoid 1 << order, and I think at this point since you reference it a
> bunch of times, just store a local var like nr_pages or sth?
yeah not a bad idea!
>
> > + update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +
> > + smp_wmb(); /* make pte visible before pmd */
>
> Can you give some detail as to why this will work here and why it is
> necessary?
Other parts of the kernel do it when setting ptes before updating the
PMD. I'm not sure if it's necessary, but better safe than sorry.
>
> > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> If we're updating PTE entriess why do we need to assign the PMD entry?
We removed the PMD entry for GUP_fast reasons, then we reinstall the
PMD entry after the mTHP is in place. Same as for PMD collapse.
>
> > + spin_unlock(pmd_ptl);
> > + }
>
> This deeply, badly needs to be refactored into something that both shares
> code and separates out these two operations.
>
> This function is disgustingly long as it is, and that's not your fault, but
> let's try to make things better as we go.
>
> >
> > folio = NULL;
> >
> > result = SCAN_SUCCEED;
> > +out_unlock_anon_vma:
> > + if (order != HPAGE_PMD_ORDER)
> > + anon_vma_unlock_write(vma->anon_vma);
>
> Obviously again as above, we need to simplify this and get rid of this
> whole bit.
>
> > out_up_write:
> > + if (pte)
> > + pte_unmap(pte);
>
> OK I guess you moved this from above down here? Is this a valid place to do this?
Yes; if not, we were potentially unmapping a pte early.
>
> > mmap_write_unlock(mm);
> > out_nolock:
> > + *mmap_locked = false;
>
> This is kind of horrible, we now have pretty mad logic around who sets
> mmap_locked and where.
>
> Can we just do this at the call sites so we avoid that?
>
> I mean anything we do with this is hideous, but that'd be less confusing It
> hink.
>
> > if (folio)
> > folio_put(folio);
> > trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > return result;
> > }
> >
> > +/* Recursive function to consume the bitmap */
>
> Err... please don't? Kernel stack is a seriously finite resource, we do not
> want recursion at all.
>
> But I'm not actually seeing any recursion here? Am I missing something?
>
> > +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > + int referenced, int unmapped, struct collapse_control *cc,
> > + bool *mmap_locked, unsigned long enabled_orders)
>
> This is a complicated and confusing function, it requires a comment
> describing how it works.
Ok will do!
>
> > +{
> > + u8 order, next_order;
> > + u16 offset, mid_offset;
> > + int num_chunks;
> > + int bits_set, threshold_bits;
> > + int top = -1;
>
> Err why do we start at -1 then immediately increment it?
You are correct, it was probably a leftover bit from my development
phase. Seems I can just set it to 0 to begin with.
>
> > + int collapsed = 0;
> > + int ret;
> > + struct scan_bit_state state;
> > + bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
>
> Extraneous outer parens.
ack
>
> > +
> > + cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > + { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
>
> This is the same as
>
> cc->mthp_bitmap_stack[0] = ...;
> top = 1;
>
> No?
No, it would be bitmap_stack[0] = ...,
then top goes back to -1 when we pop (at state = ...), and if we then
add more items (next_order) to the stack it goes to top = 1 (one push
for each half of the split).
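So for one push/pop cycle it goes roughly (pseudo):

    top = -1;                      /* empty stack */
    stack[++top] = { order, 0 };   /* push: top == 0 */
    state = stack[top--];          /* pop:  top == -1 */
    /* a split then pushes twice, leaving top == 1 */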
>
>
> This is really horrible. Can we just have a helper function for this
> please?
Seems kinda excessive for 4 lines and one caller.
>
> Like:
>
> static int mthp_push_stack(struct collapse_control *cc,
> int index, u8 order, u16 offset)
> {
> struct scan_bit_state *state = &cc->mthp_bitmap_stack[index];
>
> VM_WARN_ON(index >= MAX_MTHP_BITMAP_SIZE);
>
> state->order = order;
> state->offset = offset;
>
> return index + 1;
> }
This would not work in its current state because it's ++index in the
current implementation. I would need to refactor, but the general idea
still stands.
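Adjusted for that, it could look something like this (rough sketch,
reusing the MAX_MTHP_BITMAP_SIZE bound from your version):

    static void mthp_push_stack(struct collapse_control *cc, int *top,
                                u8 order, u16 offset)
    {
            struct scan_bit_state *state;

            /* the new index must still fit in the stack */
            VM_WARN_ON(*top + 1 >= MAX_MTHP_BITMAP_SIZE);

            state = &cc->mthp_bitmap_stack[++*top];
            state->order = order;
            state->offset = offset;
    }

with callers doing e.g.

    mthp_push_stack(cc, &top, HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0);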
>
> And can invoke via:
>
> top = mthp_push_stack(cc, top, order, offset);
>
> Or pass index as a pointer possibly also.
>
> > +
> > + while (top >= 0) {
> > + state = cc->mthp_bitmap_stack[top--];
>
> OK so this is the recursive bit...
>
> Oh man this function so needs a comment describing what it does, seriously.
>
> I think honestly for sake of my own sanity I'm going to hold off reviewing
> the rest of this until there's something describing the algorithm, in
> detail here, above the function.
It's basically binary recursion done with an explicit stack: it checks
regions of the bitmap in descending order (i.e. order 9, order 8, ...),
and if we go to the next order we push two items onto the stack (the
left and right half). I will add a comment describing it at the top of
the function.
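Rough draft of what I have in mind for that comment:

    /*
     * collapse_scan_bitmap() consumes the per-PMD utilization bitmap
     * using an explicit stack instead of real recursion. Each stack
     * entry is an (order, offset) region of the bitmap. We pop the
     * largest region first; if that mTHP order is enabled and the
     * region has enough bits set (checked against max_ptes_none scaled
     * to the order), we attempt the collapse. If the order is disabled
     * or the collapse does not happen, we push the region's two halves
     * at order - 1 and continue, so the scan works from PMD order down
     * to the smallest enabled mTHP order.
     */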
>
> > + order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> > + offset = state.offset;
> > + num_chunks = 1 << (state.order);
> > + /* Skip mTHP orders that are not enabled */
> > + if (!test_bit(order, &enabled_orders))
> > + goto next_order;
> > +
> > + /* copy the relevant section to a new bitmap */
> > + bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > + MTHP_BITMAP_SIZE);
> > +
> > + bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > + threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> > + >> (HPAGE_PMD_ORDER - state.order);
> > +
> > + /* Check if the region is "almost full" based on the threshold */
> > + if (bits_set > threshold_bits || is_pmd_only
> > + || test_bit(order, &huge_anon_orders_always)) {
> > + ret = collapse_huge_page(mm, address, referenced, unmapped,
> > + cc, mmap_locked, order,
> > + offset * KHUGEPAGED_MIN_MTHP_NR);
> > + if (ret == SCAN_SUCCEED) {
> > + collapsed += (1 << order);
> > + continue;
> > + }
> > + }
> > +
> > +next_order:
> > + if (state.order > 0) {
> > + next_order = state.order - 1;
> > + mid_offset = offset + (num_chunks / 2);
> > + cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > + { next_order, mid_offset };
> > + cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > + { next_order, offset };
> > + }
> > + }
> > + return collapsed;
> > +}
> > +
> > static int collapse_scan_pmd(struct mm_struct *mm,
> > struct vm_area_struct *vma,
> > unsigned long address, bool *mmap_locked,
> > @@ -1307,31 +1406,60 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> > {
> > pmd_t *pmd;
> > pte_t *pte, *_pte;
> > + int i;
> > int result = SCAN_FAIL, referenced = 0;
> > int none_or_zero = 0, shared = 0;
> > struct page *page = NULL;
> > struct folio *folio = NULL;
> > unsigned long _address;
> > + unsigned long enabled_orders;
> > spinlock_t *ptl;
> > int node = NUMA_NO_NODE, unmapped = 0;
> > + bool is_pmd_only;
> > bool writable = false;
> > -
> > + int chunk_none_count = 0;
> > + int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> > + unsigned long tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > if (result != SCAN_SUCCEED)
> > goto out;
> >
> > + bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > + bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>
> Having this 'temp' thing on the stack for everyone is just horrid.
As I mentioned above, this serves a very good purpose, and it is also
extended in another series by Baolin to serve a similar purpose.
>
> > memset(cc->node_load, 0, sizeof(cc->node_load));
> > nodes_clear(cc->alloc_nmask);
> > +
> > + if (cc->is_khugepaged)
> > + enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > + tva_flags, THP_ORDERS_ALL_ANON);
> > + else
> > + enabled_orders = BIT(HPAGE_PMD_ORDER);
> > +
> > + is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
>
> This is horrid, can we have a function broken out to do this please?
>
> In general if you keep open coding stuff, just write a static function for
> it, the compiler is smart enough to inline.
ok, we do this in a few places so perhaps it's the best approach.
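Maybe something along these lines (names made up, rough sketch):

    static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
                                                   struct collapse_control *cc)
    {
            /* forced (MADV_COLLAPSE) collapse only targets PMD order here */
            if (!cc->is_khugepaged)
                    return BIT(HPAGE_PMD_ORDER);

            return thp_vma_allowable_orders(vma, vma->vm_flags,
                                            TVA_KHUGEPAGED, THP_ORDERS_ALL_ANON);
    }

    static bool collapse_is_pmd_only(unsigned long enabled_orders)
    {
            return enabled_orders == BIT(HPAGE_PMD_ORDER);
    }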
>
> > +
> > pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > if (!pte) {
> > result = SCAN_PMD_NULL;
> > goto out;
> > }
> >
> > - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > - _pte++, _address += PAGE_SIZE) {
> > + for (i = 0; i < HPAGE_PMD_NR; i++) {
> > + /*
> > + * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> > + * there are pages in this chunk keep track of it in the bitmap
> > + * for mTHP collapsing.
> > + */
> > + if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> > + if (i > 0 && chunk_none_count <= scaled_none)
> > + bitmap_set(cc->mthp_bitmap,
> > + (i - 1) / KHUGEPAGED_MIN_MTHP_NR, 1);
> > + chunk_none_count = 0;
> > + }
>
> This whole thing is really confusing and you are not explaining the
> algorithm here at all.
>
> This requires a comment, and really this bit should be separated out please.
This used to be its own commit, but multiple people wanted it
squashed... ugh. Which should we go with?
>
> > +
> > + _pte = pte + i;
> > + _address = address + i * PAGE_SIZE;
> > pte_t pteval = ptep_get(_pte);
> > if (is_swap_pte(pteval)) {
> > ++unmapped;
> > @@ -1354,10 +1482,11 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> > }
> > }
> > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > + ++chunk_none_count;
> > ++none_or_zero;
> > if (!userfaultfd_armed(vma) &&
> > - (!cc->is_khugepaged ||
> > - none_or_zero <= khugepaged_max_ptes_none)) {
> > + (!cc->is_khugepaged || !is_pmd_only ||
> > + none_or_zero <= khugepaged_max_ptes_none)) {
> > continue;
> > } else {
> > result = SCAN_EXCEED_NONE_PTE;
> > @@ -1453,6 +1582,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> > address)))
> > referenced++;
> > }
> > +
> > if (!writable) {
> > result = SCAN_PAGE_RO;
> > } else if (cc->is_khugepaged &&
> > @@ -1465,10 +1595,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> > out_unmap:
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > - result = collapse_huge_page(mm, address, referenced,
> > - unmapped, cc);
> > - /* collapse_huge_page will return with the mmap_lock released */
> > - *mmap_locked = false;
> > + result = collapse_scan_bitmap(mm, address, referenced, unmapped, cc,
> > + mmap_locked, enabled_orders);
> > + if (result > 0)
> > + result = SCAN_SUCCEED;
> > + else
> > + result = SCAN_FAIL;
>
> We're reusing result as both an enum value and as storage for the number of
> collapsed PTE entries?
>
> Can we just use a new local variable? Thanks
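>
> Something like this (sketch, nr_collapsed just a suggested name):
>
> 	int nr_collapsed;
>
> 	nr_collapsed = collapse_scan_bitmap(mm, address, referenced, unmapped,
> 					    cc, mmap_locked, enabled_orders);
> 	result = nr_collapsed > 0 ? SCAN_SUCCEED : SCAN_FAIL;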
>
> > }
> > out:
> > trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
> > --
> > 2.50.1
> >
>
> I will review the bitmap/chunk stuff in more detail once the algorithm is
> commented.
ok thanks for the review.
>
> Cheers, Lorenzo
>
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-02 11:03 ` David Hildenbrand
@ 2025-09-02 20:23 ` Usama Arif
2025-09-03 3:27 ` Baolin Wang
0 siblings, 1 reply; 75+ messages in thread
From: Usama Arif @ 2025-09-02 20:23 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap, hughd
On 02/09/2025 12:03, David Hildenbrand wrote:
> On 02.09.25 12:34, Usama Arif wrote:
>>
>>
>> On 02/09/2025 10:03, David Hildenbrand wrote:
>>> On 02.09.25 04:28, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/9/2 00:46, David Hildenbrand wrote:
>>>>> On 29.08.25 03:55, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/8/28 18:48, Dev Jain wrote:
>>>>>>>
>>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>>>>>> (Sorry for chiming in late)
>>>>>>>>
>>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>>>>>> but not sure
>>>>>>>>>>> if we have to add that for now.
>>>>>>>>>>
>>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>>>>>> yes you
>>>>>>>>>> might add it to the docs, but people are going to be mightily
>>>>>>>>>> confused, esp if
>>>>>>>>>> it's a calculated value.
>>>>>>>>>>
>>>>>>>>>> I don't see any other way around having a separate tunable if we
>>>>>>>>>> don't just have
>>>>>>>>>> something VERY simple like on/off.
>>>>>>>>>
>>>>>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>>>>>> really.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>>>>>> fixed elsewhere
>>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>>>>>> wrong - and
>>>>>>>>>> happy to stand corrected if this is somehow inherent, but really feels that
>>>>>>>>>> feels that
>>>>>>>>>> way).
>>>>>>>>>
>>>>>>>>> I think the creep is unavoidable for certain values.
>>>>>>>>>
>>>>>>>>> If you have the first two pages of a PMD area populated, and you
>>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>>>>>>> first an
>>>>>>>>> order-2 folio, then an order-3 ... until you reached PMD order.
>>>>>>>>>
>>>>>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>>>>>> least one pte used".
>>>>>>>>
>>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>>>>>> highest enabled order would ever be collapsed."
>>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>>>>>> highest order folio.
>>>>>>
>>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>>>>>> mean is, as in the example I gave below, users may only want to allow a
>>>>>> large order collapse when the number of present PTEs reaches half of the
>>>>>> large folio, in order to avoid RSS bloat.
>>>>>
>>>>> How do these users control allocation at fault time where this parameter
>>>>> is completely ignored?
>>>>
>>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to
>>>> control allocation at fault time? Could you be more specific? Thanks.
>>>
>>> The comment over khugepaged_max_ptes_none gives a hint:
>>>
>>> /*
>>> * default collapse hugepages if there is at least one pte mapped like
>>> * it would have happened if the vma was large enough during page
>>> * fault.
>>> *
>>> * Note that these are only respected if collapse was initiated by khugepaged.
>>> */
>>>
>>> In the common case (for anything that really cares about RSS bloat) you will just
>>> get a THP during page fault and consequently RSS bloat.
>>>
>>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems
>>> to be when an application later (after once possibly getting a THP already during
>>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using
>>> MADV_COLLAPSE.
>>>
>>> It's a questionable use case, that already got more problematic with mTHP and page
>>> table reclaim.
>>>
>>> Let me explain:
>>>
>>> Before mTHP, if someone would MADV_DONTNEED (resulting in
>>> a page table with at least one pte_none entry), there would have been no way we would
>>> get memory over-allocated afterwards with max_ptes_none=0.
>>>
>>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages.
>>> (2) khugepaged was told to not collapse through max_ptes_none=0.
>>>
>>> But now:
>>>
>>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such
>>> an area again: page faults will simply spot a bunch of pte_nones around the fault area
>>> and install an mTHP.
>>>
>>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the
>>> page table. The next page fault will just try installing a PMD THP again, because there is
>>> no PTE table anymore.
>>>
>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
>>> limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense.
>>>
>>>
>>
>> For PMD mapped THPs with the THP shrinker, this has changed. You can basically tame page faults, as when you encounter
>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR - 1 (i.e. 511 for x86), and
>> will break down those hugepages and free up zero-filled memory.
>
> You are not really taming page faults, though, you are undoing what page faults might have messed up :)
>
> I have seen in our prod workloads where
>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure,
>> the memory usage is lower compared to with max_ptes_none = 511, while still keeping the benefits
>> of THPs like lower TLB misses.
>
> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split.
>
Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other.
> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes
> zero pages.
>
>>
>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently
>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean
>> that the memory regression of using THP=always vs THP=madvise is halved.
>
> To which value would you set it? Just 510? 0?
>
There are some very large workloads in the meta fleet that I experimented with and found that having
a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal
compromise in terms of improving application metrics, having an acceptable amount of memory regression and
improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out
there for these workloads, but it was not possible to experiment with every value.
In terms of wider rollout across the fleet, we are going to target 0 (or a very, very small value)
when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression, as
the THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse
pages that are dominated by 4K zero-filled chunks.
* Re: [PATCH v10 00/13] khugepaged: mTHP support
2025-09-02 20:23 ` Usama Arif
@ 2025-09-03 3:27 ` Baolin Wang
0 siblings, 0 replies; 75+ messages in thread
From: Baolin Wang @ 2025-09-03 3:27 UTC (permalink / raw)
To: Usama Arif, David Hildenbrand, Dev Jain, Lorenzo Stoakes
Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-trace-kernel,
ziy, Liam.Howlett, ryan.roberts, corbet, rostedt, mhiramat,
mathieu.desnoyers, akpm, baohua, willy, peterx, wangkefeng.wang,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kirill.shutemov,
aarcange, raquini, anshuman.khandual, catalin.marinas, tiwai,
will, dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap, hughd
On 2025/9/3 04:23, Usama Arif wrote:
>
>
> On 02/09/2025 12:03, David Hildenbrand wrote:
>> On 02.09.25 12:34, Usama Arif wrote:
>>>
>>>
>>> On 02/09/2025 10:03, David Hildenbrand wrote:
>>>> On 02.09.25 04:28, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/9/2 00:46, David Hildenbrand wrote:
>>>>>> On 29.08.25 03:55, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/8/28 18:48, Dev Jain wrote:
>>>>>>>>
>>>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>>>>>>> (Sorry for chiming in late)
>>>>>>>>>
>>>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>>>>>>> but not sure
>>>>>>>>>>>> if we have to add that for now.
>>>>>>>>>>>
>>>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>>>>>>> yes you
>>>>>>>>>>> might add it to the docs, but people are going to be mightily
>>>>>>>>>>> confused, esp if
>>>>>>>>>>> it's a calculated value.
>>>>>>>>>>>
>>>>>>>>>>> I don't see any other way around having a separate tunable if we
>>>>>>>>>>> don't just have
>>>>>>>>>>> something VERY simple like on/off.
>>>>>>>>>>
>>>>>>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>>>>>>> really.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>>>>>>> fixed elsewhere
>>>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>>>>>>> wrong - and
>>>>>>>>>>> happy to stand corrected if this is somehow inherent, but really feels that
>>>>>>>>>>> feels that
>>>>>>>>>>> way).
>>>>>>>>>>
>>>>>>>>>> I think the creep is unavoidable for certain values.
>>>>>>>>>>
>>>>>>>>>> If you have the first two pages of a PMD area populated, and you
>>>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse
>>>>>>>>>> first an
>>>>>>>>>> order-2 folio, then an order-3 ... until you reached PMD order.
>>>>>>>>>>
>>>>>>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>>>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>>>>>>> least one pte used".
>>>>>>>>>
>>>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>>>>>>> highest enabled order would ever be collapsed."
>>>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>>>>>>> highest order folio.
>>>>>>>
>>>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>>>>>>> mean is, as in the example I gave below, users may only want to allow a
>>>>>>> large order collapse when the number of present PTEs reaches half of the
>>>>>>> large folio, in order to avoid RSS bloat.
>>>>>>
>>>>>> How do these users control allocation at fault time where this parameter
>>>>>> is completely ignored?
>>>>>
>>>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to
>>>>> control allocation at fault time? Could you be more specific? Thanks.
>>>>
>>>> The comment over khugepaged_max_ptes_none gives a hint:
>>>>
>>>> /*
>>>> * default collapse hugepages if there is at least one pte mapped like
>>>> * it would have happened if the vma was large enough during page
>>>> * fault.
>>>> *
>>>> * Note that these are only respected if collapse was initiated by khugepaged.
>>>> */
>>>>
>>>> In the common case (for anything that really cares about RSS bloat) you will just
>>>> get a THP during page fault and consequently RSS bloat.
>>>>
>>>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems
>>>> to be when an application later (after once possibly getting a THP already during
>>>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using
>>>> MADV_COLLAPSE.
>>>>
>>>> It's a questionable use case, that already got more problematic with mTHP and page
>>>> table reclaim.
>>>>
>>>> Let me explain:
>>>>
>>>> Before mTHP, if someone would MADV_DONTNEED (resulting in
>>>> a page table with at least one pte_none entry), there would have been no way we would
>>>> get memory over-allocated afterwards with max_ptes_none=0.
>>>>
>>>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages.
>>>> (2) khugepaged was told to not collapse through max_ptes_none=0.
>>>>
>>>> But now:
>>>>
>>>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such
>>>> an area again: page faults will simply spot a bunch of pte_nones around the fault area
>>>> and install an mTHP.
>>>>
>>>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the
>>>> page table. The next page fault will just try installing a PMD THP again, because there is
>>>> no PTE table anymore.
>>>>
>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
>>>> limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense.
Thanks David for your explanation. I see your point now.
>>> For PMD mapped THPs with the THP shrinker, this has changed. You can basically tame page faults, as when you encounter
>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR - 1 (i.e. 511 for x86), and
>>> will break down those hugepages and free up zero-filled memory.
>>
>> You are not really taming page faults, though, you are undoing what page faults might have messed up :)
>>
>> I have seen in our prod workloads where
>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure,
>>> the memory usage is lower compared to with max_ptes_none = 511, while still keeping the benefits
>>> of THPs like lower TLB misses.
>>
>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split.
>>
>
> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other.
>
>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes
>> zero pages.
>>
>>>
>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently
>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean
>>> that the memory regression of using THP=always vs THP=madvise is halved.
>>
>> To which value would you set it? Just 510? 0?
>>
>
> There are some very large workloads in the meta fleet that I experimented with and found that having
> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal
> compromise in terms of improving application metrics, having an acceptable amount of memory regression and
> improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out
> there for these workloads, but it was not possible to experiment with every value.
>
> In terms of wider rollout across the fleet, we are going to target 0 (or a very, very small value)
> when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression, as
> the THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse
> pages that are dominated by 4K zero-filled chunks.
Thanks for sharing this. We're also investigating what max_ptes_none
should be set to in order to use the THP shrinker properly, and
currently, our customers always set max_ptes_none to its default value:
511, which is not good.
If 0 is better, it seems like there isn't much conflict with the values
expected by mTHP collapse (0 and 511). Sounds good to me.
Thread overview: 75+ messages
2025-08-19 13:41 [PATCH v10 00/13] khugepaged: mTHP support Nico Pache
2025-08-19 13:41 ` [PATCH v10 01/13] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2025-08-20 10:42 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 02/13] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-08-20 11:21 ` Lorenzo Stoakes
2025-08-20 16:35 ` Nico Pache
2025-08-22 10:21 ` Lorenzo Stoakes
2025-08-26 13:30 ` Nico Pache
2025-08-19 13:41 ` [PATCH v10 03/13] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-08-20 13:23 ` Lorenzo Stoakes
2025-08-20 15:40 ` Nico Pache
2025-08-21 3:41 ` Wei Yang
2025-08-21 14:09 ` Zi Yan
2025-08-22 10:25 ` Lorenzo Stoakes
2025-08-24 1:37 ` Wei Yang
2025-08-26 13:46 ` Nico Pache
2025-08-19 13:41 ` [PATCH v10 04/13] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-08-20 13:28 ` Lorenzo Stoakes
2025-08-19 13:41 ` [PATCH v10 05/13] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-08-20 14:22 ` Lorenzo Stoakes
2025-09-01 16:15 ` David Hildenbrand
2025-08-19 13:41 ` [PATCH v10 06/13] khugepaged: add " Nico Pache
2025-08-20 18:29 ` Lorenzo Stoakes
2025-09-02 20:12 ` Nico Pache
2025-08-19 13:41 ` [PATCH v10 07/13] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-08-21 12:05 ` Lorenzo Stoakes
2025-08-21 12:33 ` Dev Jain
2025-08-22 10:33 ` Lorenzo Stoakes
2025-08-21 16:54 ` Steven Rostedt
2025-08-21 16:56 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 08/13] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-08-20 10:38 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 09/13] khugepaged: enable collapsing mTHPs even when PMD THPs are disabled Nico Pache
2025-08-21 13:35 ` Lorenzo Stoakes
2025-08-19 13:42 ` [PATCH v10 10/13] khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs Nico Pache
2025-08-21 14:18 ` Lorenzo Stoakes
2025-08-21 14:26 ` Lorenzo Stoakes
2025-08-22 6:59 ` Baolin Wang
2025-08-22 7:36 ` Dev Jain
2025-08-19 13:42 ` [PATCH v10 11/13] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-08-21 14:24 ` Lorenzo Stoakes
2025-08-19 14:16 ` [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats Nico Pache
2025-08-21 14:47 ` Lorenzo Stoakes
2025-08-19 14:17 ` [PATCH v10 13/13] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-08-21 15:03 ` Lorenzo Stoakes
2025-08-19 21:55 ` [PATCH v10 00/13] khugepaged: mTHP support Andrew Morton
2025-08-20 15:55 ` Nico Pache
2025-08-21 15:01 ` Lorenzo Stoakes
2025-08-21 15:13 ` Dev Jain
2025-08-21 15:19 ` Lorenzo Stoakes
2025-08-21 15:25 ` Nico Pache
2025-08-21 15:27 ` Nico Pache
2025-08-21 15:32 ` Lorenzo Stoakes
2025-08-21 16:46 ` Nico Pache
2025-08-21 16:54 ` Lorenzo Stoakes
2025-08-21 17:26 ` David Hildenbrand
2025-08-21 20:43 ` David Hildenbrand
2025-08-22 10:41 ` Lorenzo Stoakes
2025-08-22 14:10 ` David Hildenbrand
2025-08-22 14:49 ` Lorenzo Stoakes
2025-08-22 15:33 ` Dev Jain
2025-08-26 10:43 ` Lorenzo Stoakes
2025-08-28 9:46 ` Baolin Wang
2025-08-28 10:48 ` Dev Jain
2025-08-29 1:55 ` Baolin Wang
2025-09-01 16:46 ` David Hildenbrand
2025-09-02 2:28 ` Baolin Wang
2025-09-02 9:03 ` David Hildenbrand
2025-09-02 10:34 ` Usama Arif
2025-09-02 11:03 ` David Hildenbrand
2025-09-02 20:23 ` Usama Arif
2025-09-03 3:27 ` Baolin Wang
2025-08-21 16:38 ` Liam R. Howlett
2025-09-01 16:21 ` David Hildenbrand
2025-09-01 17:06 ` David Hildenbrand