From: Andrew Morton <akpm@linux-foundation.org>
To: mm-commits@vger.kernel.org,npache@redhat.com,akpm@linux-foundation.org
Subject: [to-be-updated] mm-khugepaged-generalize-__collapse_huge_page_-for-mthp-support.patch removed from -mm tree
Date: Mon, 11 May 2026 13:55:59 -0700 [thread overview]
Message-ID: <20260511205600.22681C2BCB0@smtp.kernel.org> (raw)
The quilt patch titled
Subject: mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
has been removed from the -mm tree. Its filename was
mm-khugepaged-generalize-__collapse_huge_page_-for-mthp-support.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Nico Pache <npache@redhat.com>
Subject: mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
Date: Sun, 19 Apr 2026 12:57:41 -0600
Generalize the order of the __collapse_huge_page_* and collapse_max_*
functions to support future mTHP collapse.
The current mechanism for determining collapse with the
khugepaged_max_ptes_none value is not designed with mTHP in mind. This
raises a key design issue: if we support user-defined max_ptes_none values
(even those scaled by order), a collapse at a lower order can introduce
a feedback loop, or "creep", when max_ptes_none is set to a value greater
than HPAGE_PMD_NR / 2.
With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.
To fix this issue, introduce a helper function that limits mTHP
collapse support to two max_ptes_none values: 0 and HPAGE_PMD_NR - 1.
This effectively supports two modes:
- max_ptes_none=0: never introduce new none-pages for mTHP collapse.
- max_ptes_none=511 (with a 4K page size): always collapse to the highest
available mTHP order.
This removes the possibility of "creep", while not modifying any uAPI
expectations. A warning will be emitted if an unsupported
max_ptes_none value is configured with mTHP enabled.
mTHP collapse will not honor the khugepaged_max_ptes_shared or
khugepaged_max_ptes_swap parameters, and will fail if it encounters a
shared or swapped entry.
No functional change in this patch; however, it defines future behavior
for mTHP collapse.
Link: https://lore.kernel.org/20260419185750.260784-5-npache@redhat.com
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Takashi Iwai (SUSE) <tiwai@suse.de>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Usama Arif <usama.arif@linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++--------------
1 file changed, 88 insertions(+), 36 deletions(-)
--- a/mm/khugepaged.c~mm-khugepaged-generalize-__collapse_huge_page_-for-mthp-support
+++ a/mm/khugepaged.c
@@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
* collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
* @cc: The collapse control struct
* @vma: The vma to check for userfaultfd
+ * @order: The folio order being collapsed to
*
* If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
- * empty page.
+ * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
+ * configured khugepaged_max_ptes_none value.
+ *
+ * For mTHP collapses, we currently only support khugepaged_max_ptes_none
+ * values of 0 or KHUGEPAGED_MAX_PTES_LIMIT. Any other value will emit a
+ * warning and no mTHP collapse will be attempted.
*
* Return: Maximum number of empty PTEs allowed for the collapse operation
*/
-static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
- struct vm_area_struct *vma)
+static int collapse_max_ptes_none(struct collapse_control *cc,
+ struct vm_area_struct *vma, unsigned int order)
{
if (vma && userfaultfd_armed(vma))
return 0;
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
- return khugepaged_max_ptes_none;
+ if (is_pmd_order(order))
+ return khugepaged_max_ptes_none;
+ /* Zero/non-present collapse disabled. */
+ if (!khugepaged_max_ptes_none)
+ return 0;
+ if (khugepaged_max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
+ return (1 << order) - 1;
+
+ pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
+ KHUGEPAGED_MAX_PTES_LIMIT);
+ return -EINVAL;
}
/**
* collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
* @cc: The collapse control struct
+ * @order: The folio order being collapsed to
*
* If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
* shared page.
*
+ * For mTHP collapses, we currently don't support collapsing memory that
+ * contains shared pages.
+ *
* Return: Maximum number of shared PTEs allowed for the collapse operation
*/
-static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
+ unsigned int order)
{
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
+ if (!is_pmd_order(order))
+ return 0;
+
return khugepaged_max_ptes_shared;
}
/**
* collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
* @cc: The collapse control struct
+ * @order: The folio order being collapsed to
*
* If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
* swap page.
*
+ * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
+ * khugepaged_max_ptes_swap value.
+ *
+ * For mTHP collapses, we currently don't support collapsing memory that
+ * contains swapped-out pages.
+ *
* Return: Maximum number of swap PTEs allowed for the collapse operation
*/
-static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
+ unsigned int order)
{
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
+ if (!is_pmd_order(order))
+ return 0;
+
return khugepaged_max_ptes_swap;
}
@@ -590,18 +625,22 @@ static void release_pte_pages(pte_t *pte
static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
- struct list_head *compound_pagelist)
+ unsigned int order, struct list_head *compound_pagelist)
{
+ const unsigned long nr_pages = 1UL << order;
struct page *page = NULL;
struct folio *folio = NULL;
unsigned long addr = start_addr;
pte_t *_pte;
int none_or_zero = 0, shared = 0, referenced = 0;
enum scan_result result = SCAN_FAIL;
- unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
- unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
+ int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
+
+ if (max_ptes_none < 0)
+ return result;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + nr_pages;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (pte_none_or_zero(pteval)) {
@@ -734,18 +773,18 @@ out:
}
static void __collapse_huge_page_copy_succeeded(pte_t *pte,
- struct vm_area_struct *vma,
- unsigned long address,
- spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct vm_area_struct *vma, unsigned long address,
+ spinlock_t *ptl, unsigned int order,
+ struct list_head *compound_pagelist)
{
- unsigned long end = address + HPAGE_PMD_SIZE;
+ const unsigned long nr_pages = 1UL << order;
+ unsigned long end = address + (PAGE_SIZE << order);
struct folio *src, *tmp;
pte_t pteval;
pte_t *_pte;
unsigned int nr_ptes;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+ for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
address += nr_ptes * PAGE_SIZE) {
nr_ptes = 1;
pteval = ptep_get(_pte);
@@ -798,13 +837,11 @@ static void __collapse_huge_page_copy_su
}
static void __collapse_huge_page_copy_failed(pte_t *pte,
- pmd_t *pmd,
- pmd_t orig_pmd,
- struct vm_area_struct *vma,
- struct list_head *compound_pagelist)
+ pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
+ unsigned int order, struct list_head *compound_pagelist)
{
+ const unsigned long nr_pages = 1UL << order;
spinlock_t *pmd_ptl;
-
/*
* Re-establish the PMD to point to the original page table
* entry. Restoring PMD needs to be done prior to releasing
@@ -818,7 +855,7 @@ static void __collapse_huge_page_copy_fa
* Release both raw and compound pages isolated
* in __collapse_huge_page_isolate.
*/
- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+ release_pte_pages(pte, pte + nr_pages, compound_pagelist);
}
/*
@@ -838,16 +875,16 @@ static void __collapse_huge_page_copy_fa
*/
static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
- unsigned long address, spinlock_t *ptl,
+ unsigned long address, spinlock_t *ptl, unsigned int order,
struct list_head *compound_pagelist)
{
+ const unsigned long nr_pages = 1UL << order;
unsigned int i;
enum scan_result result = SCAN_SUCCEED;
-
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
- for (i = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0; i < nr_pages; i++) {
pte_t pteval = ptep_get(pte + i);
struct page *page = folio_page(folio, i);
unsigned long src_addr = address + i * PAGE_SIZE;
@@ -866,10 +903,10 @@ static enum scan_result __collapse_huge_
if (likely(result == SCAN_SUCCEED))
__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
- compound_pagelist);
+ order, compound_pagelist);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
- compound_pagelist);
+ order, compound_pagelist);
return result;
}
@@ -1040,12 +1077,12 @@ static enum scan_result check_pmd_still_
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
*/
static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
- int referenced)
+ struct vm_area_struct *vma, unsigned long start_addr,
+ pmd_t *pmd, int referenced, unsigned int order)
{
int swapped_in = 0;
vm_fault_t ret = 0;
- unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
+ unsigned long addr, end = start_addr + (PAGE_SIZE << order);
enum scan_result result;
pte_t *pte = NULL;
spinlock_t *ptl;
@@ -1077,6 +1114,19 @@ static enum scan_result __collapse_huge_
pte_present(vmf.orig_pte))
continue;
+ /*
+ * TODO: Support swapin without leading to further mTHP
+ * collapses. Currently bringing in new pages via swapin may
+ * cause a future higher order collapse on a rescan of the same
+ * range.
+ */
+ if (!is_pmd_order(order)) {
+ pte_unmap(pte);
+ mmap_read_unlock(mm);
+ result = SCAN_EXCEED_SWAP_PTE;
+ goto out;
+ }
+
vmf.pte = pte;
vmf.ptl = ptl;
ret = do_swap_page(&vmf);
@@ -1196,7 +1246,7 @@ static enum scan_result collapse_huge_pa
* that case. Continuing to collapse causes inconsistency.
*/
result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced);
+ referenced, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
@@ -1244,6 +1294,7 @@ static enum scan_result collapse_huge_pa
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ HPAGE_PMD_ORDER,
&compound_pagelist);
spin_unlock(pte_ptl);
} else {
@@ -1274,6 +1325,7 @@ static enum scan_result collapse_huge_pa
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, address, pte_ptl,
+ HPAGE_PMD_ORDER,
&compound_pagelist);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
@@ -1318,9 +1370,9 @@ static enum scan_result collapse_scan_pm
unsigned long addr;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
- unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
- unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
- unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
+ int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
+ unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
@@ -2371,8 +2423,8 @@ static enum scan_result collapse_scan_fi
int present, swap;
int node = NUMA_NO_NODE;
enum scan_result result = SCAN_SUCCEED;
- unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
- unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
+ int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
+ unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
present = 0;
swap = 0;
_
Patches currently in -mm which might be from npache@redhat.com are
mm-khugepaged-generalize-collapse_huge_page-for-mthp-collapse.patch
mm-khugepaged-skip-collapsing-mthp-to-smaller-orders.patch
mm-khugepaged-add-per-order-mthp-collapse-failure-statistics.patch
mm-khugepaged-improve-tracepoints-for-mthp-orders.patch
mm-khugepaged-introduce-collapse_allowable_orders-helper-function.patch
mm-khugepaged-introduce-mthp-collapse-support.patch
mm-khugepaged-avoid-unnecessary-mthp-collapse-attempts.patch
documentation-mm-update-the-admin-guide-for-mthp-collapse.patch