From: Wei Yang <richard.weiyang@gmail.com>
To: Lance Yang <lance.yang@linux.dev>
Cc: npache@redhat.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, aarcange@redhat.com,
akpm@linux-foundation.org, anshuman.khandual@arm.com,
apopple@nvidia.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, byungchul@sk.com,
catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net,
dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com,
gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com,
jack@suse.cz, jackmanb@google.com, jannh@google.com,
jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org,
liam@infradead.org, ljs@kernel.org,
mathieu.desnoyers@efficios.com, matthew.brost@intel.com,
mhiramat@kernel.org, mhocko@suse.com, peterx@redhat.com,
pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com,
rdunlap@infradead.org, richard.weiyang@gmail.com,
rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org,
ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com,
surenb@google.com, thomas.hellstrom@linux.intel.com,
tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz,
vishal.moola@gmail.com, wangkefeng.wang@huawei.com,
will@kernel.org, willy@infradead.org,
yang@os.amperecomputing.com, ying.huang@linux.alibaba.com,
ziy@nvidia.com, zokeefe@google.com
Subject: Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
Date: Thu, 14 May 2026 03:10:10 +0000
Message-ID: <20260514031009.f66cgop3ctgiqxz3@master>
In-Reply-To: <20260512074202.10253-1-lance.yang@linux.dev>
On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
>
>On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>>generalize the order of the __collapse_huge_page_* and collapse_max_*
>>functions to support future mTHP collapse.
>>
>>The current mechanism for determining collapse with the
>>khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>>raises a key design issue: if we support user defined max_pte_none values
>>(even those scaled by order), a collapse of a lower order can introduce
>>a feedback loop, or "creep", when max_ptes_none is set to a value greater
>>than HPAGE_PMD_NR / 2. [1]
>>
>>With this configuration, a successful collapse to order N will populate
>>enough pages to satisfy the collapse condition on order N+1 on the next
>>scan. This leads to unnecessary work and memory churn.
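
To make the creep concrete, here is a small userspace sketch of the arithmetic. It is not kernel code: it assumes a 4K base page (PMD order 9), assumes max_ptes_none is scaled to an order by a right shift, and assumes the sibling half of the order N+1 range is completely empty. All sketch_* names are made up for illustration.

```c
#include <stdbool.h>

/* Assumption: 4K base pages, so a PMD maps 2^9 PTEs. */
#define SKETCH_PMD_ORDER 9u

/*
 * Assumed scaling rule: shrink the PMD-sized max_ptes_none value in
 * proportion to the size of the order being collapsed.
 */
static unsigned int sketch_scaled_max(unsigned int max_ptes_none,
				      unsigned int order)
{
	return max_ptes_none >> (SKETCH_PMD_ORDER - order);
}

/*
 * After a successful collapse to @order, 2^order PTEs are populated.
 * If the other half of the enclosing order+1 range is empty, that range
 * holds 2^order none PTEs. Report whether it would still pass the scaled
 * limit on the next scan. Only meaningful for order < SKETCH_PMD_ORDER.
 */
static bool sketch_next_order_eligible(unsigned int max_ptes_none,
				       unsigned int order)
{
	unsigned int none_in_parent = 1u << order;

	return none_in_parent <= sketch_scaled_max(max_ptes_none, order + 1);
}
```

Under these assumptions, max_ptes_none=384 lets an order-3 collapse make the order-4 range eligible (8 none PTEs against a scaled limit of 12), while max_ptes_none=64 does not (8 against 2). The exact boundary depends on rounding and on how the comparison is written.
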
>>
>>To fix this issue introduce a helper function that will limit mTHP
>>collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>>This effectively supports two modes: [2]
>>
>>- max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>> that maps the shared zeropage. Consequently, no memory bloat.
>>- max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>> available mTHP order.
>>
>>This removes the possibility of "creep", while not modifying any uAPI
>>expectations. A warning will be emitted if any non-supported
>>max_ptes_none value is configured with mTHP enabled.
>>
>>mTHP collapse will not honor the khugepaged_max_ptes_shared or
>>khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>>shared or swapped entry.
>>
>>No functional changes in this patch; however it defines future behavior
>>for mTHP collapse.
>>
>>[1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>>[2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>>
>>Co-developed-by: Dev Jain <dev.jain@arm.com>
>>Signed-off-by: Dev Jain <dev.jain@arm.com>
>>Signed-off-by: Nico Pache <npache@redhat.com>
>>---
>> include/trace/events/huge_memory.h | 3 +-
>> mm/khugepaged.c | 117 ++++++++++++++++++++---------
>> 2 files changed, 85 insertions(+), 35 deletions(-)
>>
>>diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>index bcdc57eea270..443e0bd13fdb 100644
>>--- a/include/trace/events/huge_memory.h
>>+++ b/include/trace/events/huge_memory.h
>>@@ -39,7 +39,8 @@
>> EM( SCAN_STORE_FAILED, "store_failed") \
>> EM( SCAN_COPY_MC, "copy_poisoned_page") \
>> EM( SCAN_PAGE_FILLED, "page_filled") \
>>- EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>>+ EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
>>+ EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
>>
>> #undef EM
>> #undef EMe
>>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>index f68853b3caa7..27465161fa6d 100644
>>--- a/mm/khugepaged.c
>>+++ b/mm/khugepaged.c
>>@@ -61,6 +61,7 @@ enum scan_result {
>> SCAN_COPY_MC,
>> SCAN_PAGE_FILLED,
>> SCAN_PAGE_DIRTY_OR_WRITEBACK,
>>+ SCAN_INVALID_PTES_NONE,
>> };
>>
>> #define CREATE_TRACE_POINTS
>>@@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>> * PTEs for the given collapse operation.
>> * @cc: The collapse control struct
>> * @vma: The vma to check for userfaultfd
>>+ * @order: The folio order being collapsed to
>> *
>> * Return: Maximum number of none-page or zero-page PTEs allowed for the
>> * collapse operation.
>> */
>>-static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>>- struct vm_area_struct *vma)
>>+static int collapse_max_ptes_none(struct collapse_control *cc,
>>+ struct vm_area_struct *vma, unsigned int order)
>> {
>>+ unsigned int max_ptes_none = khugepaged_max_ptes_none;
>> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>
>One thing I still want to call out: kernel code usually uses C-style
>comments :)
>
>> if (vma && userfaultfd_armed(vma))
>> return 0;
>> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>> if (!cc->is_khugepaged)
>> return HPAGE_PMD_NR;
>>- // For all other cases repect the user defined maximum.
>>- return khugepaged_max_ptes_none;
>>+ // for PMD collapse, respect the user defined maximum.
>>+ if (is_pmd_order(order))
>>+ return max_ptes_none;
>>+ /* Zero/non-present collapse disabled. */
>>+ if (!max_ptes_none)
>>+ return 0;
>>+ // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>>+ // scale the maximum number of PTEs to the order of the collapse.
>>+ if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>>+ return (1 << order) - 1;
>>+
>>+ // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>>+ // Emit a warning and return -EINVAL.
>>+ pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>>+ KHUGEPAGED_MAX_PTES_LIMIT);
>
>Maybe fallback to 0 instead, as David suggested earlier?
>
Falling back to 0 looks reasonable.

But the updated documentation in patch 14 says:

  For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
  value will emit a warning and no mTHP collapse will be attempted.

That is why the code currently behaves this way:

  mthp_collapse()
      max_ptes_none = collapse_max_ptes_none();
      if (max_ptes_none < 0)
          return collapsed;

>max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
>intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
>disable it :(
>
So it depends on what we want to do here :-)

Personally, I would vote for falling back to 0.
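
For reference, a rough userspace model of what the fallback could look like. This is only a sketch, not a proposed patch: the sketch_* names and constants are stand-ins I made up for the kernel macros, and a 4K base page (PMD order 9) is assumed.

```c
#include <stdbool.h>

/* Assumed stand-ins for HPAGE_PMD_ORDER, HPAGE_PMD_NR and
 * KHUGEPAGED_MAX_PTES_LIMIT on a 4K base page configuration. */
#define SKETCH_HPAGE_PMD_ORDER	9u
#define SKETCH_HPAGE_PMD_NR	(1u << SKETCH_HPAGE_PMD_ORDER)
#define SKETCH_MAX_PTES_LIMIT	(SKETCH_HPAGE_PMD_NR - 1)

/*
 * Hypothetical model of collapse_max_ptes_none() with the fallback:
 * an unsupported sysctl value is treated as 0 for mTHP orders instead
 * of returning -EINVAL.
 */
static unsigned int sketch_max_ptes_none(unsigned int sysctl_val,
					 bool uffd_armed, bool is_khugepaged,
					 unsigned int order)
{
	/* userfaultfd-armed VMA: allow no none-page or zero-page PTEs. */
	if (uffd_armed)
		return 0;
	/* MADV_COLLAPSE: allow any number of none-page or zero-page PTEs. */
	if (!is_khugepaged)
		return SKETCH_HPAGE_PMD_NR;
	/* PMD collapse: respect the user defined maximum. */
	if (order == SKETCH_HPAGE_PMD_ORDER)
		return sysctl_val;
	/* mTHP with the limit value: scale to the order of the collapse. */
	if (sysctl_val == SKETCH_MAX_PTES_LIMIT)
		return (1u << order) - 1;
	/* mTHP with 0 or any intermediate value: fall back to 0. */
	return 0;
}
```

With this shape the helper could keep an unsigned return type and the caller would not need the "< 0" check, but the documentation in patch 14 would have to be updated to match.
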
>Treating those values as 0 feels like the least surprising behavior,
>IMHO. It also gives mTHP a cleaner starting point, rather than carrying
>over all the old PMD knob semantics :)
>
>Otherwise, LGTM!
>Reviewed-by: Lance Yang <lance.yang@linux.dev>
>
>>+ return -EINVAL;
--
Wei Yang
Help you, Help me