From: Lance Yang <lance.yang@linux.dev>
To: npache@redhat.com
Cc: lance.yang@linux.dev, akpm@linux-foundation.org, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
richard.weiyang@gmail.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, david@kernel.org, ljs@kernel.org
Subject: Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
Date: Thu, 18 Jun 2026 21:10:24 +0800 [thread overview]
Message-ID: <20260618131024.69667-1-lance.yang@linux.dev> (raw)
In-Reply-To: <CAA1CXcDv7WVxMcGv9OOZDzTzzcfEAnaGbX5dJGwqN5mSWkmRvA@mail.gmail.com>
+Cc David and Lorenzo, address oops fixed :)
On Thu, Jun 18, 2026 at 06:22:49AM -0600, Nico Pache wrote:
>On Tue, Oct 14, 2025 at 6:37 AM Lorenzo Stoakes
><lorenzo.stoakes@oracle.com> wrote:
>>
>> On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>> > From: Lance Yang <lance.yang@linux.dev>
>> >
>> > As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
>> > and __collapse_huge_page_isolate() was almost duplicated.
>> >
>> > This patch cleans things up by moving all the common PTE checking logic
>> > into a new shared helper, thp_collapse_check_pte(). While at it, we use
>> > vm_normal_folio() instead of vm_normal_page().
>> >
>> > Suggested-by: David Hildenbrand <david@redhat.com>
>> > Suggested-by: Dev Jain <dev.jain@arm.com>
>> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>
>> In general I like the idea, the implementation needs work though.
>>
>> Will check in more detail on respin
>>
>> > ---
>> > mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>> > 1 file changed, 130 insertions(+), 113 deletions(-)
>> >
>> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> > index b5c0295c3414..7116caae1fa4 100644
>> > --- a/mm/khugepaged.c
>> > +++ b/mm/khugepaged.c
>> > @@ -61,6 +61,12 @@ enum scan_result {
>> > SCAN_PAGE_FILLED,
>> > };
>> >
>> > +enum pte_check_result {
>> > + PTE_CHECK_SUCCEED,
>> > + PTE_CHECK_CONTINUE,
>>
>> Don't love this logic - this feels like we're essentially abstracting
>> control flow, I mean what does 'continue' mean here? Other than you're in a
>> loop and please continue, which is relying a little too much on what the
>> caller does.
>>
>> if we retain this logic something like PTE_CHECK_SKIP would make more sense.
>>
>>
>> > + PTE_CHECK_FAIL,
>> > +};
>> > +
>> > #define CREATE_TRACE_POINTS
>> > #include <trace/events/huge_memory.h>
>> >
>> > @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>> > }
>> > }
>> >
>> > +/*
>> > + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>> > + * @pte: The PTE to check
>> > + * @vma: The VMA the PTE belongs to
>> > + * @addr: The virtual address corresponding to this PTE
>> > + * @foliop: On success, used to return a pointer to the folio
>> > + * Must be non-NULL
>> > + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
>> > + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
>> > + * @shared: Counter for shared pages. Must be non-NULL
>> > + * @scan_result: Used to return the failure reason (SCAN_*) on a
>> > + * PTE_CHECK_FAIL return. Must be non-NULL
>> > + * @cc: Collapse control settings
>>
>> Do we really need this many parameters? THis is hard to follow.
>>
>> Of course it being me, I'd immediately prefer a helper struct :)
>
>Hey!
>
>Ive been trying to reimplement this patch, and started converting this
>to a helper struct... At first glance it seems like the right
>approach, but because the scan/isolate share two slightly different
>implementations, we need to leave some logic in the parent. This
>results in polluting the rest of the code significantly just to save
>one ugly function call with many parameters.
Been a while ... so I don't fully recall anymore :)
If you already have a patch or series, maybe just post it and we can look
at the actual diff.
Probably easier than talking about it without seeing the code :D
Cheers, Lance
>
>not sure what to do about that tbh.
>
>e.g., (just one of many examples)
>
>
>- NR_ISOLATED_ANON + folio_is_file_lru(folio),
>- folio_nr_pages(folio));
>- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>- VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>+ NR_ISOLATED_ANON + folio_is_file_lru(ctx.folio),
>+ folio_nr_pages(ctx.folio));
>+ VM_BUG_ON_FOLIO(!folio_test_locked(ctx.folio), ctx.folio);
>+ VM_BUG_ON_FOLIO(folio_test_lru(ctx.folio), ctx.folio);
>
>- if (folio_test_large(folio))
>- list_add_tail(&folio->lru, compound_pagelist);
>+ if (folio_test_large(ctx.folio))
>+ list_add_tail(ctx.folio->lru, compound_pagelist);
>
>Are we sure this helper structure is the right approach?
>
>-- Nico
>
>>
>> > + *
>> > + * Returns:
>> > + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
>> > + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>> > + * PTE_CHECK_FAIL - Abort collapse scan
>> > + */
>> > +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>>
>> There's no need for inline in an internal static function in a file.
>>
>> > + unsigned long addr, struct folio **foliop, int *none_or_zero,
>> > + int *unmapped, int *shared, int *scan_result,
>> > + struct collapse_control *cc)
>> > +{
>> > + struct folio *folio = NULL;
>> > +
>> > + if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>> > + (*none_or_zero)++;
>> > + if (!userfaultfd_armed(vma) &&
>> > + (!cc->is_khugepaged ||
>> > + *none_or_zero <= khugepaged_max_ptes_none)) {
>> > + return PTE_CHECK_CONTINUE;
>> > + } else {
>> > + *scan_result = SCAN_EXCEED_NONE_PTE;
>> > + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > + } else if (!pte_present(pte)) {
>>
>> You're now making the if-else issues with previous patches worse by
>> returning which gets even checkpatch to warn you.
>>
>> if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>>
>> (of course note that I am not convinced you can drop the pte_present()
>> check here)
>>
>> (*none_or_zero)++;
>> if (!userfaultfd_armed(vma) &&
>> (!cc->is_khugepaged ||
>> *none_or_zero <= khugepaged_max_ptes_none))
>> return PTE_CHECK_CONTINUE;
>>
>> *scan_result = SCAN_EXCEED_NONE_PTE;
>> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> return PTE_CHECK_FAIL;
>> }
>>
>> if (!pte_present(pte)) {
>> ...
>> }
>>
>> But even that really needs seriously more refactoring - that condition
>> above is horrible for instance so, e.g.:
>>
>> if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>> (*none_or_zero)++;
>>
>> /* Cases where this is acceptable. */
>> if (!userfaultfd_armed(vma))
>> return PTE_CHECK_SKIP;
>> if (!cc->is_khugepaged)
>> return PTE_CHECK_SKIP;
>> if (*none_or_zero <= khugepaged_max_ptes_none)
>> return PTE_CHECK_SKIP;
>>
>> /* Otherwise, we must abort the scan. */
>> *scan_result = SCAN_EXCEED_NONE_PTE;
>> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> return PTE_CHECK_FAIL;
>> }
>>
>> if (!pte_present(pte)) {
>> ...
>> }
>>
>> Improves things a lot.
>>
>> I do however _hate_ this (*blah)++; thing. A helper struct would help :)
>>
>>
>>
>> > + if (!unmapped) {
>> > + *scan_result = SCAN_PTE_NON_PRESENT;
>> > + return PTE_CHECK_FAIL;
>>
>> Can't we have a helper that sets the result, e.g.:
>>
>> return pte_check_fail(scan_result, SCAN_PTE_NON_PRESENT);
>>
>> static enum pte_check_result pte_check_fail(int *scan_result,
>> enum pte_check_result result)
>> {
>> *scan_result = result;
>> return PTE_CHECK_FAIL;
>> }
>>
>> And same for success/skip
>>
>> Then all of these horrible if (blah) { *foo = bar; return baz; } can be
>> made into
>>
>> if (blah)
>> return pte_check_xxx(..., SCAN_PTE_...);
>>
>> > + }
>> > +
>> > + if (non_swap_entry(pte_to_swp_entry(pte))) {
>> > + *scan_result = SCAN_PTE_NON_PRESENT;
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > +
>> > + (*unmapped)++;
>> > + if (!cc->is_khugepaged ||
>> > + *unmapped <= khugepaged_max_ptes_swap) {
>> > + /*
>> > + * Always be strict with uffd-wp enabled swap
>> > + * entries. Please see comment below for
>> > + * pte_uffd_wp().
>> > + */
>> > + if (pte_swp_uffd_wp(pte)) {
>> > + *scan_result = SCAN_PTE_UFFD_WP;
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > + return PTE_CHECK_CONTINUE;
>> > + } else {
>> > + *scan_result = SCAN_EXCEED_SWAP_PTE;
>> > + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > + } else if (pte_uffd_wp(pte)) {
>> > + /*
>> > + * Don't collapse the page if any of the small PTEs are
>> > + * armed with uffd write protection. Here we can also mark
>> > + * the new huge pmd as write protected if any of the small
>> > + * ones is marked but that could bring unknown userfault
>> > + * messages that falls outside of the registered range.
>> > + * So, just be simple.
>> > + */
>> > + *scan_result = SCAN_PTE_UFFD_WP;
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > +
>>
>> Again as all of the comments for previous series still apply here so
>> obviously I don't like this as-is :)
>>
>> And as Andrew has pointed out, now checkpatch is finding the 'pointless
>> else' issue (as mentioned above also).
>>
>> > + folio = vm_normal_folio(vma, addr, pte);
>> > + if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
>> > + *scan_result = SCAN_PAGE_NULL;
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > +
>> > + if (!folio_test_anon(folio)) {
>> > + VM_WARN_ON_FOLIO(true, folio);
>> > + *scan_result = SCAN_PAGE_ANON;
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > +
>> > + /*
>> > + * We treat a single page as shared if any part of the THP
>> > + * is shared.
>> > + */
>> > + if (folio_maybe_mapped_shared(folio)) {
>> > + (*shared)++;
>> > + if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
>> > + *scan_result = SCAN_EXCEED_SHARED_PTE;
>> > + count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> > + return PTE_CHECK_FAIL;
>> > + }
>> > + }
>> > +
>> > + *foliop = folio;
>> > +
>> > + return PTE_CHECK_SUCCEED;
>> > +}
>> > +
>> > static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> > unsigned long start_addr,
>> > pte_t *pte,
>> > struct collapse_control *cc,
>> > struct list_head *compound_pagelist)
>> > {
>> > - struct page *page = NULL;
>> > struct folio *folio = NULL;
>> > unsigned long addr = start_addr;
>> > pte_t *_pte;
>> > int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>> > + int pte_check_res;
>> >
>> > for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> > _pte++, addr += PAGE_SIZE) {
>> > pte_t pteval = ptep_get(_pte);
>> > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> > - ++none_or_zero;
>> > - if (!userfaultfd_armed(vma) &&
>> > - (!cc->is_khugepaged ||
>> > - none_or_zero <= khugepaged_max_ptes_none)) {
>> > - continue;
>> > - } else {
>> > - result = SCAN_EXCEED_NONE_PTE;
>> > - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > - goto out;
>> > - }
>> > - } else if (!pte_present(pteval)) {
>> > - result = SCAN_PTE_NON_PRESENT;
>> > - goto out;
>> > - } else if (pte_uffd_wp(pteval)) {
>> > - result = SCAN_PTE_UFFD_WP;
>> > - goto out;
>> > - }
>> > - page = vm_normal_page(vma, addr, pteval);
>> > - if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>> > - result = SCAN_PAGE_NULL;
>> > - goto out;
>> > - }
>> >
>> > - folio = page_folio(page);
>> > - if (!folio_test_anon(folio)) {
>> > - VM_WARN_ON_FOLIO(true, folio);
>> > - result = SCAN_PAGE_ANON;
>> > - goto out;
>> > - }
>> > + pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
>> > + &folio, &none_or_zero, NULL, &shared,
>> > + &result, cc);
>> >
>> > - /* See hpage_collapse_scan_pmd(). */
>> > - if (folio_maybe_mapped_shared(folio)) {
>> > - ++shared;
>> > - if (cc->is_khugepaged &&
>> > - shared > khugepaged_max_ptes_shared) {
>> > - result = SCAN_EXCEED_SHARED_PTE;
>> > - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> > - goto out;
>> > - }
>> > - }
>> > + if (pte_check_res == PTE_CHECK_CONTINUE)
>> > + continue;
>> > + else if (pte_check_res == PTE_CHECK_FAIL)
>> > + goto out;
>> >
>> > if (folio_test_large(folio)) {
>> > struct folio *f;
>> > @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > pte_t *pte, *_pte;
>> > int result = SCAN_FAIL, referenced = 0;
>> > int none_or_zero = 0, shared = 0;
>> > - struct page *page = NULL;
>> > struct folio *folio = NULL;
>> > unsigned long addr;
>> > spinlock_t *ptl;
>> > int node = NUMA_NO_NODE, unmapped = 0;
>> > + int pte_check_res;
>> >
>> > VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>> >
>> > @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> > _pte++, addr += PAGE_SIZE) {
>> > pte_t pteval = ptep_get(_pte);
>> > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> > - ++none_or_zero;
>> > - if (!userfaultfd_armed(vma) &&
>> > - (!cc->is_khugepaged ||
>> > - none_or_zero <= khugepaged_max_ptes_none)) {
>> > - continue;
>> > - } else {
>> > - result = SCAN_EXCEED_NONE_PTE;
>> > - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > - goto out_unmap;
>> > - }
>> > - } else if (!pte_present(pteval)) {
>> > - if (non_swap_entry(pte_to_swp_entry(pteval))) {
>> > - result = SCAN_PTE_NON_PRESENT;
>> > - goto out_unmap;
>> > - }
>> >
>> > - ++unmapped;
>> > - if (!cc->is_khugepaged ||
>> > - unmapped <= khugepaged_max_ptes_swap) {
>> > - /*
>> > - * Always be strict with uffd-wp
>> > - * enabled swap entries. Please see
>> > - * comment below for pte_uffd_wp().
>> > - */
>> > - if (pte_swp_uffd_wp(pteval)) {
>> > - result = SCAN_PTE_UFFD_WP;
>> > - goto out_unmap;
>> > - }
>> > - continue;
>> > - } else {
>> > - result = SCAN_EXCEED_SWAP_PTE;
>> > - count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> > - goto out_unmap;
>> > - }
>> > - } else if (pte_uffd_wp(pteval)) {
>> > - /*
>> > - * Don't collapse the page if any of the small
>> > - * PTEs are armed with uffd write protection.
>> > - * Here we can also mark the new huge pmd as
>> > - * write protected if any of the small ones is
>> > - * marked but that could bring unknown
>> > - * userfault messages that falls outside of
>> > - * the registered range. So, just be simple.
>> > - */
>> > - result = SCAN_PTE_UFFD_WP;
>> > - goto out_unmap;
>> > - }
>> > + pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
>> > + &folio, &none_or_zero, &unmapped,
>> > + &shared, &result, cc);
>> >
>> > - page = vm_normal_page(vma, addr, pteval);
>> > - if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>> > - result = SCAN_PAGE_NULL;
>> > - goto out_unmap;
>> > - }
>> > - folio = page_folio(page);
>> > -
>> > - if (!folio_test_anon(folio)) {
>> > - VM_WARN_ON_FOLIO(true, folio);
>> > - result = SCAN_PAGE_ANON;
>> > + if (pte_check_res == PTE_CHECK_CONTINUE)
>> > + continue;
>> > + else if (pte_check_res == PTE_CHECK_FAIL)
>> > goto out_unmap;
>> > - }
>> > -
>> > - /*
>> > - * We treat a single page as shared if any part of the THP
>> > - * is shared.
>> > - */
>> > - if (folio_maybe_mapped_shared(folio)) {
>> > - ++shared;
>> > - if (cc->is_khugepaged &&
>> > - shared > khugepaged_max_ptes_shared) {
>> > - result = SCAN_EXCEED_SHARED_PTE;
>> > - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> > - goto out_unmap;
>> > - }
>> > - }
>> >
>> > /*
>> > * Record which node the original page is from and save this
>> > --
>> > 2.49.0
>> >
>>
>
>
next prev parent reply other threads:[~2026-06-18 13:10 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-08 4:37 [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic Lance Yang
2025-10-08 4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
2025-10-14 12:17 ` Lorenzo Stoakes
2025-10-14 12:27 ` David Hildenbrand
2025-10-15 4:49 ` Lance Yang
2025-10-15 9:16 ` Lorenzo Stoakes
2025-10-15 9:31 ` Lance Yang
2025-10-08 4:37 ` [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios Lance Yang
2025-10-14 12:25 ` Lorenzo Stoakes
2025-10-08 4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
2025-10-09 1:07 ` Andrew Morton
2025-10-09 1:49 ` Lance Yang
2025-10-10 9:10 ` Dev Jain
2025-10-10 10:42 ` Lance Yang
2025-10-10 13:29 ` Wei Yang
2025-10-10 13:55 ` Lance Yang
2025-10-14 12:36 ` Lorenzo Stoakes
2026-06-18 12:22 ` Nico Pache
2026-06-18 13:10 ` Lance Yang [this message]
2026-06-18 13:19 ` Nico Pache
2026-06-18 14:34 ` Lorenzo Stoakes
2025-10-14 17:41 ` Lorenzo Stoakes
2025-10-15 1:48 ` Lance Yang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260618131024.69667-1-lance.yang@linux.dev \
--to=lance.yang@linux.dev \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=npache@redhat.com \
--cc=richard.weiyang@gmail.com \
--cc=ryan.roberts@arm.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox