Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Lance Yang <lance.yang@linux.dev>
To: npache@redhat.com
Cc: lance.yang@linux.dev, akpm@linux-foundation.org, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	richard.weiyang@gmail.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, david@kernel.org, ljs@kernel.org
Subject: Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
Date: Thu, 18 Jun 2026 21:10:24 +0800	[thread overview]
Message-ID: <20260618131024.69667-1-lance.yang@linux.dev> (raw)
In-Reply-To: <CAA1CXcDv7WVxMcGv9OOZDzTzzcfEAnaGbX5dJGwqN5mSWkmRvA@mail.gmail.com>

+Cc David and Lorenzo, address oops fixed :)

On Thu, Jun 18, 2026 at 06:22:49AM -0600, Nico Pache wrote:
>On Tue, Oct 14, 2025 at 6:37 AM Lorenzo Stoakes
><lorenzo.stoakes@oracle.com> wrote:
>>
>> On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>> > From: Lance Yang <lance.yang@linux.dev>
>> >
>> > As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
>> > and __collapse_huge_page_isolate() was almost duplicated.
>> >
>> > This patch cleans things up by moving all the common PTE checking logic
>> > into a new shared helper, thp_collapse_check_pte(). While at it, we use
>> > vm_normal_folio() instead of vm_normal_page().
>> >
>> > Suggested-by: David Hildenbrand <david@redhat.com>
>> > Suggested-by: Dev Jain <dev.jain@arm.com>
>> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>
>> In general I like the idea, the implementation needs work though.
>>
>> Will check in more detail on respin
>>
>> > ---
>> >  mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>> >  1 file changed, 130 insertions(+), 113 deletions(-)
>> >
>> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> > index b5c0295c3414..7116caae1fa4 100644
>> > --- a/mm/khugepaged.c
>> > +++ b/mm/khugepaged.c
>> > @@ -61,6 +61,12 @@ enum scan_result {
>> >       SCAN_PAGE_FILLED,
>> >  };
>> >
>> > +enum pte_check_result {
>> > +     PTE_CHECK_SUCCEED,
>> > +     PTE_CHECK_CONTINUE,
>>
>> Don't love this logic - this feels like we're essentially abstracting
>> control flow, I mean what does 'continue' mean here? Other than you're in a
>> loop and please continue, which is relying a little too much on what the
>> caller does.
>>
>> if we retain this logic something like PTE_CHECK_SKIP would make more sense.
>>
>>
>> > +     PTE_CHECK_FAIL,
>> > +};
>> > +
>> >  #define CREATE_TRACE_POINTS
>> >  #include <trace/events/huge_memory.h>
>> >
>> > @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>> >       }
>> >  }
>> >
>> > +/*
>> > + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>> > + * @pte:           The PTE to check
>> > + * @vma:           The VMA the PTE belongs to
>> > + * @addr:          The virtual address corresponding to this PTE
>> > + * @foliop:        On success, used to return a pointer to the folio
>> > + *                 Must be non-NULL
>> > + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
>> > + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
>> > + * @shared:        Counter for shared pages. Must be non-NULL
>> > + * @scan_result:   Used to return the failure reason (SCAN_*) on a
>> > + *                 PTE_CHECK_FAIL return. Must be non-NULL
>> > + * @cc:            Collapse control settings
>>
>> Do we really need this many parameters? THis is hard to follow.
>>
>> Of course it being me, I'd immediately prefer a helper struct :)
>
>Hey!
>
>Ive been trying to reimplement this patch, and started converting this
>to a helper struct... At first glance it seems like the right
>approach, but because the scan/isolate share two slightly different
>implementations, we need to leave some logic in the parent. This
>results in polluting the rest of the code significantly just to save
>one ugly function call with many parameters.

Been a while ... so I don't fully recall anymore :)

If you already have a patch or series, maybe just post it and we can look
at the actual diff.

Probably easier than talking about it without seeing the code :D

Cheers, Lance

>
>not sure what to do about that tbh.
>
>e.g., (just one of many examples)
>
>
>-                               NR_ISOLATED_ANON + folio_is_file_lru(folio),
>-                               folio_nr_pages(folio));
>-               VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>-               VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>+                               NR_ISOLATED_ANON + folio_is_file_lru(ctx.folio),
>+                               folio_nr_pages(ctx.folio));
>+               VM_BUG_ON_FOLIO(!folio_test_locked(ctx.folio), ctx.folio);
>+               VM_BUG_ON_FOLIO(folio_test_lru(ctx.folio), ctx.folio);
>
>-               if (folio_test_large(folio))
>-                       list_add_tail(&folio->lru, compound_pagelist);
>+               if (folio_test_large(ctx.folio))
>+                       list_add_tail(ctx.folio->lru, compound_pagelist);
>
>Are we sure this helper structure is the right approach?
>
>-- Nico
>
>>
>> > + *
>> > + * Returns:
>> > + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
>> > + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>> > + *   PTE_CHECK_FAIL     - Abort collapse scan
>> > + */
>> > +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>>
>> There's no need for inline in an internal static function in a file.
>>
>> > +             unsigned long addr, struct folio **foliop, int *none_or_zero,
>> > +             int *unmapped, int *shared, int *scan_result,
>> > +             struct collapse_control *cc)
>> > +{
>> > +     struct folio *folio = NULL;
>> > +
>> > +     if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>> > +             (*none_or_zero)++;
>> > +             if (!userfaultfd_armed(vma) &&
>> > +                 (!cc->is_khugepaged ||
>> > +                  *none_or_zero <= khugepaged_max_ptes_none)) {
>> > +                     return PTE_CHECK_CONTINUE;
>> > +             } else {
>> > +                     *scan_result = SCAN_EXCEED_NONE_PTE;
>> > +                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > +                     return PTE_CHECK_FAIL;
>> > +             }
>> > +     } else if (!pte_present(pte)) {
>>
>> You're now making the if-else issues with previous patches worse by
>> returning which gets even checkpatch to warn you.
>>
>>         if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>>
>> (of course note that I am not convinced you can drop the pte_present()
>> check here)
>>
>>                 (*none_or_zero)++;
>>                 if (!userfaultfd_armed(vma) &&
>>                     (!cc->is_khugepaged ||
>>                      *none_or_zero <= khugepaged_max_ptes_none))
>>                         return PTE_CHECK_CONTINUE;
>>
>>                 *scan_result = SCAN_EXCEED_NONE_PTE;
>>                 count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>                 return PTE_CHECK_FAIL;
>>         }
>>
>>         if (!pte_present(pte)) {
>>                 ...
>>         }
>>
>> But even that really needs seriously more refactoring - that condition
>> above is horrible for instance so, e.g.:
>>
>>         if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>>                 (*none_or_zero)++;
>>
>>                 /* Cases where this is acceptable. */
>>                 if (!userfaultfd_armed(vma))
>>                         return PTE_CHECK_SKIP;
>>                 if (!cc->is_khugepaged)
>>                         return PTE_CHECK_SKIP;
>>                 if (*none_or_zero <= khugepaged_max_ptes_none)
>>                         return PTE_CHECK_SKIP;
>>
>>                 /* Otherwise, we must abort the scan. */
>>                 *scan_result = SCAN_EXCEED_NONE_PTE;
>>                 count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>                 return PTE_CHECK_FAIL;
>>         }
>>
>>         if (!pte_present(pte)) {
>>                 ...
>>         }
>>
>> Improves things a lot.
>>
>> I do however _hate_ this (*blah)++; thing. A helper struct would help :)
>>
>>
>>
>> > +             if (!unmapped) {
>> > +                     *scan_result = SCAN_PTE_NON_PRESENT;
>> > +                     return PTE_CHECK_FAIL;
>>
>> Can't we have a helper that sets the result, e.g.:
>>
>>         return pte_check_fail(scan_result, SCAN_PTE_NON_PRESENT);
>>
>> static enum pte_check_result pte_check_fail(int *scan_result,
>>                 enum pte_check_result result)
>> {
>>         *scan_result = result;
>>         return PTE_CHECK_FAIL;
>> }
>>
>> And same for success/skip
>>
>> Then all of these horrible if (blah) { *foo = bar; return baz; } can be
>> made into
>>
>>         if (blah)
>>                 return pte_check_xxx(..., SCAN_PTE_...);
>>
>> > +             }
>> > +
>> > +             if (non_swap_entry(pte_to_swp_entry(pte))) {
>> > +                     *scan_result = SCAN_PTE_NON_PRESENT;
>> > +                     return PTE_CHECK_FAIL;
>> > +             }
>> > +
>> > +             (*unmapped)++;
>> > +             if (!cc->is_khugepaged ||
>> > +                 *unmapped <= khugepaged_max_ptes_swap) {
>> > +                     /*
>> > +                      * Always be strict with uffd-wp enabled swap
>> > +                      * entries. Please see comment below for
>> > +                      * pte_uffd_wp().
>> > +                      */
>> > +                     if (pte_swp_uffd_wp(pte)) {
>> > +                             *scan_result = SCAN_PTE_UFFD_WP;
>> > +                             return PTE_CHECK_FAIL;
>> > +                     }
>> > +                     return PTE_CHECK_CONTINUE;
>> > +             } else {
>> > +                     *scan_result = SCAN_EXCEED_SWAP_PTE;
>> > +                     count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> > +                     return PTE_CHECK_FAIL;
>> > +             }
>> > +     } else if (pte_uffd_wp(pte)) {
>> > +             /*
>> > +              * Don't collapse the page if any of the small PTEs are
>> > +              * armed with uffd write protection. Here we can also mark
>> > +              * the new huge pmd as write protected if any of the small
>> > +              * ones is marked but that could bring unknown userfault
>> > +              * messages that falls outside of the registered range.
>> > +              * So, just be simple.
>> > +              */
>> > +             *scan_result = SCAN_PTE_UFFD_WP;
>> > +             return PTE_CHECK_FAIL;
>> > +     }
>> > +
>>
>> Again as all of the comments for previous series still apply here so
>> obviously I don't like this as-is :)
>>
>> And as Andrew has pointed out, now checkpatch is finding the 'pointless
>> else' issue (as mentioned above also).
>>
>> > +     folio = vm_normal_folio(vma, addr, pte);
>> > +     if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
>> > +             *scan_result = SCAN_PAGE_NULL;
>> > +             return PTE_CHECK_FAIL;
>> > +     }
>> > +
>> > +     if (!folio_test_anon(folio)) {
>> > +             VM_WARN_ON_FOLIO(true, folio);
>> > +             *scan_result = SCAN_PAGE_ANON;
>> > +             return PTE_CHECK_FAIL;
>> > +     }
>> > +
>> > +     /*
>> > +      * We treat a single page as shared if any part of the THP
>> > +      * is shared.
>> > +      */
>> > +     if (folio_maybe_mapped_shared(folio)) {
>> > +             (*shared)++;
>> > +             if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
>> > +                     *scan_result = SCAN_EXCEED_SHARED_PTE;
>> > +                     count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> > +                     return PTE_CHECK_FAIL;
>> > +             }
>> > +     }
>> > +
>> > +     *foliop = folio;
>> > +
>> > +     return PTE_CHECK_SUCCEED;
>> > +}
>> > +
>> >  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> >                                       unsigned long start_addr,
>> >                                       pte_t *pte,
>> >                                       struct collapse_control *cc,
>> >                                       struct list_head *compound_pagelist)
>> >  {
>> > -     struct page *page = NULL;
>> >       struct folio *folio = NULL;
>> >       unsigned long addr = start_addr;
>> >       pte_t *_pte;
>> >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>> > +     int pte_check_res;
>> >
>> >       for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> >            _pte++, addr += PAGE_SIZE) {
>> >               pte_t pteval = ptep_get(_pte);
>> > -             if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> > -                     ++none_or_zero;
>> > -                     if (!userfaultfd_armed(vma) &&
>> > -                         (!cc->is_khugepaged ||
>> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
>> > -                             continue;
>> > -                     } else {
>> > -                             result = SCAN_EXCEED_NONE_PTE;
>> > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > -                             goto out;
>> > -                     }
>> > -             } else if (!pte_present(pteval)) {
>> > -                     result = SCAN_PTE_NON_PRESENT;
>> > -                     goto out;
>> > -             } else if (pte_uffd_wp(pteval)) {
>> > -                     result = SCAN_PTE_UFFD_WP;
>> > -                     goto out;
>> > -             }
>> > -             page = vm_normal_page(vma, addr, pteval);
>> > -             if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>> > -                     result = SCAN_PAGE_NULL;
>> > -                     goto out;
>> > -             }
>> >
>> > -             folio = page_folio(page);
>> > -             if (!folio_test_anon(folio)) {
>> > -                     VM_WARN_ON_FOLIO(true, folio);
>> > -                     result = SCAN_PAGE_ANON;
>> > -                     goto out;
>> > -             }
>> > +             pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
>> > +                                     &folio, &none_or_zero, NULL, &shared,
>> > +                                     &result, cc);
>> >
>> > -             /* See hpage_collapse_scan_pmd(). */
>> > -             if (folio_maybe_mapped_shared(folio)) {
>> > -                     ++shared;
>> > -                     if (cc->is_khugepaged &&
>> > -                         shared > khugepaged_max_ptes_shared) {
>> > -                             result = SCAN_EXCEED_SHARED_PTE;
>> > -                             count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> > -                             goto out;
>> > -                     }
>> > -             }
>> > +             if (pte_check_res == PTE_CHECK_CONTINUE)
>> > +                     continue;
>> > +             else if (pte_check_res == PTE_CHECK_FAIL)
>> > +                     goto out;
>> >
>> >               if (folio_test_large(folio)) {
>> >                       struct folio *f;
>> > @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> >       pte_t *pte, *_pte;
>> >       int result = SCAN_FAIL, referenced = 0;
>> >       int none_or_zero = 0, shared = 0;
>> > -     struct page *page = NULL;
>> >       struct folio *folio = NULL;
>> >       unsigned long addr;
>> >       spinlock_t *ptl;
>> >       int node = NUMA_NO_NODE, unmapped = 0;
>> > +     int pte_check_res;
>> >
>> >       VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>> >
>> > @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> >       for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> >            _pte++, addr += PAGE_SIZE) {
>> >               pte_t pteval = ptep_get(_pte);
>> > -             if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> > -                     ++none_or_zero;
>> > -                     if (!userfaultfd_armed(vma) &&
>> > -                         (!cc->is_khugepaged ||
>> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
>> > -                             continue;
>> > -                     } else {
>> > -                             result = SCAN_EXCEED_NONE_PTE;
>> > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > -                             goto out_unmap;
>> > -                     }
>> > -             } else if (!pte_present(pteval)) {
>> > -                     if (non_swap_entry(pte_to_swp_entry(pteval))) {
>> > -                             result = SCAN_PTE_NON_PRESENT;
>> > -                             goto out_unmap;
>> > -                     }
>> >
>> > -                     ++unmapped;
>> > -                     if (!cc->is_khugepaged ||
>> > -                         unmapped <= khugepaged_max_ptes_swap) {
>> > -                             /*
>> > -                              * Always be strict with uffd-wp
>> > -                              * enabled swap entries.  Please see
>> > -                              * comment below for pte_uffd_wp().
>> > -                              */
>> > -                             if (pte_swp_uffd_wp(pteval)) {
>> > -                                     result = SCAN_PTE_UFFD_WP;
>> > -                                     goto out_unmap;
>> > -                             }
>> > -                             continue;
>> > -                     } else {
>> > -                             result = SCAN_EXCEED_SWAP_PTE;
>> > -                             count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> > -                             goto out_unmap;
>> > -                     }
>> > -             } else if (pte_uffd_wp(pteval)) {
>> > -                     /*
>> > -                      * Don't collapse the page if any of the small
>> > -                      * PTEs are armed with uffd write protection.
>> > -                      * Here we can also mark the new huge pmd as
>> > -                      * write protected if any of the small ones is
>> > -                      * marked but that could bring unknown
>> > -                      * userfault messages that falls outside of
>> > -                      * the registered range.  So, just be simple.
>> > -                      */
>> > -                     result = SCAN_PTE_UFFD_WP;
>> > -                     goto out_unmap;
>> > -             }
>> > +             pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
>> > +                                     &folio, &none_or_zero, &unmapped,
>> > +                                     &shared, &result, cc);
>> >
>> > -             page = vm_normal_page(vma, addr, pteval);
>> > -             if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>> > -                     result = SCAN_PAGE_NULL;
>> > -                     goto out_unmap;
>> > -             }
>> > -             folio = page_folio(page);
>> > -
>> > -             if (!folio_test_anon(folio)) {
>> > -                     VM_WARN_ON_FOLIO(true, folio);
>> > -                     result = SCAN_PAGE_ANON;
>> > +             if (pte_check_res == PTE_CHECK_CONTINUE)
>> > +                     continue;
>> > +             else if (pte_check_res == PTE_CHECK_FAIL)
>> >                       goto out_unmap;
>> > -             }
>> > -
>> > -             /*
>> > -              * We treat a single page as shared if any part of the THP
>> > -              * is shared.
>> > -              */
>> > -             if (folio_maybe_mapped_shared(folio)) {
>> > -                     ++shared;
>> > -                     if (cc->is_khugepaged &&
>> > -                         shared > khugepaged_max_ptes_shared) {
>> > -                             result = SCAN_EXCEED_SHARED_PTE;
>> > -                             count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> > -                             goto out_unmap;
>> > -                     }
>> > -             }
>> >
>> >               /*
>> >                * Record which node the original page is from and save this
>> > --
>> > 2.49.0
>> >
>>
>
>


  reply	other threads:[~2026-06-18 13:10 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-08  4:37 [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic Lance Yang
2025-10-08  4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
2025-10-14 12:17   ` Lorenzo Stoakes
2025-10-14 12:27     ` David Hildenbrand
2025-10-15  4:49       ` Lance Yang
2025-10-15  9:16         ` Lorenzo Stoakes
2025-10-15  9:31           ` Lance Yang
2025-10-08  4:37 ` [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios Lance Yang
2025-10-14 12:25   ` Lorenzo Stoakes
2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
2025-10-09  1:07   ` Andrew Morton
2025-10-09  1:49     ` Lance Yang
2025-10-10  9:10   ` Dev Jain
2025-10-10 10:42     ` Lance Yang
2025-10-10 13:29   ` Wei Yang
2025-10-10 13:55     ` Lance Yang
2025-10-14 12:36   ` Lorenzo Stoakes
2026-06-18 12:22     ` Nico Pache
2026-06-18 13:10       ` Lance Yang [this message]
2026-06-18 13:19         ` Nico Pache
2026-06-18 14:34           ` Lorenzo Stoakes
2025-10-14 17:41   ` Lorenzo Stoakes
2025-10-15  1:48     ` Lance Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260618131024.69667-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=npache@redhat.com \
    --cc=richard.weiyang@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox