Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Harry Yoo <harry.yoo@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: jackmanb@google.com, hannes@cmpxchg.org, linmiaohe@huawei.com,
	ziy@nvidia.com, willy@infradead.org, nao.horiguchi@gmail.com,
	david@redhat.com, lorenzo.stoakes@oracle.com,
	william.roche@oracle.com, tony.luck@intel.com,
	wangkefeng.wang@huawei.com, jane.chu@oracle.com,
	akpm@linux-foundation.org, osalvador@suse.de,
	muchun.song@linux.dev, rientjes@google.com, duenwen@google.com,
	jthoughton@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com
Subject: Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
Date: Mon, 29 Dec 2025 10:15:36 +0900	[thread overview]
Message-ID: <aVHWOMa9n40_8Yu3@hyeyoo> (raw)
In-Reply-To: <CACw3F52hZp+xC51bK=UfrfdAWdFEEqhsiJY2+0fzCLzB_3=XYg@mail.gmail.com>

On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > > folio becomes non-HugeTLB, it is released to buddy allocator
> > > as a high-order folio, e.g. a folio that contains 262144 pages
> > > if the folio was a 1G HugeTLB hugepage.
> > >
> > > This is problematic if the HugeTLB hugepage contained HWPoison
> > > subpages. In that case, since buddy allocator does not check
> > > HWPoison for non-zero-order folio, the raw HWPoison page can
> > > be given out with its buddy page and be re-used by either
> > > kernel or userspace.
> > >
> > > Memory failure recovery (MFR) in kernel does attempt to take
> > > raw HWPoison page off buddy allocator after
> > > dissolve_free_hugetlb_folio. However, there is always a time
> > > window between dissolve_free_hugetlb_folio frees a HWPoison
> > > high-order folio to buddy allocator and MFR takes HWPoison
> > > raw page off buddy allocator.
> > >
> > > One obvious way to avoid this problem is to add page sanity
> > > checks in page allocate or free path. However, it is against
> > > the past efforts to reduce sanity check overhead [1,2,3].
> > >
> > > Introduce free_has_hwpoison_pages to only free the healthy
> > > pages and excludes the HWPoison ones in the high-order folio.
> > > The idea is to iterate through the sub-pages of the folio to
> > > identify contiguous ranges of healthy pages. Instead of freeing
> > > pages one by one, decompose healthy ranges into the largest
> > > possible blocks. Each block meets the requirements to be freed
> > > to buddy allocator (__free_frozen_pages).
> > >
> > > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > > number of pages in the folio. While the power-of-two decomposition
> > > ensures that the number of calls to the buddy allocator is
> > > logarithmic for each contiguous healthy range, the mandatory
> > > linear scan of pages to identify PageHWPoison defines the
> > > overall time complexity.
> >
> > Hi Jiaqi, thanks for the patch!
> 
> Thanks for your review/comments!
> 
> >
> > Have you tried measuring the latency of free_has_hwpoison_pages() when
> > a few pages in a 1GB folio are hwpoisoned?
> >
> > Just wanted to make sure we don't introduce a possible soft lockup...
> > Or am I worrying too much?
> 
> In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
> I never run into a soft lockup. The 8-HWPoison-page case takes more
> time than other cases, meaning that handling the additional HWPoison
> page adds to the time cost.
> 
> After adding some instrument code, 10 sample runs of
> free_has_hwpoison_pages with 8 HWPoison pages:
> - observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
> - observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)
> 
> In comparison, freeing a 1G folio without any HWPoison pages 10 times
> (with same kernel config):
> - observed mean is 3.39 ms
> - observed standard deviation is 0.16ms

Thanks for the measurement!
 
> So it's around twice the baseline. It should be far from triggering a
> soft lockup, and the cost seems fair for handling exceptional hardware
> memory errors.

Yeah it looks fine to me.

> I can add these measurements in future revisions.

That would be nice, thanks.

> > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 101 insertions(+)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 822e05f1a9646..20c8862ce594e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > >       }
> > >  }
> > >
> > > +static void prepare_compound_page_to_free(struct page *new_head,
> > > +                                       unsigned int order,
> > > +                                       unsigned long flags)
> > > +{
> > > +     new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > +     new_head->mapping = NULL;
> > > +     new_head->private = 0;
> > > +
> > > +     clear_compound_head(new_head);
> > > +     if (order)
> > > +             prep_compound_page(new_head, order);
> > > +}
> >
> > Not sure why it's building compound pages, just to decompose them
> > when freeing via __free_frozen_pages()?
> 
> prepare_compound_page_to_free() borrowed the idea from
> __split_folio_to_order(). Conceptually the original folio is split
> into new compound pages with different orders;

I see, and per the previous discussion we don't want to split it
to 262,144 4K pages in the future, anyway...

> here this is done on
> the fly in free_contiguous_pages() when order is decided.
>
> > > +/*
> > > + * Given a range of pages physically contiguous physical, efficiently
> > > + * free them in blocks that meet __free_frozen_pages's requirements.
> > > + */
> > > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > > +                               unsigned long flags)
> > > +{
> > > +     unsigned int order;
> > > +     unsigned int align_order;
> > > +     unsigned int size_order;
> > > +     unsigned long pfn;
> > > +     unsigned long end_pfn = page_to_pfn(next);
> > > +     unsigned long remaining;
> > > +
> > > +     /*
> > > +      * This decomposition algorithm at every iteration chooses the
> > > +      * order to be the minimum of two constraints:
> > > +      * - Alignment: the largest power-of-two that divides current pfn.
> > > +      * - Size: the largest power-of-two that fits in the
> > > +      *   current remaining number of pages.
> > > +      */
> > > +     while (curr < next) {
> > > +             pfn = page_to_pfn(curr);
> > > +             remaining = end_pfn - pfn;
> > > +
> > > +             align_order = ffs(pfn) - 1;
> > > +             size_order = fls_long(remaining) - 1;
> > > +             order = min(align_order, size_order);
> > > +
> > > +             prepare_compound_page_to_free(curr, order, flags);
> > > +             __free_frozen_pages(curr, order, FPI_NONE);
> > > +             curr += (1UL << order);
> > > +     }
> > > +
> > > +     VM_WARN_ON(curr != next);
> > > +}
> > > +
> > > +/*
> > > + * Given a high-order compound page containing certain number of HWPoison
> > > + * pages, free only the healthy ones to buddy allocator.
> > > + *
> > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > + * overhead. So only use this when compound page really contains HWPoison.
> > > + *
> > > + * This implementation doesn't work in memdesc world.
> > > + */
> > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > +{
> > > +     struct page *curr = page;
> > > +     struct page *end = page + (1 << order);
> > > +     struct page *next;
> > > +     unsigned long flags = page->flags.f;
> > > +     unsigned long nr_pages;
> > > +     unsigned long total_freed = 0;
> > > +     unsigned long total_hwp = 0;
> > > +
> > > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > +
> > > +     while (curr < end) {
> > > +             next = curr;
> > > +             nr_pages = 0;
> > > +
> > > +             while (next < end && !PageHWPoison(next)) {
> > > +                     ++next;
> > > +                     ++nr_pages;
> > > +             }
> > > +
> > > +             if (PageHWPoison(next))
> > > +                     ++total_hwp;
> > > +
> > > +             free_contiguous_pages(curr, next, flags);
> >
> > page_owner, memory profiling (anything else?) will be confused
> > because it was allocated as a larger size, but we're freeing only
> > some portion of it.
> 
> I am not sure, but looking at __split_unmapped_folio, it calls
> pgalloc_tag_split(folio, old_order, split_order) when splitting an
> old_order-order folio into a new split_order.
> 
> Maybe prepare_compound_page_to_free really should
> update_page_tag_ref(), I need to take a closer look at this with
> CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> 
> > Perhaps we need to run some portion of this code snippet
> > (from free_pages_prepare()), before freeing portions of it:
> >
> >         page_cpupid_reset_last(page);
> >         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >         reset_page_owner(page, order);
> >         page_table_check_free(page, order);
> >         pgalloc_tag_sub(page, 1 << order);
> 
> Since they come from free_pages_prepare, I believe these lines are
> already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> free_pages_prepare(), right? Or am I missing something?

But they're called with order that is smaller than the original order.
That's could be problematic; for example, memory profiling stores metadata
only on the first page. If you pass anything other than the first page
to free_pages_prepare(), it will not recognize that metadata was stored
during allocation.

In general, I think they're not designed to handle cases where
the allocation order and the free order differ (unless we split
metadata like __split_unmapped_folio() does).

> > > +             total_freed += nr_pages;
> > > +             curr = PageHWPoison(next) ? next + 1 : next;
> > > +     }
> > > +
> > > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > > +}
> > > +
> > >  void free_frozen_pages(struct page *page, unsigned int order)
> > >  {
> > > +     struct folio *folio = page_folio(page);
> > > +
> > > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > +             folio_clear_has_hwpoisoned(folio);
> > > +             free_has_hwpoison_pages(page, order);
> > > +             return;
> > > +     }
> > > +
> >
> > It feels like it's a bit random place to do has_hwpoisoned check.
> > Can we move this to free_pages_prepare() where we have some
> > sanity checks (and also order-0 hwpoison page handling)?
> 
> While free_pages_prepare() seems to be a better place to do the
> has_hwpoisoned check, it is not a good place to do
> free_has_hwpoison_pages().

Why is it not a good place for free_has_hwpoison_pages()?

Callers of free_pages_prepare() are supposed to avoid freeing it back to
the buddy or using the page when it returns false.

...except compaction_free(), which I don't have much idea what it's
doing.

> > >       __free_frozen_pages(page, order, FPI_NONE);
> > >  }

-- 
Cheers,
Harry / Hyeonggon

next prev parent reply	other threads:[~2025-12-29  1:16 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-23  5:14   ` Harry Yoo
2025-12-27  1:50     ` Jiaqi Yan
2025-12-29  1:15       ` Harry Yoo [this message]
2025-12-31  0:19         ` Jiaqi Yan
2025-12-31  4:37           ` Harry Yoo
2026-01-06  3:40             ` Jiaqi Yan
2026-01-06  7:35               ` Harry Yoo
2026-01-10  5:48                 ` Jiaqi Yan
2025-12-23  7:45   ` Miaohe Lin
2025-12-27  1:50     ` Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
2025-12-22 22:12   ` Matthew Wilcox
2025-12-22 23:13     ` Jiaqi Yan
2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aVHWOMa9n40_8Yu3@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=duenwen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=jane.chu@oracle.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@suse.de \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=tony.luck@intel.com \
    --cc=vbabka@suse.cz \
    --cc=wangkefeng.wang@huawei.com \
    --cc=william.roche@oracle.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.