From: Harry Yoo <harry.yoo@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: jackmanb@google.com, hannes@cmpxchg.org, linmiaohe@huawei.com,
ziy@nvidia.com, willy@infradead.org, nao.horiguchi@gmail.com,
david@redhat.com, lorenzo.stoakes@oracle.com,
william.roche@oracle.com, tony.luck@intel.com,
wangkefeng.wang@huawei.com, jane.chu@oracle.com,
akpm@linux-foundation.org, osalvador@suse.de,
muchun.song@linux.dev, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com
Subject: Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
Date: Wed, 31 Dec 2025 13:37:32 +0900 [thread overview]
Message-ID: <aVSojEWdFPsb1fg7@hyeyoo> (raw)
In-Reply-To: <CACw3F53p9TyAZs2pn+aHo85=aZx0+unaoLjeFtEK8DJ8A5TD4A@mail.gmail.com>
On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > > ---
> > > > > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 101 insertions(+)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > +/*
> > > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > > + * pages, free only the healthy ones to buddy allocator.
> > > > > + *
> > > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > > + *
> > > > > + * This implementation doesn't work in memdesc world.
> > > > > + */
> > > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > > +{
> > > > > + struct page *curr = page;
> > > > > + struct page *end = page + (1 << order);
> > > > > + struct page *next;
> > > > > + unsigned long flags = page->flags.f;
> > > > > + unsigned long nr_pages;
> > > > > + unsigned long total_freed = 0;
> > > > > + unsigned long total_hwp = 0;
> > > > > +
> > > > > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > > +
> > > > > + while (curr < end) {
> > > > > + next = curr;
> > > > > + nr_pages = 0;
> > > > > +
> > > > > + while (next < end && !PageHWPoison(next)) {
> > > > > + ++next;
> > > > > + ++nr_pages;
> > > > > + }
> > > > > +
> > > > > + if (PageHWPoison(next))
> > > > > + ++total_hwp;
> > > > > +
> > > > > + free_contiguous_pages(curr, next, flags);
> > > >
> > > > page_owner, memory profiling (anything else?) will be confused
> > > > because it was allocated as a larger size, but we're freeing only
> > > > some portion of it.
> > >
> > > I am not sure, but looking at __split_unmapped_folio, it calls
> > > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > > old_order-order folio into a new split_order.
> > >
> > > Maybe prepare_compound_page_to_free really should
> > > update_page_tag_ref(), I need to take a closer look at this with
> > > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> > >
> > > > Perhaps we need to run some portion of this code snippet
> > > > (from free_pages_prepare()), before freeing portions of it:
> > > >
> > > > page_cpupid_reset_last(page);
> > > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > > reset_page_owner(page, order);
> > > > page_table_check_free(page, order);
> > > > pgalloc_tag_sub(page, 1 << order);
> > >
> > > Since they come from free_pages_prepare, I believe these lines are
> > > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > > free_pages_prepare(), right? Or am I missing something?
> >
> > But they're called with order that is smaller than the original order.
> > That's could be problematic; for example, memory profiling stores metadata
> > only on the first page. If you pass anything other than the first page
> > to free_pages_prepare(), it will not recognize that metadata was stored
> > during allocation.
> >
>
> Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
> WARNING when freeing all blocks except the 1st one (which contains the
> original head page):
>
> [ 2101.713669] ------------[ cut here ]------------
> [ 2101.713670] alloc_tag was not set
> [ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
> __pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
> [ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
> Tainted: G S W O 6.19.0-smp-DEV #2 NONE
> [ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
> [ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
> ...
> [ 2101.713723] Call Trace:
> [ 2101.713725] <TASK>
> [ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
> [ 2101.713731] free_frozen_pages+0xb3/0x100
> [ 2101.713733] __folio_put+0xd5/0x100
> [ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
> [ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
> [ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
> [ 2101.713751] remove_inode_hugepages+0x209/0x690
> [ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
> [ 2101.713760] ? __cond_resched+0x23/0x60
> [ 2101.713768] ? n_tty_write+0x4c7/0x500
> [ 2101.713773] hugetlbfs_setattr+0x127/0x170
> [ 2101.713776] notify_change+0x32e/0x390
> [ 2101.713781] do_ftruncate+0x12c/0x1a0
> [ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
> [ 2101.713789] do_syscall_64+0x6f/0x890
> [ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 2101.713811] </TASK>
> [ 2101.713812] ---[ end trace 0000000000000000 ]---
>
> This is because in free_pages_prepare(), pgalloc_tag_sub() found there
> is no alloc tag on the compound page being freeing.
>
> > In general, I think they're not designed to handle cases where
> > the allocation order and the free order differ (unless we split
> > metadata like __split_unmapped_folio() does).
>
> I believe the proper way to fix this is to do something similar to
> pgalloc_tag_split(), used by __split_unmapped_folio().
Yeah, that will work.
The difference is that pgalloc_tag_split(), split_page_owner(), and
split_page_memcg() only support splitting the whole page into many (> 1)
smaller pieces, but we're trying to create only one smaller page from the
original page.
> When we split a
> new block from the original folio, we create a compound page from the
> block (just the way prep_compound_page_to_free does) and link the
> alloc tag of the original head page to the head of the new compound
> page.
>
> Something like copy_alloc_tag (to be added in v3) below to demo
> my idea, assuming tag=pgalloc_tag_get(original head page):
>
> +/*
> + * Point page's alloc tag to an existing one.
> + */
> +static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)
This should be defined in lib/alloc_tag.c
> +{
> + union pgtag_ref_handle handle;
> + union codetag_ref ref;
> + unsigned long pfn = page_to_pfn(page);
> +
> + if (!mem_alloc_profiling_enabled())
> + return;
> +
> + /* tag is NULL if HugeTLB page is allocated in boot process. */
> + if (!tag)
> + return;
> +
> + if (!get_page_tag_ref(page, &ref, &handle))
> + return;
> +
> + /* Avoid overriding existing alloc tag from page. */
> + if (!ref.ct || is_codetag_empty(&ref)) {
Is it an error if the page already has a valid tag?
> + alloc_tag_ref_set(&ref, tag);
> + update_page_tag_ref(handle, &ref);
> + }
> + put_page_tag_ref(handle);
> +}
> +
> +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> + unsigned long flags, struct
> alloc_tag *tag)
> +{
> + head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> + head->mapping = NULL;
> + head->private = 0;
> +
> + clear_compound_head(head);
> + if (order)
> + prep_compound_page(head, order);
> +
> + copy_alloc_tag(head, tag);
Do we need a similar thing for memcg? Probably not, since it should have
been uncharged before freeing (as long as we're not using it for kmem pages)
Perhaps a comment on why we don't need to split memcg will be enough.
Do we need a similar thing for page_owner? I think yes, although it won't
crash or cause a warning, it will be inconsistent if we split page_owner in
split_page()/__split_unmapped_folio()/etc but not in
prep_compound_page_to_free().
> +}
>
> I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> HWPoison pages before pgalloc_tag_sub().
>
> > > > > + total_freed += nr_pages;
> > > > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > > > + }
> > > > > +
> > > > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > +}
> > > > > +
> > > > > void free_frozen_pages(struct page *page, unsigned int order)
> > > > > {
> > > > > + struct folio *folio = page_folio(page);
> > > > > +
> > > > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > + folio_clear_has_hwpoisoned(folio);
> > > > > + free_has_hwpoison_pages(page, order);
> > > > > + return;
> > > > > + }
> > > > > +
> > > >
> > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > Can we move this to free_pages_prepare() where we have some
> > > > sanity checks (and also order-0 hwpoison page handling)?
> > >
> > > While free_pages_prepare() seems to be a better place to do the
> > > has_hwpoisoned check, it is not a good place to do
> > > free_has_hwpoison_pages().
> >
> > Why is it not a good place for free_has_hwpoison_pages()?
> >
> > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > the buddy or using the page when it returns false.
>
> What I mean is, callers of free_pages_prepare() wouldn't know from the
> single false return value whether 1) they should completely bail out
> or 2) they should retry with free_has_hwpoison_pages.
I was thinking that once free_pages_prepare() returns false, it already
has done all the work to isolate HWPoison pages and freed healthy portions,
so the caller doesn't have to do anything and can bail out completely.
> So for now I
> think it'd better that free_frozen_pages does the check and act upon
> the check result. Not sure if there is a better place, or worthy to
> change free_pages_prepare()'s return type?
>
> > ...except compaction_free(), which I don't have much idea what it's
> > doing.
> >
> > > > > __free_frozen_pages(page, order, FPI_NONE);
> > > > > }
--
Cheers,
Harry / Hyeonggon
next prev parent reply other threads:[~2025-12-31 4:38 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-23 5:14 ` Harry Yoo
2025-12-27 1:50 ` Jiaqi Yan
2025-12-29 1:15 ` Harry Yoo
2025-12-31 0:19 ` Jiaqi Yan
2025-12-31 4:37 ` Harry Yoo [this message]
2026-01-06 3:40 ` Jiaqi Yan
2026-01-06 7:35 ` Harry Yoo
2026-01-10 5:48 ` Jiaqi Yan
2025-12-23 7:45 ` Miaohe Lin
2025-12-27 1:50 ` Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
2025-12-22 22:12 ` Matthew Wilcox
2025-12-22 23:13 ` Jiaqi Yan
2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aVSojEWdFPsb1fg7@hyeyoo \
--to=harry.yoo@oracle.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=duenwen@google.com \
--cc=hannes@cmpxchg.org \
--cc=jackmanb@google.com \
--cc=jane.chu@oracle.com \
--cc=jiaqiyan@google.com \
--cc=jthoughton@google.com \
--cc=linmiaohe@huawei.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=nao.horiguchi@gmail.com \
--cc=osalvador@suse.de \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=tony.luck@intel.com \
--cc=vbabka@suse.cz \
--cc=wangkefeng.wang@huawei.com \
--cc=william.roche@oracle.com \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.