From: Harry Yoo <harry.yoo@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: Matthew Wilcox <willy@infradead.org>,
ziy@nvidia.com, david@redhat.com,
Vlastimil Babka <vbabka@suse.cz>,
nao.horiguchi@gmail.com, linmiaohe@huawei.com,
lorenzo.stoakes@oracle.com, william.roche@oracle.com,
tony.luck@intel.com, wangkefeng.wang@huawei.com,
jane.chu@oracle.com, akpm@linux-foundation.org,
osalvador@suse.de, muchun.song@linux.dev, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Michal Hocko <mhocko@suse.com>,
Suren Baghdasaryan <surenb@google.com>,
Brendan Jackman <jackmanb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
Date: Tue, 18 Nov 2025 19:19:43 +0900 [thread overview]
Message-ID: <aRxIP7StvLCh-dc2@hyeyoo> (raw)
In-Reply-To: <CACw3F50E=AZtgfoExCA-nwS6=NYdFFWpf6+GBUYrWiJOz4xwaw@mail.gmail.com>
On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> > > On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > > > But since we're only doing this on free, we won't need to do folio
> > > > allocations at all; we'll just be able to release the good pages to the
> > > > page allocator and sequester the hwpoison pages.
> > >
> > > [+Cc PAGE ALLOCATOR folks]
> > >
> > > So we need an interface to free only healthy portion of a hwpoison folio.
>
> +1, with some of my own thoughts below.
>
> > >
> > > I think a proper approach to this should be to "free a hwpoison folio
> > > just like freeing a normal folio via folio_put() or free_frozen_pages(),
> > > then the page allocator will add only healthy pages to the freelist and
> > > isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> > > which is too fragile.
> >
> > Yes, I think it should be handled by the page allocator. There may be
>
> I agree with Matthew, Harry, and David. The page allocator seems best
> suited to handle HWPoison subpages without any new folio allocations.
Sorry I should have been clearer. I don't think adding an **explicit**
interface to free an hwpoison folio is worth; instead implicitly
handling during freeing of a folio seems more feasible.
> > some complexity to this that I've missed, eg if hugetlb wants to retain
> > the good 2MB chunks of a 1GB allocation. I'm not sure that's a useful
> > thing to do or not.
> >
> > > In fact, that can be done by teaching free_pages_prepare() how to handle
> > > the case where one or more subpages of a folio are hwpoison pages.
> > >
> > > How this should be implemented in the page allocator in memdescs world?
> > > Hmm, we'll want to do some kind of non-uniform split, without actually
> > > splitting the folio but allocating struct buddy?
> >
> > Let me sketch that out, realising that it's subject to change.
> >
> > A page in buddy state can't need a memdesc allocated. Otherwise we're
> > allocating memory to free memory, and that way lies madness. We can't
> > do the hack of "embed struct buddy in the page that we're freeing"
> > because HIGHMEM. So we'll never shrink struct page smaller than struct
> > buddy (which is fine because I've laid out how to get to a 64 bit struct
> > buddy, and we're probably two years from getting there anyway).
> >
> > My design for handling hwpoison is that we do allocate a struct hwpoison
> > for a page. It looks like this (for now, in my head):
> >
> > struct hwpoison {
> > memdesc_t original;
> > ... other things ...
> > };
> >
> > So we can replace the memdesc in a page with a hwpoison memdesc when we
> > encounter the error. We still need a folio flag to indicate that "this
> > folio contains a page with hwpoison". I haven't put much thought yet
> > into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> > includes an index of where the actually poisoned page is in the folio,
> > so it doesn't matter if the pages alias with each other as we can recover
> > the information when it becomes useful to do so.
> >
> > > But... for now I think hiding this complexity inside the page allocator
> > > is good enough. For now this would just mean splitting a frozen page
>
> I want to add one more thing. For HugeTLB, kernel clears the HWPoison
> flag on the folio and move it to every raw pages in raw_hwp_page list
> (see folio_clear_hugetlb_hwpoison). So page allocator has no hint that
> some pages passed into free_frozen_pages has HWPoison. It has to
> traverse 2^order pages to tell, if I am not mistaken, which goes
> against the past effort to reduce sanity checks. I believe this is one
> reason I choosed to handle the problem in hugetlb / memory-failure.
I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
buddy allocator to handle this. free_pages_prepare() already handles
(PageHWPoison(page) && !order) case, we just need to extend that to
support hugetlb folios as well.
> For the new interface Harry requested, is it the caller's
> responsibility to ensure that the folio contains HWPoison pages (to be
> even better, maybe point out the exact ones?), so that page allocator
> at least doesn't waste cycles to search non-exist HWPoison in the set
> of pages?
With implicit handling it would be the page allocator's responsibility
to check and handle hwpoison hugetlb folios.
> Or caller and page allocator need to agree on some contract? Say
> caller has to set has_hwpoisoned flag in non-zero order folio to free.
> This allows the old interface free_frozen_pages an easy way using the
> has_hwpoison flag from the second page. I know has_hwpoison is "#if
> defined" on THP and using it for hugetlb probably is not very clean,
> but are there other concerns?
As you mentioned has_hwpoisoned is used for THPs and for a hugetlb
folio. But for a hugetlb folio folio_test_hwpoison() returns true
if it has at least one hwpoison pages (assuming that we don't clear it
before freeing).
So in free_pages_prepare():
if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
/*
* Handle hwpoison hugetlb folios; transfer the error information
* to individual pages, clear hwpoison flag of the folio,
* perform non-uniform split on the frozen folio.
*/
} else if (PageHWPoison(page) && !order) {
/* We already handle this in the allocator. */
}
This would be sufficient?
Or do we want to handle THPs as well, in case of split failure in
memory_failure()? if so we need to handle folio_test_has_hwpoisoned()
case as well...
> > > inside the page allocator (probably non-uniform?). We can later re-implement
> > > this to provide better support for memdescs.
> >
> > Yes, I like this approach. But then I'm not the page allocator
> > maintainer ;-)
>
> If page allocator maintainers can weigh in here, that will be very helpful!
Yeah, I'm not a maintainer either ;) it'll be great to get opinions
from page allocator folks!
--
Cheers,
Harry / Hyeonggon
next prev parent reply other threads:[~2025-11-18 10:21 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-16 1:47 [PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-11-16 1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
2025-11-16 11:51 ` Matthew Wilcox
2025-11-17 3:15 ` Harry Yoo
2025-11-17 3:21 ` Zi Yan
2025-11-17 3:39 ` Harry Yoo
2025-11-17 13:43 ` Matthew Wilcox
2025-11-18 6:24 ` Jiaqi Yan
2025-11-18 10:19 ` Harry Yoo [this message]
2025-11-18 19:26 ` Jiaqi Yan
2025-11-18 21:54 ` Zi Yan
2025-11-19 12:37 ` Harry Yoo
2025-11-19 19:21 ` Jiaqi Yan
2025-11-19 20:35 ` Zi Yan
2025-11-16 22:38 ` kernel test robot
2025-11-17 17:12 ` David Hildenbrand (Red Hat)
2025-11-16 1:47 ` [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio Jiaqi Yan
2025-11-16 2:10 ` Zi Yan
2025-11-18 5:12 ` Jiaqi Yan
2025-11-17 17:15 ` David Hildenbrand (Red Hat)
2025-11-18 5:17 ` Jiaqi Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aRxIP7StvLCh-dc2@hyeyoo \
--to=harry.yoo@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=jackmanb@google.com \
--cc=jane.chu@oracle.com \
--cc=jiaqiyan@google.com \
--cc=linmiaohe@huawei.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=nao.horiguchi@gmail.com \
--cc=osalvador@suse.de \
--cc=surenb@google.com \
--cc=tony.luck@intel.com \
--cc=vbabka@suse.cz \
--cc=wangkefeng.wang@huawei.com \
--cc=william.roche@oracle.com \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.