All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Zi Yan" <ziy@nvidia.com>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	"Zi Yan" <ziy@nvidia.com>, "Jiaqi Yan" <jiaqiyan@google.com>,
	"Miaohe Lin" <linmiaohe@huawei.com>, <harry.yoo@oracle.com>
Cc: <osalvador@suse.de>, <lorenzo.stoakes@oracle.com>,
	<jackmanb@google.com>, <hannes@cmpxchg.org>,
	<nao.horiguchi@gmail.com>, <david@kernel.org>,
	<william.roche@oracle.com>, <tony.luck@intel.com>,
	<wangkefeng.wang@huawei.com>, <jane.chu@oracle.com>,
	<akpm@linux-foundation.org>, <muchun.song@linux.dev>,
	<liam@infradead.org>, <rientjes@google.com>, <duenwen@google.com>,
	<jthoughton@google.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <vbabka@suse.cz>,
	<rppt@kernel.org>, <shuah@kernel.org>, <surenb@google.com>,
	<mhocko@suse.com>, <boudewijn@delta-utec.com>, <ljs@kernel.org>,
	<osalvador@kernel.org>, <willy@infradead.org>
Subject: Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Date: Thu, 18 Jun 2026 12:04:30 -0400	[thread overview]
Message-ID: <DJCAWX74LL8P.2AX46L7QCCWKV@nvidia.com> (raw)
In-Reply-To: <db7d1c9d-3d73-4502-960c-9c2023b90471@kernel.org>

On Thu Jun 18, 2026 at 10:52 AM EDT, Vlastimil Babka (SUSE) wrote:
> On 6/17/26 03:56, Zi Yan wrote:
>> On Mon Jun 15, 2026 at 11:23 PM EDT, Jiaqi Yan wrote:
>>> On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>>>>
>>>> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>>>>
>>>> > On 2026/5/31 13:58, Jiaqi Yan wrote:
>>>> >> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>>>> >> becomes non-HugeTLB, and it is released to buddy allocator
>>>> >> as a high-order folio, e.g. a folio that contains 262144 pages
>>>> >> if the folio was a 1G HugeTLB hugepage.
>>>> >>
>>>> >> This is problematic if the HugeTLB hugepage contained HWPoison
>>>> >> subpages. In that case, since buddy allocator does not check
>>>> >> HWPoison for non-zero-order folio, the raw HWPoison page can
>>>> >> be given out with its buddy page and be re-used by either
>>>> >> kernel or userspace.
>>>> >>
>>>> >> Memory failure recovery (MFR) in kernel does attempt to take
>>>> >> raw HWPoison page off buddy allocator after
>>>> >> dissolve_free_hugetlb_folio(). However, there is always a time
>>>> >> window between dissolve_free_hugetlb_folio() frees a HWPoison
>>>> >> high-order folio to buddy allocator and MFR takes HWPoison
>>>> >> raw page off buddy allocator.
>>>> >>
>>>> >> Another similar situation is when a transparent huge page (THP)
>>>> >> runs into memory failure but splitting failed. Such THP will
>>>> >> eventually be released to buddy allocator when owning userspace
>>>> >> processes are gone, but with certain subpages having HWPoison.
>>>> >>
>>>> >> One obvious way to avoid both problems is to add page sanity
>>>> >> checks in page allocate or free path. However, it is against
>>>> >> the past efforts to reduce sanity check overhead [1,2,3].
>>>> >>
>>>> >> Introduce free_has_hwpoisoned() to only free the healthy pages
>>>> >> and to exclude the HWPoison ones in the high-order folio.
>>>> >> The idea is to iterate through the sub-pages of the folio to
>>>> >> identify contiguous ranges of healthy pages.
>>>> >>
>>>> >> free_has_hwpoisoned() is added in free_pages_prepare() as
>>>> >> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>>>> >> HWPoison page exists and after checks and preparations in
>>>> >> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>>>> >> can re-use free_prepared_contig_range() [4] to decompose healthy
>>>> >> ranges into the largest possible chunks of different orders.
>>>> >> Every chunk meets the requirements to be freed via free_one_page().
>>>> >>
>>>> >> free_has_hwpoisoned() has linear time complexity wrt the number
>>>> >> of pages in the folio. While the power-of-two decomposition
>>>> >> ensures that the number of calls to the buddy allocator is
>>>> >> logarithmic for each contiguous healthy range, the mandatory
>>>> >> linear scan of pages to identify PageHWPoison() defines the
>>>> >> overall time complexity. For a 1G hugepage having 8 HWPoison
>>>> >> pages, free_has_hwpoisoned() takes around 1ms on average on
>>>> >> a system having 56 Intel Skylake physical cores. This is
>>>> >> 15x to the case of freeing no HWPoison page. The cost is far
>>>> >> from triggering soft lockup, and fair for handling exceptional
>>>> >> hardware memory errors.
>>>> >>
>>>> >> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>>>> >> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>>>> >> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>>> >> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>>>> >>
>>>> >> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>>> >
>>>> > Thanks for your update. This patch looks good to me while some comments below.
>>>> >
>>>> >> ---
>>>> >>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>  1 file changed, 85 insertions(+)
>>>> >>
>>>> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> >> index e47679e7a9db..03df929abca6 100644
>>>> >> --- a/mm/page_alloc.c
>>>> >> +++ b/mm/page_alloc.c
>>>> >> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>>>> >>  unsigned int pageblock_order __read_mostly;
>>>> >>  #endif
>>>> >>
>>>> >> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>>>> >>  static void __free_pages_ok(struct page *page, unsigned int order,
>>>> >>                          fpi_t fpi_flags);
>>>> >>  static void reserve_highatomic_pageblock(struct page *page, int order,
>>>> >> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>>>> >>
>>>> >>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>>>> >>
>>>> >> +/*
>>>> >> + * Returns
>>>> >> + * - true: checks and preparations all good, caller can proceed freeing.
>>>> >> + * - false: do not proceed freeing for one of the following reasons:
>>>> >> + *   1. Some check failed so it is not safe to proceed freeing.
>>>> >> + *   2. A compound page has some HWPoison pages. The healthy pages
>>>> >> + *      are already safely freed, and the HWPoison ones isolated.
>>>> >> + */
>>>> >>  static __always_inline bool __free_pages_prepare(struct page *page,
>>>> >>              unsigned int order, fpi_t fpi_flags)
>>>> >>  {
>>>> >> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>> >>      bool init = want_init_on_free();
>>>> >>      bool compound = PageCompound(page);
>>>> >>      struct folio *folio = page_folio(page);
>>>> >> +    /*
>>>> >> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
>>>> >> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
>>>> >> +     *
>>>> >> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>>>> >> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>>>> >> +     * confuse and complaint that the first tail page is still active.
>>>> >> +     */
>>>> >> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>>>> >>
>>>> >>      if (fpi_flags & FPI_PREPARED)
>>>> >>              return true;
>>>> >> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>> >>
>>>> >>      debug_pagealloc_unmap_pages(page, 1 << order);
>>>> >>
>>>> >> +    /*
>>>> >> +     * After breaking down compound page and dealing with page metadata
>>>> >> +     * (e.g. page owner and page alloc tags), take a shortcut if this
>>>> >> +     * was a compound page containing certain HWPoison subpages.
>>>> >> +     */
>>>> >> +    if (should_fhh) {
>>>> >> +            free_has_hwpoisoned(page, order);
>>>> >> +            return false;
>>>> >> +    }
>>>> >
>>>> > When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
>>>> > kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
>>>> > the hwpoisoned pages. Is it safe to do so?
>>>>
>>>> At least, kernel_poison_pages() writes to the page. It probably should be
>>>> moved up, somewhere like above kernel_poison_pages().
>>>
>>> Writing to HWPoison pages (location having memory error) is usually
>>> safe, as in it doesn't cause a machine check exception. Memory
>>> controller usually just fails the write op and waits for the next read
>>> to raise the MCE / exception for prevent silent data corruption.
>>>
>>>>
>>>> I do not like the shortcut method, since the pages are freed in
>>>> __free_pages_prepare(). This causes confusion. One alternative I can think
>>>
>>> What exactly is the confusion? or why does freeing has to be done by
>>> __free_pages_prepare()'s caller?
>>>
>>> Harry suggested the shortcut method [1]. Although freeing inline might
>>> surprise the caller, it simplifies things for all callers.
>>>
>>> [1] https://lore.kernel.org/linux-mm/aVy7L-3pc4JUYBEn@hyeyoo
>> 
>> The function name is __free_pages_prepare(), but the code ends up with
>> freeing the pages if the folio has HWPoison.
>
> It might free some, or leak some (already before this patch). Seems to me
> really the only important part for the caller is if it can proceed freeing
> or not.

OK, at least add a comment to __free_pages_prepare() to document all
possible outcomes.

>
>>>
>>>> of is to make __free_pages_prepare() returns a enum
>>>> { FREE_PAGE_PREPARE_SUCCESS, FREE_PAGE_PREPARE_FAIL, FREE_PAGE_PREPARE_HAS_HWPOISON}
>>>> and handle the return value in the caller.
>
> Until the callers need to distinguish that, it sounds like an unnecessary
> complication to me.
>

Sure.

>>> Are you worried that a caller of __free_pages_prepare() may see false
>>> returned in the has_hwpoison case, but mistakenly propagate some kind
>>> of error, or doing something under the assumption that folio not
>>> freed? In this case the three enums can be useful. But I checked
>>> current callers of __free_pages_prepare() and they don't have the
>>> above problem.
>> 
>> Right. Looking at compaction_free() code, if dst gets has_hwpoison (not
>> possible now, but if in the future migration code decides to mark folios
>> when copy fails with EHWPOISON), the next list_add() is going to cause
>
> I think we should fix compaction_free() (as a separate patch preceding this
> one) to not proceed if prepare returns false. It could in theory already
> happen before this patch due to a random memory corruption of struct page
> causing some of the existing checks to fail.

Something like below should do the job, plus
Fixes: 733aea0b3a7bb ("mm/compaction: add support for >0 order folio memory compaction.")

diff --git a/mm/compaction.c b/mm/compaction.c
index b776f35ad0200..b2104cbe63477 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1876,10 +1876,12 @@ static void compaction_free(struct folio *dst, unsigned long data)
 	struct page *page = &dst->page;
 
 	if (folio_put_testzero(dst)) {
-		free_pages_prepare(page, order);
+		if (!free_pages_prepare(page, order))
+			goto skip;
 		list_add(&dst->lru, &cc->freepages[order]);
 		cc->nr_freepages += 1 << order;
 	}
+skip:
 	cc->nr_migratepages += 1 << order;
 	/*
 	 * someone else has referenced the page, we cannot take it back to our

>
>> trouble. Or you can rename the function to
>> __free_pages_prepare_and_free_has_hwpoison()? At least, caller knows the
>> potential side effect.
>
> Uh that's long. All callers are from PAGE ALLOCATOR mm-subsystem, it's not a
> driver API so we'll know what we're doing (famous last words :)

The name above is a half joke. :)

BTW, I do not even trust myself sometimes. ;) Just look at the
compaction_free() issue I introduced myself. But I do not want to be too
pedantic here. So a comment above __free_pages_prepare() should be
sufficient.

-- 
Best Regards,
Yan, Zi



  reply	other threads:[~2026-06-18 16:04 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-31  5:58 [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio Jiaqi Yan
2026-05-31  5:58 ` [PATCH v5 1/4] mm/page_alloc: only " Jiaqi Yan
2026-06-09  3:44   ` Miaohe Lin
2026-06-12 18:34     ` Zi Yan
2026-06-16  3:23       ` Jiaqi Yan
2026-06-16  6:53         ` Miaohe Lin
2026-06-18 15:02           ` Vlastimil Babka (SUSE)
2026-06-17  1:56         ` Zi Yan
2026-06-18 14:52           ` Vlastimil Babka (SUSE)
2026-06-18 16:04             ` Zi Yan [this message]
2026-06-15  2:03     ` Jiaqi Yan
2026-05-31  5:58 ` [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio Jiaqi Yan
2026-06-09  6:34   ` Miaohe Lin
2026-05-31  5:58 ` [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page Jiaqi Yan
2026-06-09  7:21   ` Miaohe Lin
2026-06-15  0:16     ` Jiaqi Yan
2026-06-16  7:05       ` Miaohe Lin
2026-05-31  5:58 ` [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test Jiaqi Yan
2026-06-01 18:04   ` Jiaqi Yan
2026-06-17  7:38   ` Miaohe Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DJCAWX74LL8P.2AX46L7QCCWKV@nvidia.com \
    --to=ziy@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=boudewijn@delta-utec.com \
    --cc=david@kernel.org \
    --cc=duenwen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=jackmanb@google.com \
    --cc=jane.chu@oracle.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=liam@infradead.org \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@kernel.org \
    --cc=osalvador@suse.de \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=surenb@google.com \
    --cc=tony.luck@intel.com \
    --cc=vbabka@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=wangkefeng.wang@huawei.com \
    --cc=william.roche@oracle.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.