The Linux Kernel Mailing List

* [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio
@ 2026-05-31  5:58 Jiaqi Yan
  2026-05-31  5:58 ` [PATCH v5 1/4] mm/page_alloc: only " Jiaqi Yan
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Jiaqi Yan @ 2026-05-31  5:58 UTC (permalink / raw)
  To: ljs, linmiaohe, osalvador, ziy, harry.yoo, willy
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn,
	Jiaqi Yan

At the end of dissolve_free_hugetlb_folio(), a free HugeTLB
folio becomes non-HugeTLB and is released to buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.
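
As a quick check of the arithmetic above, here is a minimal userspace
sketch, assuming a 4 KiB base page (PAGE_SHIFT == 12, as on x86-64).
The sketch_* names are hypothetical and are not kernel helpers.

```c
#include <assert.h>

/* Assumed base page size: 4 KiB (PAGE_SHIFT == 12), as on x86-64. */
#define SKETCH_PAGE_SHIFT 12

/* Number of base pages covered by a hugepage of the given size in bytes. */
static unsigned long sketch_nr_pages(unsigned long long hugepage_bytes)
{
	return (unsigned long)(hugepage_bytes >> SKETCH_PAGE_SHIFT);
}

/* Buddy order of a power-of-two page count, i.e. log2(nr_pages). */
static unsigned int sketch_order(unsigned long nr_pages)
{
	unsigned int order = 0;

	/* Only power-of-two page counts map to a single order. */
	assert(nr_pages != 0 && (nr_pages & (nr_pages - 1)) == 0);
	while ((1UL << order) < nr_pages)
		order++;
	return order;
}
```

Under that assumption, a 1G hugepage is 2^30 / 2^12 = 262144 base
pages (order 18), and a 2M hugepage is 512 base pages (order 9).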

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since buddy allocator does not check
HWPoison for non-zero-order folio, the raw HWPoison page can
be given out with its buddy page and be re-used by either
kernel or userspace.

Memory failure recovery (MFR) code (mm/memory-failure.c) does
attempt to take the raw HWPoison page off the buddy allocator
after dissolve_free_hugetlb_folio(). However, there is always a
time window between dissolve_free_hugetlb_folio() freeing a
HWPoison high-order folio to the buddy allocator and MFR taking
the raw HWPoison page off the buddy allocator.

Another similar situation is when a transparent huge page (THP)
is handled by MFR but splitting failed. Such THP will eventually
be released to buddy allocator when owning userspace processes
are gone, but with certain subpages having HWPoison [9].

One obvious way to avoid both problems is to add page sanity
checks in the page allocation or free path. However, that would
go against past efforts to reduce sanity check overhead [1,2,3].

Introduce free_has_hwpoisoned() to free only the healthy pages
and exclude the HWPoison ones in the high-order folio.
free_has_hwpoisoned() runs at the end of free_pages_prepare(),
which already deals with both decomposing the original compound
page and updating page metadata like alloc tag and page owner.
For performance reasons, it is only applied when PG_has_hwpoisoned
indicates that the folio contains HWPoison page(s).
The idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages. Instead of freeing
pages one by one, free_has_hwpoisoned() then re-uses
free_prepared_contig_range() [11] to decompose healthy ranges into
the largest possible chunks of different orders. Every chunk is
freed via free_one_page().
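
The range scan described above can be modeled in userspace as
follows. This is only a sketch under stated assumptions: a bool
array stands in for PageHWPoison() on each sub-page, struct
sketch_range and sketch_healthy_ranges() are hypothetical names,
and recording a range stands in for handing it to the freeing helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one healthy run: [start, start + len). */
struct sketch_range {
	size_t start;
	size_t len;
};

/*
 * Scan poisoned[0..nr_pages) and record each maximal run of healthy
 * (non-poisoned) pages in out[]. Returns the number of runs found;
 * *nr_poisoned receives the number of poisoned pages skipped.
 * out[] must have room for the worst case, (nr_pages + 1) / 2 entries.
 */
static size_t sketch_healthy_ranges(const bool *poisoned, size_t nr_pages,
				    struct sketch_range *out,
				    size_t *nr_poisoned)
{
	size_t curr = 0, nr_ranges = 0, freed = 0, hwp = 0;

	while (curr < nr_pages) {
		size_t next = curr;

		/* Advance to the next poisoned page or to the end. */
		while (next < nr_pages && !poisoned[next])
			next++;

		/* Record the healthy run, if it is not empty. */
		if (next > curr) {
			out[nr_ranges].start = curr;
			out[nr_ranges].len = next - curr;
			nr_ranges++;
			freed += next - curr;
		}

		if (next == nr_pages)
			break;

		/* poisoned[next] is true here: exclude it and move on. */
		hwp++;
		curr = next + 1;
	}

	/* Every page is either in a healthy run or counted as poisoned. */
	assert(freed + hwp == nr_pages);
	*nr_poisoned = hwp;
	return nr_ranges;
}
```

The final assert mirrors the accounting check at the end of
free_has_hwpoisoned(): freed pages plus excluded pages must add
up to the size of the folio.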

free_has_hwpoisoned() has linear time complexity wrt the number
of pages in the folio. While the power-of-two decomposition
ensures that the number of calls to the buddy allocator is
logarithmic for each contiguous healthy range, the mandatory
linear scan of pages to identify PageHWPoison defines the
overall time complexity.
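
To make the "logarithmic number of calls per range" point concrete,
here is a small userspace sketch of a power-of-two decomposition.
It assumes the usual buddy constraints (each chunk is naturally
aligned and no larger than a maximum order); it is not the code of
free_prepared_contig_range(), whose details may differ, and
sketch_count_chunks() is a hypothetical name.

```c
#include <assert.h>

/*
 * Count the naturally aligned power-of-two chunks needed to cover
 * [pfn, pfn + nr_pages), never exceeding max_order per chunk.
 */
static unsigned long sketch_count_chunks(unsigned long pfn,
					 unsigned long nr_pages,
					 unsigned int max_order)
{
	unsigned long end = pfn + nr_pages;
	unsigned long chunks = 0;

	while (pfn < end) {
		unsigned int order = 0;

		/* Grow the chunk while it stays aligned, in range and capped. */
		while (order < max_order &&
		       (pfn & ((1UL << (order + 1)) - 1)) == 0 &&
		       pfn + (1UL << (order + 1)) <= end)
			order++;

		pfn += 1UL << order;
		chunks++;
	}
	return chunks;
}
```

For example, the 7-page range starting at pfn 1 splits into chunks
of 1, 2 and 4 pages, so three calls replace seven single-page frees.
The scan that finds the ranges still visits every page, which is
why the overall cost stays linear.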

I tested with some test-only code [4] and hugetlb-mfr [5], by
checking the status of pcplist and freelist immediately after
dissolve_free_hugetlb_folio() dissolves a free 2M or 1G HugeTLB
page that contains 1~8 HWPoison raw pages:

- HWPoison pages are excluded by free_has_hwpoisoned().

- Some healthy pages can be in zone->per_cpu_pageset (pcplist)
  because pcp_count is not high enough. Many healthy pages are
  in some order's zone->free_area[order].free_list (freelist).

- In rare cases, some healthy pages are in neither pcplist
  nor freelist. My best guess is that they are allocated before
  the test checks.

To illustrate the latency free_has_hwpoisoned() adds to the
memory freeing path, I measured its time cost with 8 HWPoison
pages, using the instrumentation code in [4], over 20 sample runs
on a machine with 56 Intel Skylake physical cores and 768GB memory:

- Has HWPoison path: mean=1030us, stdev=21us

- No HWPoison path: mean=66us, stdev=6us

The has-HWPoison path costs around 15x the baseline (1030us vs
66us). The cost is nontrivial, but still far from triggering a soft
lockup, and fair for handling exceptional hardware memory errors.

Now that free_has_hwpoisoned() ensures HWPoison pages never make it
into the buddy allocator, MFR doesn't need take_page_off_buddy() anymore
after dissolving HWPoison hugepages. So replace __page_handle_poison()
with the new __hugepage_handle_poison() at HugeTLB specific call sites.
It is worth noting that this patchset doesn't affect the soft
offline behavior in MFR. This is because soft offline does not
folio_set_hwpoison() upfront and, for the HugeTLB case, doesn't involve
get_huge_page_for_hwpoison().
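
The switch of helpers also flips the success convention at the call
sites. The sketch below models that in userspace; the enum and
function names are hypothetical stand-ins, and the return values
follow what this series describes (old helper: 1 means dissolved and
taken off buddy; new helper: 0 means the dissolve succeeded, a
negative errno such as -ENOMEM means it did not).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's memory-failure result codes. */
enum sketch_mf_result { SKETCH_MF_FAILED, SKETCH_MF_RECOVERED };

/*
 * Old call-site convention: the handler returns 1 when the page was
 * dissolved and taken off buddy; 0 or a negative errno otherwise.
 */
static enum sketch_mf_result sketch_old_result(int ret)
{
	return ret > 0 ? SKETCH_MF_RECOVERED : SKETCH_MF_FAILED;
}

/*
 * New call-site convention: the handler returns 0 when the dissolve
 * succeeded and a negative errno (e.g. -ENOMEM) otherwise.
 */
static enum sketch_mf_result sketch_new_result(int ret)
{
	return ret ? SKETCH_MF_FAILED : SKETCH_MF_RECOVERED;
}
```

Note that a return value of 0 means failure under the old convention
and success under the new one, so the two helpers are not
interchangeable at a call site without also swapping the branches.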

To provide test coverage for the new __hugepage_handle_poison()
in me_huge_page(), the last commit adds a MADV_HARD test variant
for anonymous 1G HugeTLB pages. It also covers the code path that
frees a 1G HugeTLB page that contains 1 raw HWPoison page.
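
One thing the new test checks is that HardwareCorrupted in
/proc/meminfo grows by a single base page rather than by the
hugepage size. A minimal sketch of that accounting follows; it
parses meminfo-style text from a string so it can run anywhere,
and the sketch_* helpers are hypothetical, not the selftest's own
functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the "HardwareCorrupted:" value (in kB) from /proc/meminfo-style
 * text. Returns 0 and stores the value in *kb on success, -1 if absent.
 */
static int sketch_parse_hw_corrupted_kb(const char *meminfo,
					unsigned long *kb)
{
	static const char key[] = "HardwareCorrupted:";
	const char *line = strstr(meminfo, key);

	if (!line)
		return -1;
	return sscanf(line + strlen(key), "%lu", kb) == 1 ? 0 : -1;
}

/*
 * Expected HardwareCorrupted growth, in kB, when nr_raw_pages base
 * pages of base_page_bytes each are poisoned inside a hugepage.
 */
static unsigned long sketch_expected_delta_kb(unsigned long nr_raw_pages,
					      unsigned long base_page_bytes)
{
	return nr_raw_pages * base_page_bytes / 1024;
}
```

With 4 KiB base pages and one raw HWPoison page, the expected delta
is 4 kB regardless of whether the hugepage was 2M or 1G.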

Based on commit 43eedcbb989c ("userfaultfd: make functions that are not used outside uffd static")

Changelog

v4 [10] -> v5

- Rebase to very recent akpm/mm-unstable.

- Re-use free_prepared_contig_range() introduced by [11], and remove
  free_contiguous_pages() in v4.

- Instead of using struct page pointer, iterate over pfn in
  free_has_hwpoisoned().

- Re-tested and re-evaluated free_has_hwpoisoned()'s time cost.

- Add memory failure recovery test for anonymous 1G HugeTLB page to
  gain test coverage for __hugepage_handle_poison() and for freeing
  1G HugeTLB page that has 1 HWPoison page.

v3 [8] -> v4

- Address comments from Zi Yan, Miaohe Lin, Harry Yoo.

- Set has_hwpoisoned flag after introducing free_has_hwpoisoned().

- Unwrap free_pages_prepare_has_hwpoisoned() into free_pages_prepare().

- If folio has HWPoison, its healthy pages will be freed with FPI_NONE
  right in free_pages_prepare(), which returns false to indicate that
  the caller should not proceed with its own freeing action.

- Rework the commit on __page_handle_poison. Only change the handling
  for HWPoison HugeTLB page, leaving free buddy page and soft offline
  handling alone.

v2 [7] -> v3:

- Address comments from Matthew Wilcox, Harry Yoo, Miaohe Lin.

- Let free_has_hwpoisoned() happen after free_pages_prepare(),
  which helps to deal with decomposing the original compound page,
  and with page metadata like alloc tag and page owner.

- Tested with "page_owner=on" and CONFIG_MEM_ALLOC_PROFILING*=y.

- Wrap checking PG_has_hwpoisoned and free_has_hwpoisoned() into
  free_pages_prepare_has_hwpoisoned(), which replaces
  free_pages_prepare() calls in free_frozen_pages().

- Rename free_has_hwpoison_page() to free_has_hwpoisoned().

- Measure latency added by free_has_hwpoisoned().

- Ensure struct page *end is only used for pointer arithmetic,
  instead of accessed as page.

- Refactor page_handle_poison() instead of just __page_handle_poison().

v1 [6] -> v2:

- Total reimplementation based on discussions with Matthew Wilcox,
  Harry Yoo, Zi Yan etc.

- hugetlb_free_hwpoison_folio() => free_has_hwpoison_pages().

- Utilize has_hwpoisoned flag to tell buddy allocator a high-order
  folio contains HWPoison.

- Simplify __page_handle_poison() given that the HWPoison page(s)
  won't be freed within high-order folio.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://drive.google.com/file/d/1CzJn1Cc4wCCm183Y77h244fyZIkTLzCt/view?usp=sharing
[5] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com
[6] https://lore.kernel.org/linux-mm/20251116014721.1561456-1-jiaqiyan@google.com
[7] https://lore.kernel.org/linux-mm/20251219183346.3627510-1-jiaqiyan@google.com
[8] https://lore.kernel.org/linux-mm/20260112004923.888429-1-jiaqiyan@google.com
[9] https://lore.kernel.org/linux-mm/20260113205441.506897-1-boudewijn@delta-utec.com
[10] https://lore.kernel.org/linux-mm/20260202194125.2191216-1-jiaqiyan@google.com
[11] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com

Jiaqi Yan (4):
  mm/page_alloc: only free healthy pages in high-order has_hwpoisoned
    folio
  mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio
  mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison
    HugeTLB page
  selftests/mm: add hard memory failure anonymous 1G HugeTLB page test

 include/linux/page-flags.h                  |  2 +-
 mm/memory-failure.c                         | 37 +++++++--
 mm/page_alloc.c                             | 85 +++++++++++++++++++++
 tools/testing/selftests/mm/memory-failure.c | 73 ++++++++++++++++--
 4 files changed, 185 insertions(+), 12 deletions(-)

-- 
2.54.0.823.g6e5bcc1fc9-goog


* [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-05-31  5:58 [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio Jiaqi Yan
@ 2026-05-31  5:58 ` Jiaqi Yan
  2026-06-09  3:44   ` Miaohe Lin
  2026-05-31  5:58 ` [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio Jiaqi Yan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2026-05-31  5:58 UTC (permalink / raw)
  To: ljs, linmiaohe, osalvador, ziy, harry.yoo, willy
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn,
	Jiaqi Yan

At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
becomes non-HugeTLB, and it is released to buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since buddy allocator does not check
HWPoison for non-zero-order folio, the raw HWPoison page can
be given out with its buddy page and be re-used by either
kernel or userspace.

Memory failure recovery (MFR) in kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio(). However, there is always a time
window between dissolve_free_hugetlb_folio() freeing a HWPoison
high-order folio to the buddy allocator and MFR taking the raw
HWPoison page off the buddy allocator.

Another similar situation is when a transparent huge page (THP)
runs into memory failure but splitting failed. Such THP will
eventually be released to buddy allocator when owning userspace
processes are gone, but with certain subpages having HWPoison.

One obvious way to avoid both problems is to add page sanity
checks in the page allocation or free path. However, that would
go against past efforts to reduce sanity check overhead [1,2,3].

Introduce free_has_hwpoisoned() to only free the healthy pages
and to exclude the HWPoison ones in the high-order folio.
The idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages.

free_has_hwpoisoned() is added in free_pages_prepare() as
a shortcut. It is only invoked if PG_has_hwpoisoned indicates
that a HWPoison page exists, and only after the checks and
preparations in free_pages_prepare() have all succeeded.
free_has_hwpoisoned() can then re-use free_prepared_contig_range() [4]
to decompose healthy ranges into the largest possible chunks of
different orders. Every chunk meets the requirements to be freed
via free_one_page().

free_has_hwpoisoned() has linear time complexity wrt the number
of pages in the folio. While the power-of-two decomposition
ensures that the number of calls to the buddy allocator is
logarithmic for each contiguous healthy range, the mandatory
linear scan of pages to identify PageHWPoison() defines the
overall time complexity. For a 1G hugepage having 8 HWPoison
pages, free_has_hwpoisoned() takes around 1ms on average on
a system having 56 Intel Skylake physical cores. This is about
15x the cost of freeing a folio with no HWPoison page. The cost
is far from triggering a soft lockup, and fair for handling
exceptional hardware memory errors.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e47679e7a9db..03df929abca6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 unsigned int pageblock_order __read_mostly;
 #endif

+static void free_has_hwpoisoned(struct page *page, unsigned int order);
 static void __free_pages_ok(struct page *page, unsigned int order,
 			    fpi_t fpi_flags);
 static void reserve_highatomic_pageblock(struct page *page, int order,
@@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)

 #endif /* CONFIG_MEM_ALLOC_PROFILING */

+/*
+ * Returns
+ * - true: checks and preparations all good, caller can proceed freeing.
+ * - false: do not proceed freeing for one of the following reasons:
+ *   1. Some check failed so it is not safe to proceed freeing.
+ *   2. A compound page has some HWPoison pages. The healthy pages
+ *      are already safely freed, and the HWPoison ones isolated.
+ */
 static __always_inline bool __free_pages_prepare(struct page *page,
 		unsigned int order, fpi_t fpi_flags)
 {
@@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
 	bool init = want_init_on_free();
 	bool compound = PageCompound(page);
 	struct folio *folio = page_folio(page);
+	/*
+	 * When dealing with compound page, PG_has_hwpoisoned is cleared
+	 * with PAGE_FLAGS_SECOND. So the check must be done first.
+	 *
+	 * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND:
+	 * because PG_has_hwpoisoned == PG_active, free_page_is_bad() would
+	 * get confused and complain that the first tail page is still active.
+	 */
+	bool should_fhh = compound && folio_test_has_hwpoisoned(folio);

 	if (fpi_flags & FPI_PREPARED)
 		return true;
@@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,

 	debug_pagealloc_unmap_pages(page, 1 << order);

+	/*
+	 * After breaking down compound page and dealing with page metadata
+	 * (e.g. page owner and page alloc tags), take a shortcut if this
+	 * was a compound page containing certain HWPoison subpages.
+	 */
+	if (should_fhh) {
+		free_has_hwpoisoned(page, order);
+		return false;
+	}
+
 	return true;
 }

@@ -6936,6 +6964,63 @@ void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
 	__free_contig_range_common(pfn, nr_pages, /* is_frozen= */ false);
 }

+/*
+ * Given a high-order compound page containing certain number of HWPoison
+ * pages, free only the healthy ones.
+ *
+ * Pages must have passed free_pages_prepare(). Even if having HWPoison
+ * pages, breaking down compound page and updating metadata (e.g. page
+ * owner, alloc tag) can be done together during free_pages_prepare(),
+ * which simplifies the splitting here: unlike __split_unmapped_folio(),
+ * there is no need to turn split pages into a compound page or to carry
+ * metadata.
+ *
+ * It scans every raw page of the compound page and causes nontrivial overhead.
+ * So only use this when the compound page contains HWPoison page(s).
+ *
+ * This implementation needs rework in memdesc world.
+ */
+static void free_has_hwpoisoned(struct page *page, unsigned int order)
+{
+	unsigned long curr = page_to_pfn(page);
+	unsigned long end_pfn = curr + (1 << order);
+	unsigned long next;
+	unsigned long total_freed = 0;
+	unsigned long total_hwp = 0;
+
+	VM_WARN_ON(order == 0);
+	VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
+
+	while (curr < end_pfn) {
+		next = curr;
+
+		while (next < end_pfn && !PageHWPoison(pfn_to_page(next)))
+			++next;
+
+		if (next != end_pfn && PageHWPoison(pfn_to_page(next))) {
+			/*
+			 * Avoid accounting error when the page is freed
+			 * by unpoison_memory().
+			 */
+			clear_page_tag_ref(pfn_to_page(next));
+			++total_hwp;
+		}
+
+		free_prepared_contig_range(pfn_to_page(curr), next - curr);
+		total_freed += next - curr;
+
+		if (next == end_pfn)
+			break;
+
+		VM_WARN_ON(!PageHWPoison(pfn_to_page(next)));
+		curr = next + 1;
+	}
+
+	VM_WARN_ON(total_freed + total_hwp != (1 << order));
+	pr_info("Freed %#lx pages, excluded %lu HWPoison pages\n",
+		total_freed, total_hwp);
+}
+
 #ifdef CONFIG_CONTIG_ALLOC
 /* Usage: See admin-guide/dynamic-debug-howto.rst */
 static void alloc_contig_dump_pages(struct list_head *page_list)
-- 
2.54.0.823.g6e5bcc1fc9-goog


* [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio
  2026-05-31  5:58 [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio Jiaqi Yan
  2026-05-31  5:58 ` [PATCH v5 1/4] mm/page_alloc: only " Jiaqi Yan
@ 2026-05-31  5:58 ` Jiaqi Yan
  2026-06-09  6:34   ` Miaohe Lin
  2026-05-31  5:58 ` [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page Jiaqi Yan
  2026-05-31  5:58 ` [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test Jiaqi Yan
  3 siblings, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2026-05-31  5:58 UTC (permalink / raw)
  To: ljs, linmiaohe, osalvador, ziy, harry.yoo, willy
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn,
	Jiaqi Yan

When a free HWPoison HugeTLB folio is dissolved, it becomes
non-HugeTLB and is released to buddy allocator as a high-order
folio.

Set has_hwpoisoned flags on the high-order folio so that buddy
allocator can tell that it contains certain HWPoison page(s),
and can handle it specially with free_has_hwpoisoned().

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/page-flags.h | 2 +-
 mm/memory-failure.c        | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..223ec3b2d62f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -893,7 +893,7 @@ static inline int PageTransCompound(const struct page *page)
 TESTPAGEFLAG_FALSE(TransCompound, transcompound)
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#if defined(CONFIG_MEMORY_FAILURE) && (defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE))
 /*
  * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the
  * compound page.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c405..95979b7995c1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1951,6 +1951,7 @@ void folio_clear_hugetlb_hwpoison(struct folio *folio)
 	if (folio_test_hugetlb_vmemmap_optimized(folio))
 		return;
 	folio_clear_hwpoison(folio);
+	folio_set_has_hwpoisoned(folio);
 	folio_free_raw_hwp(folio, true);
 }
 
-- 
2.54.0.823.g6e5bcc1fc9-goog



* [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page
  2026-05-31  5:58 [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio Jiaqi Yan
  2026-05-31  5:58 ` [PATCH v5 1/4] mm/page_alloc: only " Jiaqi Yan
  2026-05-31  5:58 ` [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio Jiaqi Yan
@ 2026-05-31  5:58 ` Jiaqi Yan
  2026-06-09  7:21   ` Miaohe Lin
  2026-05-31  5:58 ` [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test Jiaqi Yan
  3 siblings, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2026-05-31  5:58 UTC (permalink / raw)
  To: ljs, linmiaohe, osalvador, ziy, harry.yoo, willy
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn,
	Jiaqi Yan

Now that HWPoison subpage(s) within a HugeTLB page are rejected by
the buddy allocator during dissolve_free_hugetlb_folio(), there is no
need to drain_all_pages() and take_page_off_buddy() anymore. In fact,
calling take_page_off_buddy() after a successful
dissolve_free_hugetlb_folio() returns false, making the caller think
__page_handle_poison() failed.

Add __hugepage_handle_poison() and replace __page_handle_poison() at
HugeTLB specific call sites. The HugeTLB page being handled is either
free at the moment of try_memory_failure_hugetlb(), or becomes
free at the moment of me_huge_page().

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 95979b7995c1..098c4407e818 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -163,6 +163,30 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
 static DEFINE_MUTEX(pfn_space_lock);
 
 /*
+ * Only for a HugeTLB page being handled by memory_failure(). The key
+ * difference from soft_offline() is that no HWPoison subpage makes it
+ * into the buddy allocator after a successful dissolve_free_hugetlb_folio(),
+ * so take_page_off_buddy() is unnecessary.
+ */
+static int __hugepage_handle_poison(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	VM_WARN_ON_FOLIO(!folio_test_hwpoison(folio), folio);
+
+	/*
+	 * Can't use dissolve_free_hugetlb_folio() without a reliable
+	 * raw_hwp_list telling which subpage is HWPoison.
+	 */
+	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
+		/* raw_hwp_list becomes unreliable when kmalloc() fails. */
+		return -ENOMEM;
+
+	return dissolve_free_hugetlb_folio(folio);
+}
+
+/*
+ * Only for a free or HugeTLB page being handled by soft_offline().
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
  *   0:   the page is dissolved (if needed) and not taken off from buddy,
@@ -1166,11 +1190,11 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		 * subpages.
 		 */
 		folio_put(folio);
-		if (__page_handle_poison(p) > 0) {
+		if (__hugepage_handle_poison(p)) {
+			res = MF_FAILED;
+		} else {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
-		} else {
-			res = MF_FAILED;
 		}
 	}
 
@@ -2076,11 +2100,11 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags)
 	 */
 	if (res == MF_HUGETLB_FREED) {
 		folio_unlock(folio);
-		if (__page_handle_poison(p) > 0) {
+		if (__hugepage_handle_poison(p)) {
+			res = MF_FAILED;
+		} else {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
-		} else {
-			res = MF_FAILED;
 		}
 		return action_result(pfn, MF_MSG_FREE_HUGE, res);
 	}
-- 
2.54.0.823.g6e5bcc1fc9-goog



* [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test
  2026-05-31  5:58 [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio Jiaqi Yan
                   ` (2 preceding siblings ...)
  2026-05-31  5:58 ` [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page Jiaqi Yan
@ 2026-05-31  5:58 ` Jiaqi Yan
  2026-06-01 18:04   ` Jiaqi Yan
  2026-06-17  7:38   ` Miaohe Lin
  3 siblings, 2 replies; 20+ messages in thread
From: Jiaqi Yan @ 2026-05-31  5:58 UTC (permalink / raw)
  To: ljs, linmiaohe, osalvador, ziy, harry.yoo, willy
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn,
	Jiaqi Yan

Add a new testcase to validate memory failure recovery for HWPoison
anonymous 1G HugeTLB page, including proper SIGBUS delivery,
releasing a 1G HugeTLB page containing one HWPoison page to buddy
allocator, and isolation of the raw HWPoison page.

This patch does not support testing the MADV_SOFT variant, although
it can be added in the future.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 tools/testing/selftests/mm/memory-failure.c | 73 +++++++++++++++++++--
 1 file changed, 68 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/memory-failure.c b/tools/testing/selftests/mm/memory-failure.c
index 032ed952057c..ea43b2877c81 100644
--- a/tools/testing/selftests/mm/memory-failure.c
+++ b/tools/testing/selftests/mm/memory-failure.c
@@ -18,6 +18,7 @@
 #include <linux/magic.h>
 #include <errno.h>
 
+#include "hugepage_settings.h"
 #include "vm_util.h"
 
 enum inject_type {
@@ -27,6 +28,7 @@ enum inject_type {
 
 enum result_type {
 	MADV_HARD_ANON,
+	MADV_HARD_ANON_HUGETLB,
 	MADV_HARD_CLEAN_PAGECACHE,
 	MADV_HARD_DIRTY_PAGECACHE,
 	MADV_SOFT_ANON,
@@ -47,6 +49,8 @@ FIXTURE(memory_failure)
 	int pagemap_fd;
 	int kpageflags_fd;
 	bool triggered;
+	/* Number of initial HugeTLB pages with default page size. */
+	unsigned long nr_hugetlb_pages;
 };
 
 FIXTURE_VARIANT(memory_failure)
@@ -157,11 +161,11 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
 		  void *vaddr, enum result_type type, int setjmp)
 {
 	unsigned long size;
+	unsigned long nr_hugetlb_pages;
 	uint64_t pfn_flags;
 
 	switch (type) {
 	case MADV_SOFT_ANON:
-	case MADV_HARD_CLEAN_PAGECACHE:
 	case MADV_SOFT_CLEAN_PAGECACHE:
 	case MADV_SOFT_DIRTY_PAGECACHE:
 		/* It is not expected to receive a SIGBUS signal. */
@@ -174,6 +178,7 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
 		ASSERT_NE(pagemap_get_pfn(self->pagemap_fd, vaddr), self->pfn);
 		break;
 	case MADV_HARD_ANON:
+	case MADV_HARD_ANON_HUGETLB:
 	case MADV_HARD_DIRTY_PAGECACHE:
 		/* The SIGBUS signal should have been received. */
 		ASSERT_EQ(setjmp, 1);
@@ -183,17 +188,36 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
 		ASSERT_EQ(siginfo.si_code, BUS_MCEERR_AR);
 		ASSERT_EQ(1UL << siginfo.si_addr_lsb, self->page_size);
 		ASSERT_EQ(siginfo.si_addr, vaddr);
-
-		/* XXX Check backing pte is hwpoison entry when supported. */
-		ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));
 		break;
 	default:
 		SKIP(return, "unexpected inject type %d.\n", type);
 	}
 
+	if (type == MADV_HARD_ANON || type == MADV_HARD_DIRTY_PAGECACHE) {
+		/*
+		 * Check backing pte is hwpoison entry when supported.
+		 * Although try_to_unmap_one() also installs hwpoison entry
+		 * for HugeTLB, pagemap_hugetlb_range() doesn't parse
+		 * swap entries at all.
+		 */
+		ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));
+	}
+
 	/* Check if the value of HardwareCorrupted has increased. */
 	ASSERT_EQ(get_hardware_corrupted_size(&size), 0);
-	ASSERT_EQ(size, self->corrupted_size + self->page_size / 1024);
+
+	if (type == MADV_HARD_ANON_HUGETLB) {
+		/*
+		 * Only one page is hardware corrupted; the rest should all be
+		 * released to buddy allocator.
+		 */
+		ASSERT_EQ(size, self->corrupted_size + getpagesize() / 1024);
+		/* HugeTLB should have lost the HWPoison HugeTLB page. */
+		nr_hugetlb_pages = hugetlb_nr_default_pages();
+		ASSERT_EQ(nr_hugetlb_pages + 1, self->nr_hugetlb_pages);
+	} else {
+		ASSERT_EQ(size, self->corrupted_size + self->page_size / 1024);
+	}
 
 	/* Check if HWPoison flag is set. */
 	ASSERT_EQ(pageflags_get(self->pfn, self->kpageflags_fd, &pfn_flags), 0);
@@ -247,6 +271,45 @@ TEST_F(memory_failure, anon)
 	ASSERT_EQ(munmap(addr, self->page_size), 0);
 }
 
+TEST_F(memory_failure, anon_hugetlb)
+{
+	char *addr;
+	int ret;
+	const unsigned long nr_alloc_hugetlb_pages = 4;
+	unsigned long alloc_size;
+
+	if (variant->type == MADV_SOFT)
+		SKIP(return, "Soft offline test is not implemented");
+
+	/* HugeTLB settings will be automatically restored when test exits. */
+	hugetlb_setup_default(nr_alloc_hugetlb_pages);
+
+	alloc_size = default_huge_page_size() * nr_alloc_hugetlb_pages;
+	self->page_size = default_huge_page_size();
+	self->nr_hugetlb_pages = hugetlb_nr_default_pages();
+
+	addr = mmap(0, alloc_size, PROT_READ | PROT_WRITE,
+		    MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
+	if (addr == MAP_FAILED)
+		SKIP(return, "mmap failed, not enough memory or 1G hugetlb not supported.\n");
+	memset(addr, 0xce, alloc_size);
+
+	prepare(_metadata, self, addr);
+
+	ret = sigsetjmp(signal_jmp_buf, 1);
+	if (!self->triggered) {
+		self->triggered = true;
+		ASSERT_EQ(variant->inject(self, addr), 0);
+		FORCE_READ(*addr);
+	}
+
+	check(_metadata, self, addr, MADV_HARD_ANON_HUGETLB, ret);
+
+	cleanup(_metadata, self, addr);
+
+	ASSERT_EQ(munmap(addr, alloc_size), 0);
+}
+
 static int prepare_file(const char *fname, unsigned long size)
 {
 	int fd;
-- 
2.54.0.823.g6e5bcc1fc9-goog



* Re: [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test
  2026-05-31  5:58 ` [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test Jiaqi Yan
@ 2026-06-01 18:04   ` Jiaqi Yan
  2026-06-17  7:38   ` Miaohe Lin
  1 sibling, 0 replies; 20+ messages in thread
From: Jiaqi Yan @ 2026-06-01 18:04 UTC (permalink / raw)
  To: ljs, linmiaohe, osalvador, ziy, harry.yoo, willy
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn

For the reviews from sashiko
(https://sashiko.dev/#/patchset/20260531055829.3636554-1-jiaqiyan@google.com):

On Sat, May 30, 2026 at 10:58 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> Add a new testcase to validate memory failure recovery for HWPoison
> anonymous 1G HugeTLB page, including proper SIGBUS delivery,

This test is no longer 1G specific, I will update the commit msg accordingly.

> releasing a 1G HugeTLB page containing one HWPoison page to buddy
> allocator, and isolation of the raw HWPoison page.
>
> Although can be added in future, this patch does not support testing
> the MADV_SOFT variant.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  tools/testing/selftests/mm/memory-failure.c | 73 +++++++++++++++++++--
>  1 file changed, 68 insertions(+), 5 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/memory-failure.c b/tools/testing/selftests/mm/memory-failure.c
> index 032ed952057c..ea43b2877c81 100644
> --- a/tools/testing/selftests/mm/memory-failure.c
> +++ b/tools/testing/selftests/mm/memory-failure.c
> @@ -18,6 +18,7 @@
>  #include <linux/magic.h>
>  #include <errno.h>
>
> +#include "hugepage_settings.h"
>  #include "vm_util.h"
>
>  enum inject_type {
> @@ -27,6 +28,7 @@ enum inject_type {
>
>  enum result_type {
>         MADV_HARD_ANON,
> +       MADV_HARD_ANON_HUGETLB,
>         MADV_HARD_CLEAN_PAGECACHE,
>         MADV_HARD_DIRTY_PAGECACHE,
>         MADV_SOFT_ANON,
> @@ -47,6 +49,8 @@ FIXTURE(memory_failure)
>         int pagemap_fd;
>         int kpageflags_fd;
>         bool triggered;
> +       /* Number of initial HugeTLB pages with default page size. */
> +       unsigned long nr_hugetlb_pages;
>  };
>
>  FIXTURE_VARIANT(memory_failure)
> @@ -157,11 +161,11 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
>                   void *vaddr, enum result_type type, int setjmp)
>  {
>         unsigned long size;
> +       unsigned long nr_hugetlb_pages;
>         uint64_t pfn_flags;
>
>         switch (type) {
>         case MADV_SOFT_ANON:
> -       case MADV_HARD_CLEAN_PAGECACHE:

Sorry for the unintended change; I will revert it in the next revision.

>         case MADV_SOFT_CLEAN_PAGECACHE:
>         case MADV_SOFT_DIRTY_PAGECACHE:
>                 /* It is not expected to receive a SIGBUS signal. */
> @@ -174,6 +178,7 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
>                 ASSERT_NE(pagemap_get_pfn(self->pagemap_fd, vaddr), self->pfn);
>                 break;
>         case MADV_HARD_ANON:
> +       case MADV_HARD_ANON_HUGETLB:
>         case MADV_HARD_DIRTY_PAGECACHE:
>                 /* The SIGBUS signal should have been received. */
>                 ASSERT_EQ(setjmp, 1);
> @@ -183,17 +188,36 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
>                 ASSERT_EQ(siginfo.si_code, BUS_MCEERR_AR);
>                 ASSERT_EQ(1UL << siginfo.si_addr_lsb, self->page_size);
>                 ASSERT_EQ(siginfo.si_addr, vaddr);
> -
> -               /* XXX Check backing pte is hwpoison entry when supported. */
> -               ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));
>                 break;
>         default:
>                 SKIP(return, "unexpected inject type %d.\n", type);
>         }
>
> +       if (type == MADV_HARD_ANON || type == MADV_HARD_DIRTY_PAGECACHE) {
> +               /*
> +                * Check backing pte is hwpoison entry when supported.
> +                * Although try_to_unmap_one() also installs hwpoison entry
> +                * for HugeTLB, pagemap_hugetlb_range() doesn't parse
> +                * swap entries at all.
> +                */
> +               ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));
> +       }
> +
>         /* Check if the value of HardwareCorrupted has increased. */
>         ASSERT_EQ(get_hardware_corrupted_size(&size), 0);
> -       ASSERT_EQ(size, self->corrupted_size + self->page_size / 1024);
> +
> +       if (type == MADV_HARD_ANON_HUGETLB) {
> +               /*
> +                * Only one page is hardware corrupted; the rest should all be
> +                * released to buddy allocator.
> +                */
> +               ASSERT_EQ(size, self->corrupted_size + getpagesize() / 1024);
> +               /* HugeTLB should have lost the HWPoison HugeTLB page. */
> +               nr_hugetlb_pages = hugetlb_nr_default_pages();
> +               ASSERT_EQ(nr_hugetlb_pages + 1, self->nr_hugetlb_pages);
> +       } else {
> +               ASSERT_EQ(size, self->corrupted_size + self->page_size / 1024);
> +       }
>
>         /* Check if HWPoison flag is set. */
>         ASSERT_EQ(pageflags_get(self->pfn, self->kpageflags_fd, &pfn_flags), 0);
> @@ -247,6 +271,45 @@ TEST_F(memory_failure, anon)
>         ASSERT_EQ(munmap(addr, self->page_size), 0);
>  }
>
> +TEST_F(memory_failure, anon_hugetlb)
> +{
> +       char *addr;
> +       int ret;
> +       const unsigned long nr_alloc_hugetlb_pages = 4;
> +       unsigned long alloc_size;
> +
> +       if (variant->type == MADV_SOFT)
> +               SKIP(return, "Soft offline test is not implemented");
> +
> +       /* HugeTLB settings will be automatically restored when test exits. */
> +       hugetlb_setup_default(nr_alloc_hugetlb_pages);
> +
> +       alloc_size = default_huge_page_size() * nr_alloc_hugetlb_pages;
> +       self->page_size = default_huge_page_size();
> +       self->nr_hugetlb_pages = hugetlb_nr_default_pages();
> +
> +       addr = mmap(0, alloc_size, PROT_READ | PROT_WRITE,
> +                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
> +       if (addr == MAP_FAILED)
> +               SKIP(return, "mmap failed, not enough memory or 1G hugetlb not supported.\n");
> +       memset(addr, 0xce, alloc_size);
> +
> +       prepare(_metadata, self, addr);
> +
> +       ret = sigsetjmp(signal_jmp_buf, 1);
> +       if (!self->triggered) {
> +               self->triggered = true;
> +               ASSERT_EQ(variant->inject(self, addr), 0);
> +               FORCE_READ(*addr);
> +       }
> +
> +       check(_metadata, self, addr, MADV_HARD_ANON_HUGETLB, ret);
> +
> +       cleanup(_metadata, self, addr);
> +
> +       ASSERT_EQ(munmap(addr, alloc_size), 0);
> +}
> +
>  static int prepare_file(const char *fname, unsigned long size)
>  {
>         int fd;
> --
> 2.54.0.823.g6e5bcc1fc9-goog
>


* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-05-31  5:58 ` [PATCH v5 1/4] mm/page_alloc: only " Jiaqi Yan
@ 2026-06-09  3:44   ` Miaohe Lin
  2026-06-12 18:34     ` Zi Yan
  2026-06-15  2:03     ` Jiaqi Yan
  0 siblings, 2 replies; 20+ messages in thread
From: Miaohe Lin @ 2026-06-09  3:44 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On 2026/5/31 13:58, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> becomes non-HugeTLB, and it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio(). However, there is always a time
> window between dissolve_free_hugetlb_folio() frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
> 
> Another similar situation is when a transparent huge page (THP)
> runs into memory failure but splitting failed. Such THP will
> eventually be released to buddy allocator when owning userspace
> processes are gone, but with certain subpages having HWPoison.
> 
> One obvious way to avoid both problems is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
> 
> Introduce free_has_hwpoisoned() to only free the healthy pages
> and to exclude the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages.
> 
> free_has_hwpoisoned() is added in free_pages_prepare() as
> a shortcut and is only invoked if PG_has_hwpoisoned indicates
> HWPoison page exists and after checks and preparations in
> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
> can re-use free_prepared_contig_range() [4] to decompose healthy
> ranges into the largest possible chunks of different orders.
> Every chunk meets the requirements to be freed via free_one_page().
> 
> free_has_hwpoisoned() has linear time complexity wrt the number
> of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison() defines the
> overall time complexity. For a 1G hugepage having 8 HWPoison
> pages, free_has_hwpoisoned() takes around 1ms on average on
> a system having 56 Intel Skylake physical cores. This is
> 15x to the case of freeing no HWPoison page. The cost is far
> from triggering soft lockup, and fair for handling exceptional
> hardware memory errors.
> 
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Thanks for your update. This patch looks good to me, with some comments below.
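
As a side note on the power-of-two decomposition described in the commit message, the idea can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the kernel's free_prepared_contig_range(): struct chunk, decompose_range() and MAX_ORDER_SKETCH are hypothetical names invented for this example. Each healthy range [pfn, end) is cut into the largest chunks that are both naturally aligned and fully inside the range:

```c
#include <assert.h>

/* Hypothetical cap on chunk order, for this sketch only. */
#define MAX_ORDER_SKETCH 10

/* Hypothetical record of one chunk: 2^order pages starting at pfn. */
struct chunk {
	unsigned long pfn;
	unsigned int order;
};

/*
 * Split [pfn, end) into naturally aligned power-of-two chunks, always
 * taking the largest chunk that fits. Returns the number of chunks
 * written to out[], which must have room for at least cap entries.
 */
static unsigned long decompose_range(unsigned long pfn, unsigned long end,
				     struct chunk *out, unsigned long cap)
{
	unsigned long n = 0;

	assert(pfn <= end);
	while (pfn < end) {
		unsigned int order = 0;

		/* Grow the order while the chunk stays aligned and in range. */
		while (order < MAX_ORDER_SKETCH &&
		       (pfn & ((1UL << (order + 1)) - 1)) == 0 &&
		       pfn + (1UL << (order + 1)) <= end)
			order++;

		assert(n < cap);
		out[n].pfn = pfn;
		out[n].order = order;
		n++;
		pfn += 1UL << order;
	}
	return n;
}
```

For example, the range [3, 16) splits into an order-0 chunk at pfn 3, an order-2 chunk at pfn 4 and an order-3 chunk at pfn 8, so the number of chunks stays small even for a long range.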

> ---
>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e47679e7a9db..03df929abca6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>  unsigned int pageblock_order __read_mostly;
>  #endif
>  
> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>  static void __free_pages_ok(struct page *page, unsigned int order,
>  			    fpi_t fpi_flags);
>  static void reserve_highatomic_pageblock(struct page *page, int order,
> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>  
>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>  
> +/*
> + * Returns
> + * - true: checks and preparations all good, caller can proceed freeing.
> + * - false: do not proceed freeing for one of the following reasons:
> + *   1. Some check failed so it is not safe to proceed freeing.
> + *   2. A compound page has some HWPoison pages. The healthy pages
> + *      are already safely freed, and the HWPoison ones isolated.
> + */
>  static __always_inline bool __free_pages_prepare(struct page *page,
>  		unsigned int order, fpi_t fpi_flags)
>  {
> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>  	bool init = want_init_on_free();
>  	bool compound = PageCompound(page);
>  	struct folio *folio = page_folio(page);
> +	/*
> +	 * When dealing with compound page, PG_has_hwpoisoned is cleared
> +	 * with PAGE_FLAGS_SECOND. So the check must be done first.
> +	 *
> +	 * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
> +	 * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
> +	 * confuse and complaint that the first tail page is still active.
> +	 */
> +	bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>  
>  	if (fpi_flags & FPI_PREPARED)
>  		return true;
> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>  
>  	debug_pagealloc_unmap_pages(page, 1 << order);
>  
> +	/*
> +	 * After breaking down compound page and dealing with page metadata
> +	 * (e.g. page owner and page alloc tags), take a shortcut if this
> +	 * was a compound page containing certain HWPoison subpages.
> +	 */
> +	if (should_fhh) {
> +		free_has_hwpoisoned(page, order);
> +		return false;
> +	}

When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
the hwpoisoned pages. Is it safe to do so?

> +
>  	return true;
>  }
>  
> @@ -6936,6 +6964,63 @@ void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>  	__free_contig_range_common(pfn, nr_pages, /* is_frozen= */ false);
>  }
>  
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones.
> + *
> + * Pages must have passed free_pages_prepare(). Even if having HWPoison
> + * pages, breaking down compound page and updating metadata (e.g. page
> + * owner, alloc tag) can be done together during free_pages_prepare(),
> + * which simplifies the splitting here: unlike __split_unmapped_folio(),
> + * there is no need to turn split pages into a compound page or to carry
> + * metadata.
> + *
> + * It scans every raw page of the compound page and cause nontrivial overhead.
> + * So only use this when the compound page contains HWPoison page(s).
> + *
> + * This implementation needs rework in memdesc world.
> + */
> +static void free_has_hwpoisoned(struct page *page, unsigned int order)
> +{
> +	unsigned long curr = page_to_pfn(page);
> +	unsigned long end_pfn = curr + (1 << order);
> +	unsigned long next;
> +	unsigned long total_freed = 0;
> +	unsigned long total_hwp = 0;
> +
> +	VM_WARN_ON(order == 0);
> +	VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
> +
> +	while (curr < end_pfn) {
> +		next = curr;
> +
> +		while (next < end_pfn && !PageHWPoison(pfn_to_page(next)))
> +			++next;
> +
> +		if (next != end_pfn && PageHWPoison(pfn_to_page(next))) {

Checking next != end_pfn should be enough. If next != end_pfn, then PageHWPoison(pfn_to_page(next))
must be true, or we could not have exited the while loop above. Or am I missing something?
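
To illustrate the point, here is a minimal userspace sketch of such a scan. It is not the patch's code: the poison[] array and is_poisoned() are hypothetical stand-ins for PageHWPoison(pfn_to_page(pfn)), and struct healthy_range and scan_healthy_ranges() are names made up for this example. The inner loop can only stop at end_pfn or at a poisoned pfn, so testing next != end_pfn alone is sufficient:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of one maximal run of healthy pages. */
struct healthy_range {
	unsigned long start;	/* first healthy pfn */
	unsigned long end;	/* one past the last healthy pfn */
};

/* Hypothetical stand-in for PageHWPoison(pfn_to_page(pfn)). */
static bool is_poisoned(const bool *poison, unsigned long pfn)
{
	return poison[pfn];
}

/*
 * Scan [start_pfn, end_pfn), record every maximal healthy range in out[]
 * (room for cap entries) and count the poisoned pages in *nr_hwp.
 * Returns the number of healthy ranges found.
 */
static unsigned long scan_healthy_ranges(const bool *poison,
					 unsigned long start_pfn,
					 unsigned long end_pfn,
					 struct healthy_range *out,
					 unsigned long cap,
					 unsigned long *nr_hwp)
{
	unsigned long curr = start_pfn;
	unsigned long next;
	unsigned long n = 0;

	*nr_hwp = 0;
	while (curr < end_pfn) {
		next = curr;
		while (next < end_pfn && !is_poisoned(poison, next))
			++next;

		/* [curr, next) is healthy; it may be empty. */
		if (next > curr) {
			assert(n < cap);
			out[n].start = curr;
			out[n].end = next;
			n++;
		}

		/*
		 * The loop above exits only at end_pfn or at a poisoned
		 * pfn, so next != end_pfn already implies "poisoned".
		 */
		if (next != end_pfn) {
			++*nr_hwp;
			++next;	/* skip the poisoned page */
		}
		curr = next;
	}
	return n;
}
```

With pages 2, 5 and 6 poisoned in an 8-page range, the sketch reports the healthy ranges [0, 2), [3, 5) and [7, 8) and counts three poisoned pages.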

Thanks.
.


* Re: [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio
  2026-05-31  5:58 ` [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio Jiaqi Yan
@ 2026-06-09  6:34   ` Miaohe Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2026-06-09  6:34 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On 2026/5/31 13:58, Jiaqi Yan wrote:
> When a free HWPoison HugeTLB folio is dissolved, it becomes
> non-HugeTLB and is released to buddy allocator as a high-order
> folio.
> 
> Set has_hwpoisoned flags on the high-order folio so that buddy
> allocator can tell that it contains certain HWPoison page(s),
> and can handle it specially with free_has_hwpoisoned().
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Acked-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.
.



* Re: [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page
  2026-05-31  5:58 ` [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page Jiaqi Yan
@ 2026-06-09  7:21   ` Miaohe Lin
  2026-06-15  0:16     ` Jiaqi Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2026-06-09  7:21 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On 2026/5/31 13:58, Jiaqi Yan wrote:
> Now that HWPoison subpage(s) within HugeTLB page will be rejected by
> buddy allocator during dissolve_free_hugetlb_folio(), there is no
> need to drain_all_pages() and take_page_off_buddy() anymore. In fact,
> calling take_page_off_buddy() after dissolve_free_hugetlb_folio()
> succeeded returns false, making caller think __page_handle_poison()
> failed.
> 
> Add __hugepage_handle_poison() and replace __page_handle_poison() at
> HugeTLB specific call sites. The being handled HugeTLB page either
> is free at the moment of try_memory_failure_hugetlb(), or becomes
> free at the moment of me_huge_page().
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++------
>  1 file changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 95979b7995c1..098c4407e818 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -163,6 +163,30 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
>  static DEFINE_MUTEX(pfn_space_lock);
>  
>  /*
> + * Only for a HugeTLB page being handled by memory_failure(). The key
> + * difference to soft_offline() is that, no HWPoison subpage will make
> + * into buddy allocator after a successful dissolve_free_hugetlb_folio(),
> + * so take_page_off_buddy() is unnecessary.
> + */
> +static int __hugepage_handle_poison(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	VM_WARN_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> +
> +	/*
> +	 * Can't use dissolve_free_hugetlb_folio() without a reliable
> +	 * raw_hwp_list telling which subpage is HWPoison.

This reminds me: can hugetlb_raw_hwp_unreliable folios still be freed into buddy?
Should we handle them too?

> +	 */
> +	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
> +		/* raw_hwp_list becomes unreliable when kmalloc() fails. */
> +		return -ENOMEM;

If a folio has hugetlb_raw_hwp_unreliable set, hugetlb_update_hwpoison() returns
MF_HUGETLB_FOLIO_PRE_POISONED, so such a folio cannot reach here, e.g. via me_huge_page().
The only way such a folio can reach here is when it is hwpoisoned for the first time, so
hugetlb_update_hwpoison() returns 0 even though hugetlb_raw_hwp_unreliable is set at the same
time. In that case, could we simply dissolve the hugetlb folio and then take the sole hwpoisoned
@page off buddy? But this might not be a good idea, as it is really fragile...

Thanks.
.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-09  3:44   ` Miaohe Lin
@ 2026-06-12 18:34     ` Zi Yan
  2026-06-16  3:23       ` Jiaqi Yan
  2026-06-15  2:03     ` Jiaqi Yan
  1 sibling, 1 reply; 20+ messages in thread
From: Zi Yan @ 2026-06-12 18:34 UTC (permalink / raw)
  To: Miaohe Lin, Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, harry.yoo, willy

On 8 Jun 2026, at 23:44, Miaohe Lin wrote:

> On 2026/5/31 13:58, Jiaqi Yan wrote:
>> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>> becomes non-HugeTLB, and it is released to buddy allocator
>> as a high-order folio, e.g. a folio that contains 262144 pages
>> if the folio was a 1G HugeTLB hugepage.
>>
>> This is problematic if the HugeTLB hugepage contained HWPoison
>> subpages. In that case, since buddy allocator does not check
>> HWPoison for non-zero-order folio, the raw HWPoison page can
>> be given out with its buddy page and be re-used by either
>> kernel or userspace.
>>
>> Memory failure recovery (MFR) in kernel does attempt to take
>> raw HWPoison page off buddy allocator after
>> dissolve_free_hugetlb_folio(). However, there is always a time
>> window between dissolve_free_hugetlb_folio() frees a HWPoison
>> high-order folio to buddy allocator and MFR takes HWPoison
>> raw page off buddy allocator.
>>
>> Another similar situation is when a transparent huge page (THP)
>> runs into memory failure but splitting failed. Such THP will
>> eventually be released to buddy allocator when owning userspace
>> processes are gone, but with certain subpages having HWPoison.
>>
>> One obvious way to avoid both problems is to add page sanity
>> checks in page allocate or free path. However, it is against
>> the past efforts to reduce sanity check overhead [1,2,3].
>>
>> Introduce free_has_hwpoisoned() to only free the healthy pages
>> and to exclude the HWPoison ones in the high-order folio.
>> The idea is to iterate through the sub-pages of the folio to
>> identify contiguous ranges of healthy pages.
>>
>> free_has_hwpoisoned() is added in free_pages_prepare() as
>> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>> HWPoison page exists and after checks and preparations in
>> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>> can re-use free_prepared_contig_range() [4] to decompose healthy
>> ranges into the largest possible chunks of different orders.
>> Every chunk meets the requirements to be freed via free_one_page().
>>
>> free_has_hwpoisoned() has linear time complexity wrt the number
>> of pages in the folio. While the power-of-two decomposition
>> ensures that the number of calls to the buddy allocator is
>> logarithmic for each contiguous healthy range, the mandatory
>> linear scan of pages to identify PageHWPoison() defines the
>> overall time complexity. For a 1G hugepage having 8 HWPoison
>> pages, free_has_hwpoisoned() takes around 1ms on average on
>> a system having 56 Intel Skylake physical cores. This is
>> 15x to the case of freeing no HWPoison page. The cost is far
>> from triggering soft lockup, and fair for handling exceptional
>> hardware memory errors.
>>
>> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>>
>> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>
> Thanks for your update. This patch looks good to me while some comments below.
>
>> ---
>>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 85 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e47679e7a9db..03df929abca6 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>>  unsigned int pageblock_order __read_mostly;
>>  #endif
>>
>> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>>  static void __free_pages_ok(struct page *page, unsigned int order,
>>  			    fpi_t fpi_flags);
>>  static void reserve_highatomic_pageblock(struct page *page, int order,
>> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>>
>>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>>
>> +/*
>> + * Returns
>> + * - true: checks and preparations all good, caller can proceed freeing.
>> + * - false: do not proceed freeing for one of the following reasons:
>> + *   1. Some check failed so it is not safe to proceed freeing.
>> + *   2. A compound page has some HWPoison pages. The healthy pages
>> + *      are already safely freed, and the HWPoison ones isolated.
>> + */
>>  static __always_inline bool __free_pages_prepare(struct page *page,
>>  		unsigned int order, fpi_t fpi_flags)
>>  {
>> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>  	bool init = want_init_on_free();
>>  	bool compound = PageCompound(page);
>>  	struct folio *folio = page_folio(page);
>> +	/*
>> +	 * When dealing with compound page, PG_has_hwpoisoned is cleared
>> +	 * with PAGE_FLAGS_SECOND. So the check must be done first.
>> +	 *
>> +	 * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>> +	 * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>> +	 * confuse and complaint that the first tail page is still active.
>> +	 */
>> +	bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>>
>>  	if (fpi_flags & FPI_PREPARED)
>>  		return true;
>> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>
>>  	debug_pagealloc_unmap_pages(page, 1 << order);
>>
>> +	/*
>> +	 * After breaking down compound page and dealing with page metadata
>> +	 * (e.g. page owner and page alloc tags), take a shortcut if this
>> +	 * was a compound page containing certain HWPoison subpages.
>> +	 */
>> +	if (should_fhh) {
>> +		free_has_hwpoisoned(page, order);
>> +		return false;
>> +	}
>
> When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
> kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
> the hwpoisoned pages. Is it safe to do so?

At least kernel_poison_pages() writes to the page. The should_fhh check should
probably be moved up, to somewhere above kernel_poison_pages().

I do not like the shortcut method, since the pages are freed inside
__free_pages_prepare(), which is confusing. One alternative I can think
of is to make __free_pages_prepare() return an enum
{ FREE_PAGE_PREPARE_SUCCESS, FREE_PAGE_PREPARE_FAIL, FREE_PAGE_PREPARE_HAS_HWPOISON }
and handle the return value in the caller.
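
A rough userspace sketch of that shape, just to make the suggestion concrete. Only the three enum constant names come from the proposal above; the enum tag, struct mock_page, enum free_action, mock_prepare() and mock_free() are hypothetical and do not correspond to existing kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum free_page_prepare_result {
	FREE_PAGE_PREPARE_SUCCESS,
	FREE_PAGE_PREPARE_FAIL,
	FREE_PAGE_PREPARE_HAS_HWPOISON,
};

/* Hypothetical stand-in for the state a real struct page would carry. */
struct mock_page {
	bool bad;		/* some sanity check failed */
	bool has_hwpoisoned;	/* compound page with HWPoison subpages */
};

/* Hypothetical description of what the caller decides to do. */
enum free_action {
	ACTION_FREE_WHOLE,
	ACTION_SKIP,
	ACTION_FREE_HEALTHY_ONLY,
};

/* Prepare step: only reports the outcome, it frees nothing itself. */
static enum free_page_prepare_result mock_prepare(const struct mock_page *page)
{
	assert(page != NULL);
	if (page->bad)
		return FREE_PAGE_PREPARE_FAIL;
	if (page->has_hwpoisoned)
		return FREE_PAGE_PREPARE_HAS_HWPOISON;
	return FREE_PAGE_PREPARE_SUCCESS;
}

/* Caller: the decision to free only healthy pages is visible here. */
static enum free_action mock_free(const struct mock_page *page)
{
	switch (mock_prepare(page)) {
	case FREE_PAGE_PREPARE_SUCCESS:
		return ACTION_FREE_WHOLE;
	case FREE_PAGE_PREPARE_HAS_HWPOISON:
		return ACTION_FREE_HEALTHY_ONLY;
	case FREE_PAGE_PREPARE_FAIL:
	default:
		return ACTION_SKIP;
	}
}
```

This way the prepare step has no freeing side effect, and each caller explicitly chooses what to do in the has-HWPoison case.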


Best Regards,
Yan, Zi


* Re: [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page
  2026-06-09  7:21   ` Miaohe Lin
@ 2026-06-15  0:16     ` Jiaqi Yan
  2026-06-16  7:05       ` Miaohe Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2026-06-15  0:16 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On Tue, Jun 9, 2026 at 12:21 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/5/31 13:58, Jiaqi Yan wrote:
> > Now that HWPoison subpage(s) within HugeTLB page will be rejected by
> > buddy allocator during dissolve_free_hugetlb_folio(), there is no
> > need to drain_all_pages() and take_page_off_buddy() anymore. In fact,
> > calling take_page_off_buddy() after dissolve_free_hugetlb_folio()
> > succeeded returns false, making caller think __page_handle_poison()
> > failed.
> >
> > Add __hugepage_handle_poison() and replace __page_handle_poison() at
> > HugeTLB specific call sites. The being handled HugeTLB page either
> > is free at the moment of try_memory_failure_hugetlb(), or becomes
> > free at the moment of me_huge_page().
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >  mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++------
> >  1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 95979b7995c1..098c4407e818 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -163,6 +163,30 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
> >  static DEFINE_MUTEX(pfn_space_lock);
> >
> >  /*
> > + * Only for a HugeTLB page being handled by memory_failure(). The key
> > + * difference to soft_offline() is that, no HWPoison subpage will make
> > + * into buddy allocator after a successful dissolve_free_hugetlb_folio(),
> > + * so take_page_off_buddy() is unnecessary.
> > + */
> > +static int __hugepage_handle_poison(struct page *page)
> > +{
> > +     struct folio *folio = page_folio(page);
> > +
> > +     VM_WARN_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> > +
> > +     /*
> > +      * Can't use dissolve_free_hugetlb_folio() without a reliable
> > +      * raw_hwp_list telling which subpage is HWPoison.
>
> This reminds me that hugetlb_raw_hwp_unreliable folios can be freed into buddy yet?
> Should we handle them too?

I think we can (very likely already?) handle such
hugetlb_raw_hwp_unreliable folios by "leaking" them: neither freeing
them to the buddy allocator nor allocating them in
dequeue_hugetlb_folio_node_exact().

>
> > +      */
> > +     if (folio_test_hugetlb_raw_hwp_unreliable(folio))
> > +             /* raw_hwp_list becomes unreliable when kmalloc() fails. */
> > +             return -ENOMEM;
>
> If folios have hugetlb_raw_hwp_unreliable set, hugetlb_update_hwpoison will return
> MF_HUGETLB_FOLIO_PRE_POISONED thus these folios cannot reach here, e.g. me_huge_page.
> The only way these folios can reach here is that they are hwpoisoned first time so
> hugetlb_update_hwpoison returns 0 even if hugetlb_raw_hwp_unreliable is set at the same

For a hugetlb folio that is HWPoison-ed for the first time,
hugetlb_update_hwpoison() still returns rc=0 when kmalloc_obj() fails, so
get_huge_page_for_hwpoison() won't override ret with rc. IOW, ret will
be MF_HUGETLB_IN_USED or MF_HUGETLB_FREED. I believe
MF_HUGETLB_IN_USED can still get into me_huge_page(), while MF_HUGETLB_FREED
goes directly to __hugepage_handle_poison().

So my intent is to block both places from calling dissolve_free_hugetlb_folio().

> time. In that case, we can simply dissolve the hugetlb folios and then take the sole hwpoisoned
> @page off buddy? But this might not be a good idea as it is really fragile...

I think it is fragile too, and would prefer the leaking approach.
What are your thoughts?

>
> Thanks.
> .


* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-09  3:44   ` Miaohe Lin
  2026-06-12 18:34     ` Zi Yan
@ 2026-06-15  2:03     ` Jiaqi Yan
  1 sibling, 0 replies; 20+ messages in thread
From: Jiaqi Yan @ 2026-06-15  2:03 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On Mon, Jun 8, 2026 at 8:44 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/5/31 13:58, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> > becomes non-HugeTLB, and it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio(). However, there is always a time
> > window between dissolve_free_hugetlb_folio() frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > Another similar situation is when a transparent huge page (THP)
> > runs into memory failure but splitting failed. Such THP will
> > eventually be released to buddy allocator when owning userspace
> > processes are gone, but with certain subpages having HWPoison.
> >
> > One obvious way to avoid both problems is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoisoned() to only free the healthy pages
> > and to exclude the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages.
> >
> > free_has_hwpoisoned() is added in free_pages_prepare() as
> > a shortcut and is only invoked if PG_has_hwpoisoned indicates
> > HWPoison page exists and after checks and preparations in
> > free_pages_prepare() all succeeded. free_has_hwpoisoned() then
> > can re-use free_prepared_contig_range() [4] to decompose healthy
> > ranges into the largest possible chunks of different orders.
> > Every chunk meets the requirements to be freed via free_one_page().
> >
> > free_has_hwpoisoned() has linear time complexity wrt the number
> > of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison() defines the
> > overall time complexity. For a 1G hugepage having 8 HWPoison
> > pages, free_has_hwpoisoned() takes around 1ms on average on
> > a system having 56 Intel Skylake physical cores. This is
> > 15x to the case of freeing no HWPoison page. The cost is far
> > from triggering soft lockup, and fair for handling exceptional
> > hardware memory errors.
> >
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>
> Thanks for your update. This patch looks good to me while some comments below.

Thanks for your review Miaohe!

>
> > ---
> >  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 85 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e47679e7a9db..03df929abca6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> >  unsigned int pageblock_order __read_mostly;
> >  #endif
> >
> > +static void free_has_hwpoisoned(struct page *page, unsigned int order);
> >  static void __free_pages_ok(struct page *page, unsigned int order,
> >                           fpi_t fpi_flags);
> >  static void reserve_highatomic_pageblock(struct page *page, int order,
> > @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
> >
> >  #endif /* CONFIG_MEM_ALLOC_PROFILING */
> >
> > +/*
> > + * Returns
> > + * - true: checks and preparations all good, caller can proceed freeing.
> > + * - false: do not proceed freeing for one of the following reasons:
> > + *   1. Some check failed so it is not safe to proceed freeing.
> > + *   2. A compound page has some HWPoison pages. The healthy pages
> > + *      are already safely freed, and the HWPoison ones isolated.
> > + */
> >  static __always_inline bool __free_pages_prepare(struct page *page,
> >               unsigned int order, fpi_t fpi_flags)
> >  {
> > @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
> >       bool init = want_init_on_free();
> >       bool compound = PageCompound(page);
> >       struct folio *folio = page_folio(page);
> > +     /*
> > +      * When dealing with compound page, PG_has_hwpoisoned is cleared
> > +      * with PAGE_FLAGS_SECOND. So the check must be done first.
> > +      *
> > +      * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
> > +      * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
> > +      * confuse and complaint that the first tail page is still active.
> > +      */
> > +     bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
> >
> >       if (fpi_flags & FPI_PREPARED)
> >               return true;
> > @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
> >
> >       debug_pagealloc_unmap_pages(page, 1 << order);
> >
> > +     /*
> > +      * After breaking down compound page and dealing with page metadata
> > +      * (e.g. page owner and page alloc tags), take a shortcut if this
> > +      * was a compound page containing certain HWPoison subpages.
> > +      */
> > +     if (should_fhh) {
> > +             free_has_hwpoisoned(page, order);
> > +             return false;
> > +     }
>
> When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
> kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
> the hwpoisoned pages. Is it safe to do so?
>
> > +
> >       return true;
> >  }
> >
> > @@ -6936,6 +6964,63 @@ void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> >       __free_contig_range_common(pfn, nr_pages, /* is_frozen= */ false);
> >  }
> >
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones.
> > + *
> > + * Pages must have passed free_pages_prepare(). Even if having HWPoison
> > + * pages, breaking down compound page and updating metadata (e.g. page
> > + * owner, alloc tag) can be done together during free_pages_prepare(),
> > + * which simplifies the splitting here: unlike __split_unmapped_folio(),
> > + * there is no need to turn split pages into a compound page or to carry
> > + * metadata.
> > + *
> > + * It scans every raw page of the compound page and cause nontrivial overhead.
> > + * So only use this when the compound page contains HWPoison page(s).
> > + *
> > + * This implementation needs rework in memdesc world.
> > + */
> > +static void free_has_hwpoisoned(struct page *page, unsigned int order)
> > +{
> > +     unsigned long curr = page_to_pfn(page);
> > +     unsigned long end_pfn = curr + (1 << order);
> > +     unsigned long next;
> > +     unsigned long total_freed = 0;
> > +     unsigned long total_hwp = 0;
> > +
> > +     VM_WARN_ON(order == 0);
> > +     VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
> > +
> > +     while (curr < end_pfn) {
> > +             next = curr;
> > +
> > +             while (next < end_pfn && !PageHWPoison(pfn_to_page(next)))
> > +                     ++next;
> > +
> > +             if (next != end_pfn && PageHWPoison(pfn_to_page(next))) {
>
> Check next != end_pfn should be enough. If we have next != end_pfn, we must have PageHWPoison(pfn_to_page(next))
> or we can't exit from above while loop. Or am I miss something?

Yeah, checking next != end_pfn alone is simpler and sufficient.
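
For illustration, here is a minimal userspace sketch of the scan with
the simplified check. It is not the patch itself: the rest of the loop
body is not quoted in this message, so what happens after the check is
an assumption. pfn_is_poisoned(), struct healthy_range and
scan_healthy_ranges() are hypothetical names, and a plain bool array
stands in for PageHWPoison(pfn_to_page(pfn)).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PageHWPoison(pfn_to_page(pfn)). */
static bool pfn_is_poisoned(const bool *poisoned, unsigned long pfn)
{
	return poisoned[pfn];
}

/* One maximal run of healthy pfns (hypothetical helper type). */
struct healthy_range {
	unsigned long start;	/* first healthy pfn in the run */
	unsigned long nr;	/* number of healthy pfns in the run */
};

/*
 * Scan pfns [0, nr_pages) and record each maximal run of healthy pfns.
 * Returns the number of runs written to out (at most max_out).
 */
static size_t scan_healthy_ranges(const bool *poisoned, unsigned long nr_pages,
				  struct healthy_range *out, size_t max_out)
{
	unsigned long curr = 0, next;
	size_t n = 0;

	assert(poisoned && out);

	while (curr < nr_pages) {
		next = curr;

		while (next < nr_pages && !pfn_is_poisoned(poisoned, next))
			++next;

		/* [curr, next) is healthy; the patch would free it here. */
		if (next > curr && n < max_out) {
			out[n].start = curr;
			out[n].nr = next - curr;
			n++;
		}

		/*
		 * The inner loop only stops at nr_pages or at a poisoned
		 * pfn, so next != nr_pages alone implies pfn next is
		 * poisoned. Skip over it.
		 */
		if (next != nr_pages)
			++next;

		curr = next;
	}

	return n;
}
```

With pfns 2 and 6 poisoned in an 8-page range, this reports the healthy
runs [0, 2), [3, 6) and [7, 8).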

>
> Thanks.
> .


* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-12 18:34     ` Zi Yan
@ 2026-06-16  3:23       ` Jiaqi Yan
  2026-06-16  6:53         ` Miaohe Lin
  2026-06-17  1:56         ` Zi Yan
  0 siblings, 2 replies; 20+ messages in thread
From: Jiaqi Yan @ 2026-06-16  3:23 UTC (permalink / raw)
  To: Zi Yan, Miaohe Lin, harry.yoo
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, willy

On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>
> > On 2026/5/31 13:58, Jiaqi Yan wrote:
> >> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> >> becomes non-HugeTLB, and it is released to buddy allocator
> >> as a high-order folio, e.g. a folio that contains 262144 pages
> >> if the folio was a 1G HugeTLB hugepage.
> >>
> >> This is problematic if the HugeTLB hugepage contained HWPoison
> >> subpages. In that case, since buddy allocator does not check
> >> HWPoison for non-zero-order folio, the raw HWPoison page can
> >> be given out with its buddy page and be re-used by either
> >> kernel or userspace.
> >>
> >> Memory failure recovery (MFR) in kernel does attempt to take
> >> raw HWPoison page off buddy allocator after
> >> dissolve_free_hugetlb_folio(). However, there is always a time
> >> window between dissolve_free_hugetlb_folio() frees a HWPoison
> >> high-order folio to buddy allocator and MFR takes HWPoison
> >> raw page off buddy allocator.
> >>
> >> Another similar situation is when a transparent huge page (THP)
> >> runs into memory failure but splitting failed. Such THP will
> >> eventually be released to buddy allocator when owning userspace
> >> processes are gone, but with certain subpages having HWPoison.
> >>
> >> One obvious way to avoid both problems is to add page sanity
> >> checks in page allocate or free path. However, it is against
> >> the past efforts to reduce sanity check overhead [1,2,3].
> >>
> >> Introduce free_has_hwpoisoned() to only free the healthy pages
> >> and to exclude the HWPoison ones in the high-order folio.
> >> The idea is to iterate through the sub-pages of the folio to
> >> identify contiguous ranges of healthy pages.
> >>
> >> free_has_hwpoisoned() is added in free_pages_prepare() as
> >> a shortcut and is only invoked if PG_has_hwpoisoned indicates
> >> HWPoison page exists and after checks and preparations in
> >> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
> >> can re-use free_prepared_contig_range() [4] to decompose healthy
> >> ranges into the largest possible chunks of different orders.
> >> Every chunk meets the requirements to be freed via free_one_page().
> >>
> >> free_has_hwpoisoned() has linear time complexity wrt the number
> >> of pages in the folio. While the power-of-two decomposition
> >> ensures that the number of calls to the buddy allocator is
> >> logarithmic for each contiguous healthy range, the mandatory
> >> linear scan of pages to identify PageHWPoison() defines the
> >> overall time complexity. For a 1G hugepage having 8 HWPoison
> >> pages, free_has_hwpoisoned() takes around 1ms on average on
> >> a system having 56 Intel Skylake physical cores. This is
> >> 15x to the case of freeing no HWPoison page. The cost is far
> >> from triggering soft lockup, and fair for handling exceptional
> >> hardware memory errors.
> >>
> >> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> >> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> >> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
> >>
> >> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> >
> > Thanks for your update. This patch looks good to me while some comments below.
> >
> >> ---
> >>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 85 insertions(+)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index e47679e7a9db..03df929abca6 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> >>  unsigned int pageblock_order __read_mostly;
> >>  #endif
> >>
> >> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
> >>  static void __free_pages_ok(struct page *page, unsigned int order,
> >>                          fpi_t fpi_flags);
> >>  static void reserve_highatomic_pageblock(struct page *page, int order,
> >> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
> >>
> >>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
> >>
> >> +/*
> >> + * Returns
> >> + * - true: checks and preparations all good, caller can proceed freeing.
> >> + * - false: do not proceed freeing for one of the following reasons:
> >> + *   1. Some check failed so it is not safe to proceed freeing.
> >> + *   2. A compound page has some HWPoison pages. The healthy pages
> >> + *      are already safely freed, and the HWPoison ones isolated.
> >> + */
> >>  static __always_inline bool __free_pages_prepare(struct page *page,
> >>              unsigned int order, fpi_t fpi_flags)
> >>  {
> >> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
> >>      bool init = want_init_on_free();
> >>      bool compound = PageCompound(page);
> >>      struct folio *folio = page_folio(page);
> >> +    /*
> >> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
> >> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
> >> +     *
> >> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
> >> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
> >> +     * confuse and complaint that the first tail page is still active.
> >> +     */
> >> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
> >>
> >>      if (fpi_flags & FPI_PREPARED)
> >>              return true;
> >> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
> >>
> >>      debug_pagealloc_unmap_pages(page, 1 << order);
> >>
> >> +    /*
> >> +     * After breaking down compound page and dealing with page metadata
> >> +     * (e.g. page owner and page alloc tags), take a shortcut if this
> >> +     * was a compound page containing certain HWPoison subpages.
> >> +     */
> >> +    if (should_fhh) {
> >> +            free_has_hwpoisoned(page, order);
> >> +            return false;
> >> +    }
> >
> > When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
> > kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
> > the hwpoisoned pages. Is it safe to do so?
>
> At least, kernel_poison_pages() writes to the page. It probably should be
> moved up, somewhere like above kernel_poison_pages().

Writing to HWPoison pages (locations having a memory error) is usually
safe, in that it doesn't cause a machine check exception. The memory
controller usually just fails the write op and waits for the next read
to raise the MCE / exception to prevent silent data corruption.

>
> I do not like the shortcut method, since the pages are freed in
> __free_pages_prepare(). This causes confusion. One alternative I can think

What exactly is the confusion? Or why does freeing have to be done by
__free_pages_prepare()'s caller?

Harry suggested the shortcut method [1]. Although freeing inline might
surprise the caller, it simplifies things for all callers.

[1] https://lore.kernel.org/linux-mm/aVy7L-3pc4JUYBEn@hyeyoo

> of is to make __free_pages_prepare() returns a enum
> { FREE_PAGE_PREPARE_SUCCESS, FREE_PAGE_PREPARE_FAIL, FREE_PAGE_PREPARE_HAS_HWPOISON}
> and handle the return value in the caller.

Are you worried that a caller of __free_pages_prepare() may see false
returned in the has_hwpoison case, but mistakenly propagate some kind
of error, or do something under the assumption that the folio was not
freed? In that case the three enum values could be useful. But I
checked the current callers of __free_pages_prepare() and they don't
have the above problem.
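
To make the alternative concrete, here is a rough userspace sketch of
how a caller could dispatch on a three-way return. Only the three
FREE_PAGE_PREPARE_* names come from the suggestion above; struct
fake_folio, enum free_action, fake_prepare() and fake_free() are
hypothetical, and the order of the checks is an assumption.

```c
#include <assert.h>	/* only needed for the usage checks */
#include <stdbool.h>

/* Return values proposed in the review. */
enum free_page_prepare {
	FREE_PAGE_PREPARE_SUCCESS,
	FREE_PAGE_PREPARE_FAIL,
	FREE_PAGE_PREPARE_HAS_HWPOISON,
};

/* Hypothetical stand-in for the state the prepare step inspects. */
struct fake_folio {
	bool check_failed;	/* some sanity check failed */
	bool has_hwpoisoned;	/* at least one subpage is HWPoison */
};

/* What the caller ends up doing with the folio (hypothetical). */
enum free_action {
	ACTION_FREE_WHOLE,
	ACTION_FREE_HEALTHY_ONLY,
	ACTION_LEAK,
};

/* Prepare step that only classifies and never frees anything itself. */
static enum free_page_prepare fake_prepare(const struct fake_folio *f)
{
	if (f->check_failed)
		return FREE_PAGE_PREPARE_FAIL;
	if (f->has_hwpoisoned)
		return FREE_PAGE_PREPARE_HAS_HWPOISON;
	return FREE_PAGE_PREPARE_SUCCESS;
}

/* Caller-side dispatch: the freeing decision is made here. */
static enum free_action fake_free(const struct fake_folio *f)
{
	switch (fake_prepare(f)) {
	case FREE_PAGE_PREPARE_SUCCESS:
		return ACTION_FREE_WHOLE;
	case FREE_PAGE_PREPARE_HAS_HWPOISON:
		return ACTION_FREE_HEALTHY_ONLY;
	case FREE_PAGE_PREPARE_FAIL:
	default:
		return ACTION_LEAK;
	}
}
```

The trade-off is that every caller would need the extra case, whereas
with the shortcut only __free_pages_prepare() knows about it.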

>
>
> Best Regards,
> Yan, Zi


* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-16  3:23       ` Jiaqi Yan
@ 2026-06-16  6:53         ` Miaohe Lin
  2026-06-18 15:02           ` Vlastimil Babka (SUSE)
  2026-06-17  1:56         ` Zi Yan
  1 sibling, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2026-06-16  6:53 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, willy, Zi Yan, harry.yoo

On 2026/6/16 11:23, Jiaqi Yan wrote:
> On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>>
>>> On 2026/5/31 13:58, Jiaqi Yan wrote:
>>>> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>>>> becomes non-HugeTLB, and it is released to buddy allocator
>>>> as a high-order folio, e.g. a folio that contains 262144 pages
>>>> if the folio was a 1G HugeTLB hugepage.
>>>>
>>>> This is problematic if the HugeTLB hugepage contained HWPoison
>>>> subpages. In that case, since buddy allocator does not check
>>>> HWPoison for non-zero-order folio, the raw HWPoison page can
>>>> be given out with its buddy page and be re-used by either
>>>> kernel or userspace.
>>>>
>>>> Memory failure recovery (MFR) in kernel does attempt to take
>>>> raw HWPoison page off buddy allocator after
>>>> dissolve_free_hugetlb_folio(). However, there is always a time
>>>> window between dissolve_free_hugetlb_folio() frees a HWPoison
>>>> high-order folio to buddy allocator and MFR takes HWPoison
>>>> raw page off buddy allocator.
>>>>
>>>> Another similar situation is when a transparent huge page (THP)
>>>> runs into memory failure but splitting failed. Such THP will
>>>> eventually be released to buddy allocator when owning userspace
>>>> processes are gone, but with certain subpages having HWPoison.
>>>>
>>>> One obvious way to avoid both problems is to add page sanity
>>>> checks in page allocate or free path. However, it is against
>>>> the past efforts to reduce sanity check overhead [1,2,3].
>>>>
>>>> Introduce free_has_hwpoisoned() to only free the healthy pages
>>>> and to exclude the HWPoison ones in the high-order folio.
>>>> The idea is to iterate through the sub-pages of the folio to
>>>> identify contiguous ranges of healthy pages.
>>>>
>>>> free_has_hwpoisoned() is added in free_pages_prepare() as
>>>> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>>>> HWPoison page exists and after checks and preparations in
>>>> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>>>> can re-use free_prepared_contig_range() [4] to decompose healthy
>>>> ranges into the largest possible chunks of different orders.
>>>> Every chunk meets the requirements to be freed via free_one_page().
>>>>
>>>> free_has_hwpoisoned() has linear time complexity wrt the number
>>>> of pages in the folio. While the power-of-two decomposition
>>>> ensures that the number of calls to the buddy allocator is
>>>> logarithmic for each contiguous healthy range, the mandatory
>>>> linear scan of pages to identify PageHWPoison() defines the
>>>> overall time complexity. For a 1G hugepage having 8 HWPoison
>>>> pages, free_has_hwpoisoned() takes around 1ms on average on
>>>> a system having 56 Intel Skylake physical cores. This is
>>>> 15x to the case of freeing no HWPoison page. The cost is far
>>>> from triggering soft lockup, and fair for handling exceptional
>>>> hardware memory errors.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>>>> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>>>> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>>> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>>>>
>>>> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>>
>>> Thanks for your update. This patch looks good to me while some comments below.
>>>
>>>> ---
>>>>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 85 insertions(+)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index e47679e7a9db..03df929abca6 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>>>>  unsigned int pageblock_order __read_mostly;
>>>>  #endif
>>>>
>>>> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>>>>  static void __free_pages_ok(struct page *page, unsigned int order,
>>>>                          fpi_t fpi_flags);
>>>>  static void reserve_highatomic_pageblock(struct page *page, int order,
>>>> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>>>>
>>>>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>>>>
>>>> +/*
>>>> + * Returns
>>>> + * - true: checks and preparations all good, caller can proceed freeing.
>>>> + * - false: do not proceed freeing for one of the following reasons:
>>>> + *   1. Some check failed so it is not safe to proceed freeing.
>>>> + *   2. A compound page has some HWPoison pages. The healthy pages
>>>> + *      are already safely freed, and the HWPoison ones isolated.
>>>> + */
>>>>  static __always_inline bool __free_pages_prepare(struct page *page,
>>>>              unsigned int order, fpi_t fpi_flags)
>>>>  {
>>>> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>>      bool init = want_init_on_free();
>>>>      bool compound = PageCompound(page);
>>>>      struct folio *folio = page_folio(page);
>>>> +    /*
>>>> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
>>>> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
>>>> +     *
>>>> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>>>> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>>>> +     * confuse and complaint that the first tail page is still active.
>>>> +     */
>>>> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>>>>
>>>>      if (fpi_flags & FPI_PREPARED)
>>>>              return true;
>>>> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>>
>>>>      debug_pagealloc_unmap_pages(page, 1 << order);
>>>>
>>>> +    /*
>>>> +     * After breaking down compound page and dealing with page metadata
>>>> +     * (e.g. page owner and page alloc tags), take a shortcut if this
>>>> +     * was a compound page containing certain HWPoison subpages.
>>>> +     */
>>>> +    if (should_fhh) {
>>>> +            free_has_hwpoisoned(page, order);
>>>> +            return false;
>>>> +    }
>>>
>>> When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
>>> kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
>>> the hwpoisoned pages. Is it safe to do so?
>>
>> At least, kernel_poison_pages() writes to the page. It probably should be
>> moved up, somewhere like above kernel_poison_pages().
> 
> Writing to HWPoison pages (location having memory error) is usually
> safe, as in it doesn't cause a machine check exception. Memory

On x86, set_mce_nospec() is called for hwpoisoned pages. So would
writing to HWPoison pages cause an unexpected page fault in the kernel?

Thanks.
.


* Re: [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page
  2026-06-15  0:16     ` Jiaqi Yan
@ 2026-06-16  7:05       ` Miaohe Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2026-06-16  7:05 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On 2026/6/15 8:16, Jiaqi Yan wrote:
> On Tue, Jun 9, 2026 at 12:21 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2026/5/31 13:58, Jiaqi Yan wrote:
>>> Now that HWPoison subpage(s) within HugeTLB page will be rejected by
>>> buddy allocator during dissolve_free_hugetlb_folio(), there is no
>>> need to drain_all_pages() and take_page_off_buddy() anymore. In fact,
>>> calling take_page_off_buddy() after dissolve_free_hugetlb_folio()
>>> succeeded returns false, making caller think __page_handle_poison()
>>> failed.
>>>
>>> Add __hugepage_handle_poison() and replace __page_handle_poison() at
>>> HugeTLB specific call sites. The being handled HugeTLB page either
>>> is free at the moment of try_memory_failure_hugetlb(), or becomes
>>> free at the moment of me_huge_page().
>>>
>>> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>> ---
>>>  mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++------
>>>  1 file changed, 30 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 95979b7995c1..098c4407e818 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -163,6 +163,30 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
>>>  static DEFINE_MUTEX(pfn_space_lock);
>>>
>>>  /*
>>> + * Only for a HugeTLB page being handled by memory_failure(). The key
>>> + * difference to soft_offline() is that, no HWPoison subpage will make
>>> + * into buddy allocator after a successful dissolve_free_hugetlb_folio(),
>>> + * so take_page_off_buddy() is unnecessary.
>>> + */
>>> +static int __hugepage_handle_poison(struct page *page)
>>> +{
>>> +     struct folio *folio = page_folio(page);
>>> +
>>> +     VM_WARN_ON_FOLIO(!folio_test_hwpoison(folio), folio);
>>> +
>>> +     /*
>>> +      * Can't use dissolve_free_hugetlb_folio() without a reliable
>>> +      * raw_hwp_list telling which subpage is HWPoison.
>>
>> This reminds me that hugetlb_raw_hwp_unreliable folios can be freed into buddy yet?
>> Should we handle them too?
> 
> I think we can (very likely already?) handle such
> hugetlb_raw_hwp_unreliable folios by "leaking" them: neither freeing
> them to the buddy allocator nor allocating them in
> dequeue_hugetlb_folio_node_exact().
> 
>>
>>> +      */
>>> +     if (folio_test_hugetlb_raw_hwp_unreliable(folio))
>>> +             /* raw_hwp_list becomes unreliable when kmalloc() fails. */
>>> +             return -ENOMEM;
>>
>> If folios have hugetlb_raw_hwp_unreliable set, hugetlb_update_hwpoison will return
>> MF_HUGETLB_FOLIO_PRE_POISONED thus these folios cannot reach here, e.g. me_huge_page.
>> The only way these folios can reach here is that they are hwpoisoned first time so
>> hugetlb_update_hwpoison returns 0 even if hugetlb_raw_hwp_unreliable is set at the same
> 
> For first time HWPoison-ed hugetlb folio, hugetlb_update_hwpoison()
> still returns rc=0 when kmalloc_obj() fails. So
> get_huge_page_for_hwpoison() won't override ret with rc. IOW, ret will
> be MF_HUGETLB_IN_USED or MF_HUGETLB_FREED. I believe
> MF_HUGETLB_IN_USED can still get into me_huge_page(). MF_HUGETLB_FREED
> goes directly to __hugepage_handle_poison().
> 
> So my intent is to block both places from dissolve_free_hugetlb_folio().
> 
>> time. In that case, we can simply dissolve the hugetlb folios and then take the sole hwpoisoned
>> @page off buddy? But this might not be a good idea as it is really fragile...
> 
> I think it is fragile too, and would prefer the leaking approach.
> What's your thoughts?

Failing to allocate memory for struct raw_hwp_page is very rare, because
it is only a few tens of bytes. So I think the leaking approach might be
acceptable.

Thanks.
.


* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-16  3:23       ` Jiaqi Yan
  2026-06-16  6:53         ` Miaohe Lin
@ 2026-06-17  1:56         ` Zi Yan
  2026-06-18 14:52           ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 20+ messages in thread
From: Zi Yan @ 2026-06-17  1:56 UTC (permalink / raw)
  To: Jiaqi Yan, Zi Yan, Miaohe Lin, harry.yoo
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, willy

On Mon Jun 15, 2026 at 11:23 PM EDT, Jiaqi Yan wrote:
> On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>>
>> > On 2026/5/31 13:58, Jiaqi Yan wrote:
>> >> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>> >> becomes non-HugeTLB, and it is released to buddy allocator
>> >> as a high-order folio, e.g. a folio that contains 262144 pages
>> >> if the folio was a 1G HugeTLB hugepage.
>> >>
>> >> This is problematic if the HugeTLB hugepage contained HWPoison
>> >> subpages. In that case, since buddy allocator does not check
>> >> HWPoison for non-zero-order folio, the raw HWPoison page can
>> >> be given out with its buddy page and be re-used by either
>> >> kernel or userspace.
>> >>
>> >> Memory failure recovery (MFR) in kernel does attempt to take
>> >> raw HWPoison page off buddy allocator after
>> >> dissolve_free_hugetlb_folio(). However, there is always a time
>> >> window between dissolve_free_hugetlb_folio() frees a HWPoison
>> >> high-order folio to buddy allocator and MFR takes HWPoison
>> >> raw page off buddy allocator.
>> >>
>> >> Another similar situation is when a transparent huge page (THP)
>> >> runs into memory failure but splitting failed. Such THP will
>> >> eventually be released to buddy allocator when owning userspace
>> >> processes are gone, but with certain subpages having HWPoison.
>> >>
>> >> One obvious way to avoid both problems is to add page sanity
>> >> checks in page allocate or free path. However, it is against
>> >> the past efforts to reduce sanity check overhead [1,2,3].
>> >>
>> >> Introduce free_has_hwpoisoned() to only free the healthy pages
>> >> and to exclude the HWPoison ones in the high-order folio.
>> >> The idea is to iterate through the sub-pages of the folio to
>> >> identify contiguous ranges of healthy pages.
>> >>
>> >> free_has_hwpoisoned() is added in free_pages_prepare() as
>> >> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>> >> HWPoison page exists and after checks and preparations in
>> >> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>> >> can re-use free_prepared_contig_range() [4] to decompose healthy
>> >> ranges into the largest possible chunks of different orders.
>> >> Every chunk meets the requirements to be freed via free_one_page().
>> >>
>> >> free_has_hwpoisoned() has linear time complexity wrt the number
>> >> of pages in the folio. While the power-of-two decomposition
>> >> ensures that the number of calls to the buddy allocator is
>> >> logarithmic for each contiguous healthy range, the mandatory
>> >> linear scan of pages to identify PageHWPoison() defines the
>> >> overall time complexity. For a 1G hugepage having 8 HWPoison
>> >> pages, free_has_hwpoisoned() takes around 1ms on average on
>> >> a system having 56 Intel Skylake physical cores. This is
>> >> 15x to the case of freeing no HWPoison page. The cost is far
>> >> from triggering soft lockup, and fair for handling exceptional
>> >> hardware memory errors.
>> >>
>> >> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>> >> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>> >> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>> >> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>> >>
>> >> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>> >
>> > Thanks for your update. This patch looks good to me while some comments below.
>> >
>> >> ---
>> >>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>> >>  1 file changed, 85 insertions(+)
>> >>
>> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> >> index e47679e7a9db..03df929abca6 100644
>> >> --- a/mm/page_alloc.c
>> >> +++ b/mm/page_alloc.c
>> >> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>> >>  unsigned int pageblock_order __read_mostly;
>> >>  #endif
>> >>
>> >> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>> >>  static void __free_pages_ok(struct page *page, unsigned int order,
>> >>                          fpi_t fpi_flags);
>> >>  static void reserve_highatomic_pageblock(struct page *page, int order,
>> >> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>> >>
>> >>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>> >>
>> >> +/*
>> >> + * Returns
>> >> + * - true: checks and preparations all good, caller can proceed freeing.
>> >> + * - false: do not proceed freeing for one of the following reasons:
>> >> + *   1. Some check failed so it is not safe to proceed freeing.
>> >> + *   2. A compound page has some HWPoison pages. The healthy pages
>> >> + *      are already safely freed, and the HWPoison ones isolated.
>> >> + */
>> >>  static __always_inline bool __free_pages_prepare(struct page *page,
>> >>              unsigned int order, fpi_t fpi_flags)
>> >>  {
>> >> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>> >>      bool init = want_init_on_free();
>> >>      bool compound = PageCompound(page);
>> >>      struct folio *folio = page_folio(page);
>> >> +    /*
>> >> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
>> >> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
>> >> +     *
>> >> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>> >> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>> >> +     * confuse and complaint that the first tail page is still active.
>> >> +     */
>> >> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>> >>
>> >>      if (fpi_flags & FPI_PREPARED)
>> >>              return true;
>> >> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>> >>
>> >>      debug_pagealloc_unmap_pages(page, 1 << order);
>> >>
>> >> +    /*
>> >> +     * After breaking down compound page and dealing with page metadata
>> >> +     * (e.g. page owner and page alloc tags), take a shortcut if this
>> >> +     * was a compound page containing certain HWPoison subpages.
>> >> +     */
>> >> +    if (should_fhh) {
>> >> +            free_has_hwpoisoned(page, order);
>> >> +            return false;
>> >> +    }
>> >
>> > When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
>> > kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
>> > the hwpoisoned pages. Is it safe to do so?
>>
>> At least, kernel_poison_pages() writes to the page. It probably should be
>> moved up, somewhere like above kernel_poison_pages().
>
> Writing to HWPoison pages (location having memory error) is usually
> safe, as in it doesn't cause a machine check exception. Memory
> controller usually just fails the write op and waits for the next read
> to raise the MCE / exception for prevent silent data corruption.
>
>>
>> I do not like the shortcut method, since the pages are freed in
>> __free_pages_prepare(). This causes confusion. One alternative I can think
>
> What exactly is the confusion? or why does freeing has to be done by
> __free_pages_prepare()'s caller?
>
> Harry suggested the shortcut method [1]. Although freeing inline might
> surprise the caller, it simplifies things for all callers.
>
> [1] https://lore.kernel.org/linux-mm/aVy7L-3pc4JUYBEn@hyeyoo

The function name is __free_pages_prepare(), but the code ends up
freeing the pages if the folio has HWPoison.

>
>> of is to make __free_pages_prepare() returns a enum
>> { FREE_PAGE_PREPARE_SUCCESS, FREE_PAGE_PREPARE_FAIL, FREE_PAGE_PREPARE_HAS_HWPOISON}
>> and handle the return value in the caller.
>
> Are you worried that a caller of __free_pages_prepare() may see false
> returned in the has_hwpoison case, but mistakenly propagate some kind
> of error, or doing something under the assumption that folio not
> freed? In this case the three enums can be useful. But I checked
> current callers of __free_pages_prepare() and they don't have the
> above problem.

Right. Looking at the compaction_free() code, if dst gets has_hwpoison (not
possible now, but it could happen if migration code later decides to mark
folios when a copy fails with EHWPOISON), the next list_add() is going to
cause trouble. Or could you rename the function to
__free_pages_prepare_and_free_has_hwpoison()? At least the caller would then
know about the potential side effect.
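
For illustration only, the three-way return value quoted above could be
handled by a caller roughly as in the userspace sketch below. The enum
values are the ones proposed in this thread; struct fake_folio,
prepare_sim() and caller_sim() are hypothetical stand-ins and not kernel
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Three-way result proposed in the thread for __free_pages_prepare(). */
enum free_page_prepare_result {
	FREE_PAGE_PREPARE_SUCCESS,
	FREE_PAGE_PREPARE_FAIL,
	FREE_PAGE_PREPARE_HAS_HWPOISON,
};

/* Hypothetical stand-in for the folio state that matters here. */
struct fake_folio {
	bool bad_state;       /* some sanity check would fail */
	bool has_hwpoisoned;  /* models PG_has_hwpoisoned */
	bool on_free_list;    /* set by the caller when it queues the folio */
};

/* Models the prepare step: report why the caller must not go on. */
static enum free_page_prepare_result prepare_sim(const struct fake_folio *f)
{
	assert(f != NULL);
	if (f->bad_state)
		return FREE_PAGE_PREPARE_FAIL;
	if (f->has_hwpoisoned)
		return FREE_PAGE_PREPARE_HAS_HWPOISON; /* healthy pages already freed */
	return FREE_PAGE_PREPARE_SUCCESS;
}

/* A caller that queues the folio only when prepare fully succeeded. */
static bool caller_sim(struct fake_folio *f)
{
	switch (prepare_sim(f)) {
	case FREE_PAGE_PREPARE_SUCCESS:
		f->on_free_list = true;
		return true;
	case FREE_PAGE_PREPARE_HAS_HWPOISON:
	case FREE_PAGE_PREPARE_FAIL:
	default:
		return false;
	}
}
```

In this sketch both non-success values lead to the same action, so the
enum only adds information a caller may or may not need.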

-- 
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test
  2026-05-31  5:58 ` [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test Jiaqi Yan
  2026-06-01 18:04   ` Jiaqi Yan
@ 2026-06-17  7:38   ` Miaohe Lin
  1 sibling, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2026-06-17  7:38 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, ziy, harry.yoo, willy

On 2026/5/31 13:58, Jiaqi Yan wrote:
> Add a new testcase to validate memory failure recovery for HWPoison
> anonymous 1G HugeTLB page, including proper SIGBUS delivery,
> releasing a 1G HugeTLB page containing one HWPoison page to buddy
> allocator, and isolation of the raw HWPoison page.
> 
> Although can be added in future, this patch does not support testing
> the MADV_SOFT variant.
> 

Thanks for your patch. Some comments below.

> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  tools/testing/selftests/mm/memory-failure.c | 73 +++++++++++++++++++--
>  1 file changed, 68 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/testing/selftests/mm/memory-failure.c b/tools/testing/selftests/mm/memory-failure.c
> index 032ed952057c..ea43b2877c81 100644
> --- a/tools/testing/selftests/mm/memory-failure.c
> +++ b/tools/testing/selftests/mm/memory-failure.c
> @@ -18,6 +18,7 @@
>  #include <linux/magic.h>
>  #include <errno.h>
>  
> +#include "hugepage_settings.h"
>  #include "vm_util.h"
>  
>  enum inject_type {
> @@ -27,6 +28,7 @@ enum inject_type {
>  
>  enum result_type {
>  	MADV_HARD_ANON,
> +	MADV_HARD_ANON_HUGETLB,
>  	MADV_HARD_CLEAN_PAGECACHE,
>  	MADV_HARD_DIRTY_PAGECACHE,
>  	MADV_SOFT_ANON,
> @@ -47,6 +49,8 @@ FIXTURE(memory_failure)
>  	int pagemap_fd;
>  	int kpageflags_fd;
>  	bool triggered;
> +	/* Number of initial HugeTLB pages with default page size. */
> +	unsigned long nr_hugetlb_pages;
>  };
>  
>  FIXTURE_VARIANT(memory_failure)
> @@ -157,11 +161,11 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
>  		  void *vaddr, enum result_type type, int setjmp)
>  {
>  	unsigned long size;
> +	unsigned long nr_hugetlb_pages;
>  	uint64_t pfn_flags;
>  
>  	switch (type) {
>  	case MADV_SOFT_ANON:
> -	case MADV_HARD_CLEAN_PAGECACHE:
>  	case MADV_SOFT_CLEAN_PAGECACHE:
>  	case MADV_SOFT_DIRTY_PAGECACHE:
>  		/* It is not expected to receive a SIGBUS signal. */
> @@ -174,6 +178,7 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
>  		ASSERT_NE(pagemap_get_pfn(self->pagemap_fd, vaddr), self->pfn);
>  		break;
>  	case MADV_HARD_ANON:
> +	case MADV_HARD_ANON_HUGETLB:
>  	case MADV_HARD_DIRTY_PAGECACHE:
>  		/* The SIGBUS signal should have been received. */
>  		ASSERT_EQ(setjmp, 1);
> @@ -183,17 +188,36 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
>  		ASSERT_EQ(siginfo.si_code, BUS_MCEERR_AR);
>  		ASSERT_EQ(1UL << siginfo.si_addr_lsb, self->page_size);
>  		ASSERT_EQ(siginfo.si_addr, vaddr);
> -
> -		/* XXX Check backing pte is hwpoison entry when supported. */
> -		ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));

Can we write this as something like:

		if (type != MADV_HARD_ANON_HUGETLB) {
			/*
			 * Check backing pte is hwpoison entry when supported.
			 * Although try_to_unmap_one() also installs hwpoison entry
			 * for HugeTLB, pagemap_hugetlb_range() doesn't parse
			 * swap entries at all.
			 */
			ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));
		}

So we don't need to modify the if condition when more base page size tests are added?

>  		break;
>  	default:
>  		SKIP(return, "unexpected inject type %d.\n", type);
>  	}
>  
> +	if (type == MADV_HARD_ANON || type == MADV_HARD_DIRTY_PAGECACHE) {
> +		/*
> +		 * Check backing pte is hwpoison entry when supported.
> +		 * Although try_to_unmap_one() also installs hwpoison entry
> +		 * for HugeTLB, pagemap_hugetlb_range() doesn't parse
> +		 * swap entries at all.
> +		 */
> +		ASSERT_TRUE(pagemap_is_swapped(self->pagemap_fd, vaddr));
> +	}
> +
>  	/* Check if the value of HardwareCorrupted has increased. */
>  	ASSERT_EQ(get_hardware_corrupted_size(&size), 0);
> -	ASSERT_EQ(size, self->corrupted_size + self->page_size / 1024);
> +
> +	if (type == MADV_HARD_ANON_HUGETLB) {
> +		/*
> +		 * Only one page is hardware corrupted; the rest should all be
> +		 * released to buddy allocator.
> +		 */
> +		ASSERT_EQ(size, self->corrupted_size + getpagesize() / 1024);
> +		/* HugeTLB should have lost the HWPoison HugeTLB page. */
> +		nr_hugetlb_pages = hugetlb_nr_default_pages();
> +		ASSERT_EQ(nr_hugetlb_pages + 1, self->nr_hugetlb_pages);
> +	} else {
> +		ASSERT_EQ(size, self->corrupted_size + self->page_size / 1024);
> +	}
>  
>  	/* Check if HWPoison flag is set. */
>  	ASSERT_EQ(pageflags_get(self->pfn, self->kpageflags_fd, &pfn_flags), 0);
> @@ -247,6 +271,45 @@ TEST_F(memory_failure, anon)
>  	ASSERT_EQ(munmap(addr, self->page_size), 0);
>  }
>  
> +TEST_F(memory_failure, anon_hugetlb)
> +{
> +	char *addr;
> +	int ret;
> +	const unsigned long nr_alloc_hugetlb_pages = 4;
> +	unsigned long alloc_size;
> +
> +	if (variant->type == MADV_SOFT)
> +		SKIP(return, "Soft offline test is not implemented");
> +
> +	/* HugeTLB settings will be automatically restored when test exits. */
> +	hugetlb_setup_default(nr_alloc_hugetlb_pages);
> +
> +	alloc_size = default_huge_page_size() * nr_alloc_hugetlb_pages;
> +	self->page_size = default_huge_page_size();
> +	self->nr_hugetlb_pages = hugetlb_nr_default_pages();
> +
> +	addr = mmap(0, alloc_size, PROT_READ | PROT_WRITE,
> +		    MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
> +	if (addr == MAP_FAILED)
> +		SKIP(return, "mmap failed, not enough memory or 1G hugetlb not supported.\n");

This test is no longer 1G specific, so "1G" should be removed too?

> +	memset(addr, 0xce, alloc_size);

For 1G hugetlb, alloc_size will be 4G. Would memset() here take a long time to finish?
The same issue exists in check_memory(). Should we try to improve this, or is it acceptable as is?
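
One possible way to cut the cost, sketched here under assumptions, is to
write and verify a single byte per page instead of filling the whole
mapping. touch_every() and check_every() are hypothetical helpers that do
not exist in the selftests; whether sparse touching is sufficient depends
on what check_memory() needs to verify, which is not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write one byte at the start of every stride-sized chunk of the buffer.
 * Returns the number of chunks touched.
 */
static size_t touch_every(char *addr, size_t len, size_t stride, char value)
{
	size_t touched = 0;

	assert(addr != NULL);
	if (stride == 0)
		return 0;
	for (size_t off = 0; off < len; off += stride) {
		addr[off] = value;
		touched++;
	}
	return touched;
}

/* Verify the same bytes; returns 1 if all match, 0 otherwise. */
static int check_every(const char *addr, size_t len, size_t stride, char value)
{
	assert(addr != NULL);
	if (stride == 0)
		return 0;
	for (size_t off = 0; off < len; off += stride)
		if (addr[off] != value)
			return 0;
	return 1;
}
```

With a stride equal to the base page size, a 4G mapping would need about
one million single-byte writes rather than 4G of stores.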

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-17  1:56         ` Zi Yan
@ 2026-06-18 14:52           ` Vlastimil Babka (SUSE)
  2026-06-18 16:04             ` Zi Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-18 14:52 UTC (permalink / raw)
  To: Zi Yan, Jiaqi Yan, Miaohe Lin, harry.yoo
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, willy

On 6/17/26 03:56, Zi Yan wrote:
> On Mon Jun 15, 2026 at 11:23 PM EDT, Jiaqi Yan wrote:
>> On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>>>
>>> > On 2026/5/31 13:58, Jiaqi Yan wrote:
>>> >> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>>> >> becomes non-HugeTLB, and it is released to buddy allocator
>>> >> as a high-order folio, e.g. a folio that contains 262144 pages
>>> >> if the folio was a 1G HugeTLB hugepage.
>>> >>
>>> >> This is problematic if the HugeTLB hugepage contained HWPoison
>>> >> subpages. In that case, since buddy allocator does not check
>>> >> HWPoison for non-zero-order folio, the raw HWPoison page can
>>> >> be given out with its buddy page and be re-used by either
>>> >> kernel or userspace.
>>> >>
>>> >> Memory failure recovery (MFR) in kernel does attempt to take
>>> >> raw HWPoison page off buddy allocator after
>>> >> dissolve_free_hugetlb_folio(). However, there is always a time
>>> >> window between dissolve_free_hugetlb_folio() frees a HWPoison
>>> >> high-order folio to buddy allocator and MFR takes HWPoison
>>> >> raw page off buddy allocator.
>>> >>
>>> >> Another similar situation is when a transparent huge page (THP)
>>> >> runs into memory failure but splitting failed. Such THP will
>>> >> eventually be released to buddy allocator when owning userspace
>>> >> processes are gone, but with certain subpages having HWPoison.
>>> >>
>>> >> One obvious way to avoid both problems is to add page sanity
>>> >> checks in page allocate or free path. However, it is against
>>> >> the past efforts to reduce sanity check overhead [1,2,3].
>>> >>
>>> >> Introduce free_has_hwpoisoned() to only free the healthy pages
>>> >> and to exclude the HWPoison ones in the high-order folio.
>>> >> The idea is to iterate through the sub-pages of the folio to
>>> >> identify contiguous ranges of healthy pages.
>>> >>
>>> >> free_has_hwpoisoned() is added in free_pages_prepare() as
>>> >> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>>> >> HWPoison page exists and after checks and preparations in
>>> >> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>>> >> can re-use free_prepared_contig_range() [4] to decompose healthy
>>> >> ranges into the largest possible chunks of different orders.
>>> >> Every chunk meets the requirements to be freed via free_one_page().
>>> >>
>>> >> free_has_hwpoisoned() has linear time complexity wrt the number
>>> >> of pages in the folio. While the power-of-two decomposition
>>> >> ensures that the number of calls to the buddy allocator is
>>> >> logarithmic for each contiguous healthy range, the mandatory
>>> >> linear scan of pages to identify PageHWPoison() defines the
>>> >> overall time complexity. For a 1G hugepage having 8 HWPoison
>>> >> pages, free_has_hwpoisoned() takes around 1ms on average on
>>> >> a system having 56 Intel Skylake physical cores. This is
>>> >> 15x to the case of freeing no HWPoison page. The cost is far
>>> >> from triggering soft lockup, and fair for handling exceptional
>>> >> hardware memory errors.
>>> >>
>>> >> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>>> >> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>>> >> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>> >> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>>> >>
>>> >> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>> >
>>> > Thanks for your update. This patch looks good to me while some comments below.
>>> >
>>> >> ---
>>> >>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>  1 file changed, 85 insertions(+)
>>> >>
>>> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> >> index e47679e7a9db..03df929abca6 100644
>>> >> --- a/mm/page_alloc.c
>>> >> +++ b/mm/page_alloc.c
>>> >> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>>> >>  unsigned int pageblock_order __read_mostly;
>>> >>  #endif
>>> >>
>>> >> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>>> >>  static void __free_pages_ok(struct page *page, unsigned int order,
>>> >>                          fpi_t fpi_flags);
>>> >>  static void reserve_highatomic_pageblock(struct page *page, int order,
>>> >> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>>> >>
>>> >>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>>> >>
>>> >> +/*
>>> >> + * Returns
>>> >> + * - true: checks and preparations all good, caller can proceed freeing.
>>> >> + * - false: do not proceed freeing for one of the following reasons:
>>> >> + *   1. Some check failed so it is not safe to proceed freeing.
>>> >> + *   2. A compound page has some HWPoison pages. The healthy pages
>>> >> + *      are already safely freed, and the HWPoison ones isolated.
>>> >> + */
>>> >>  static __always_inline bool __free_pages_prepare(struct page *page,
>>> >>              unsigned int order, fpi_t fpi_flags)
>>> >>  {
>>> >> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>> >>      bool init = want_init_on_free();
>>> >>      bool compound = PageCompound(page);
>>> >>      struct folio *folio = page_folio(page);
>>> >> +    /*
>>> >> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
>>> >> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
>>> >> +     *
>>> >> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>>> >> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>>> >> +     * confuse and complaint that the first tail page is still active.
>>> >> +     */
>>> >> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>>> >>
>>> >>      if (fpi_flags & FPI_PREPARED)
>>> >>              return true;
>>> >> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>> >>
>>> >>      debug_pagealloc_unmap_pages(page, 1 << order);
>>> >>
>>> >> +    /*
>>> >> +     * After breaking down compound page and dealing with page metadata
>>> >> +     * (e.g. page owner and page alloc tags), take a shortcut if this
>>> >> +     * was a compound page containing certain HWPoison subpages.
>>> >> +     */
>>> >> +    if (should_fhh) {
>>> >> +            free_has_hwpoisoned(page, order);
>>> >> +            return false;
>>> >> +    }
>>> >
>>> > When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
>>> > kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
>>> > the hwpoisoned pages. Is it safe to do so?
>>>
>>> At least, kernel_poison_pages() writes to the page. It probably should be
>>> moved up, somewhere like above kernel_poison_pages().
>>
>> Writing to HWPoison pages (location having memory error) is usually
>> safe, as in it doesn't cause a machine check exception. Memory
>> controller usually just fails the write op and waits for the next read
>> to raise the MCE / exception for prevent silent data corruption.
>>
>>>
>>> I do not like the shortcut method, since the pages are freed in
>>> __free_pages_prepare(). This causes confusion. One alternative I can think
>>
>> What exactly is the confusion? or why does freeing has to be done by
>> __free_pages_prepare()'s caller?
>>
>> Harry suggested the shortcut method [1]. Although freeing inline might
>> surprise the caller, it simplifies things for all callers.
>>
>> [1] https://lore.kernel.org/linux-mm/aVy7L-3pc4JUYBEn@hyeyoo
> 
> The function name is __free_pages_prepare(), but the code ends up with
> freeing the pages if the folio has HWPoison.

It might free some, or leak some (as it already could before this patch). It
seems to me that the only really important part for the caller is whether it
can proceed with freeing or not.

>>
>>> of is to make __free_pages_prepare() returns a enum
>>> { FREE_PAGE_PREPARE_SUCCESS, FREE_PAGE_PREPARE_FAIL, FREE_PAGE_PREPARE_HAS_HWPOISON}
>>> and handle the return value in the caller.

Until the callers need to distinguish that, it sounds like an unnecessary
complication to me.

>> Are you worried that a caller of __free_pages_prepare() may see false
>> returned in the has_hwpoison case, but mistakenly propagate some kind
>> of error, or doing something under the assumption that folio not
>> freed? In this case the three enums can be useful. But I checked
>> current callers of __free_pages_prepare() and they don't have the
>> above problem.
> 
> Right. Looking at compaction_free() code, if dst gets has_hwpoison (not
> possible now, but if in the future migration code decides to mark folios
> when copy fails with EHWPOISON), the next list_add() is going to cause

I think we should fix compaction_free() (as a separate patch preceding this
one) to not proceed if prepare returns false. It could in theory already
happen before this patch due to a random memory corruption of struct page
causing some of the existing checks to fail.

> trouble. Or you can rename the function to
> __free_pages_prepare_and_free_has_hwpoison()? At least, caller knows the
> potential side effect.

Uh, that's long. All callers are in the PAGE ALLOCATOR mm subsystem; it's not a
driver API, so we'll know what we're doing (famous last words :)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-16  6:53         ` Miaohe Lin
@ 2026-06-18 15:02           ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-18 15:02 UTC (permalink / raw)
  To: Miaohe Lin, Jiaqi Yan
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, willy, Zi Yan, harry.yoo

On 6/16/26 08:53, Miaohe Lin wrote:
> On 2026/6/16 11:23, Jiaqi Yan wrote:
>> On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>>>
>>>> On 2026/5/31 13:58, Jiaqi Yan wrote:
>>>>> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>>>>> becomes non-HugeTLB, and it is released to buddy allocator
>>>>> as a high-order folio, e.g. a folio that contains 262144 pages
>>>>> if the folio was a 1G HugeTLB hugepage.
>>>>>
>>>>> This is problematic if the HugeTLB hugepage contained HWPoison
>>>>> subpages. In that case, since buddy allocator does not check
>>>>> HWPoison for non-zero-order folio, the raw HWPoison page can
>>>>> be given out with its buddy page and be re-used by either
>>>>> kernel or userspace.
>>>>>
>>>>> Memory failure recovery (MFR) in kernel does attempt to take
>>>>> raw HWPoison page off buddy allocator after
>>>>> dissolve_free_hugetlb_folio(). However, there is always a time
>>>>> window between dissolve_free_hugetlb_folio() frees a HWPoison
>>>>> high-order folio to buddy allocator and MFR takes HWPoison
>>>>> raw page off buddy allocator.
>>>>>
>>>>> Another similar situation is when a transparent huge page (THP)
>>>>> runs into memory failure but splitting failed. Such THP will
>>>>> eventually be released to buddy allocator when owning userspace
>>>>> processes are gone, but with certain subpages having HWPoison.
>>>>>
>>>>> One obvious way to avoid both problems is to add page sanity
>>>>> checks in page allocate or free path. However, it is against
>>>>> the past efforts to reduce sanity check overhead [1,2,3].
>>>>>
>>>>> Introduce free_has_hwpoisoned() to only free the healthy pages
>>>>> and to exclude the HWPoison ones in the high-order folio.
>>>>> The idea is to iterate through the sub-pages of the folio to
>>>>> identify contiguous ranges of healthy pages.
>>>>>
>>>>> free_has_hwpoisoned() is added in free_pages_prepare() as
>>>>> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>>>>> HWPoison page exists and after checks and preparations in
>>>>> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>>>>> can re-use free_prepared_contig_range() [4] to decompose healthy
>>>>> ranges into the largest possible chunks of different orders.
>>>>> Every chunk meets the requirements to be freed via free_one_page().
>>>>>
>>>>> free_has_hwpoisoned() has linear time complexity wrt the number
>>>>> of pages in the folio. While the power-of-two decomposition
>>>>> ensures that the number of calls to the buddy allocator is
>>>>> logarithmic for each contiguous healthy range, the mandatory
>>>>> linear scan of pages to identify PageHWPoison() defines the
>>>>> overall time complexity. For a 1G hugepage having 8 HWPoison
>>>>> pages, free_has_hwpoisoned() takes around 1ms on average on
>>>>> a system having 56 Intel Skylake physical cores. This is
>>>>> 15x to the case of freeing no HWPoison page. The cost is far
>>>>> from triggering soft lockup, and fair for handling exceptional
>>>>> hardware memory errors.
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>>>>> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>>>>> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>>>> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>>>>>
>>>>> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>>>
>>>> Thanks for your update. This patch looks good to me while some comments below.
>>>>
>>>>> ---
>>>>>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 85 insertions(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index e47679e7a9db..03df929abca6 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>>>>>  unsigned int pageblock_order __read_mostly;
>>>>>  #endif
>>>>>
>>>>> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>>>>>  static void __free_pages_ok(struct page *page, unsigned int order,
>>>>>                          fpi_t fpi_flags);
>>>>>  static void reserve_highatomic_pageblock(struct page *page, int order,
>>>>> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>>>>>
>>>>>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>>>>>
>>>>> +/*
>>>>> + * Returns
>>>>> + * - true: checks and preparations all good, caller can proceed freeing.
>>>>> + * - false: do not proceed freeing for one of the following reasons:
>>>>> + *   1. Some check failed so it is not safe to proceed freeing.
>>>>> + *   2. A compound page has some HWPoison pages. The healthy pages
>>>>> + *      are already safely freed, and the HWPoison ones isolated.
>>>>> + */
>>>>>  static __always_inline bool __free_pages_prepare(struct page *page,
>>>>>              unsigned int order, fpi_t fpi_flags)
>>>>>  {
>>>>> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>>>      bool init = want_init_on_free();
>>>>>      bool compound = PageCompound(page);
>>>>>      struct folio *folio = page_folio(page);
>>>>> +    /*
>>>>> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
>>>>> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
>>>>> +     *
>>>>> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>>>>> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>>>>> +     * confuse and complaint that the first tail page is still active.
>>>>> +     */
>>>>> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>>>>>
>>>>>      if (fpi_flags & FPI_PREPARED)
>>>>>              return true;
>>>>> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>>>
>>>>>      debug_pagealloc_unmap_pages(page, 1 << order);
>>>>>
>>>>> +    /*
>>>>> +     * After breaking down compound page and dealing with page metadata
>>>>> +     * (e.g. page owner and page alloc tags), take a shortcut if this
>>>>> +     * was a compound page containing certain HWPoison subpages.
>>>>> +     */
>>>>> +    if (should_fhh) {
>>>>> +            free_has_hwpoisoned(page, order);
>>>>> +            return false;
>>>>> +    }
>>>>
>>>> When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
>>>> kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
>>>> the hwpoisoned pages. Is it safe to do so?
>>>
>>> At least, kernel_poison_pages() writes to the page. It probably should be
>>> moved up, somewhere like above kernel_poison_pages().
>> 
>> Writing to HWPoison pages (location having memory error) is usually
>> safe, as in it doesn't cause a machine check exception. Memory
> 
> In x86, set_mce_nospec is called for hwpoisoned pages. So writing to
> HWPoison pages would cause unexpected page fault in kernel?
> 
> Thanks.

It seems we'll have to extract everything between kernel_poison_pages() and
debug_pagealloc_unmap_pages() into a function, skip it when should_fhh is
true, and instead call it from free_has_hwpoisoned() on the pages that are
not HWPoison? I think it's acceptable to do it there one page at a time so as
not to complicate things, as most of that code is debug-only and we're
handling a rare event.
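
As a rough userspace model of that idea (not the patch itself), the sketch
below scans the pages of a folio, runs a per-page preparation step only on
healthy pages, and splits each contiguous healthy range into naturally
aligned power-of-two chunks. All sim_* names and SIM_MAX_* limits are
hypothetical; the real free_has_hwpoisoned() and
free_prepared_contig_range() may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_MAX_CHUNKS 64
#define SIM_MAX_ORDER  20

struct sim_page {
	bool hwpoison;   /* models PageHWPoison() */
	bool prepared;   /* set by the per-page preparation step */
};

struct sim_chunk {
	size_t pfn;
	unsigned int order;
};

struct sim_result {
	struct sim_chunk chunks[SIM_MAX_CHUNKS];
	size_t nr_chunks;
};

/* Stand-in for the extracted poison/init/debug steps, one page at a time. */
static void sim_prepare_one(struct sim_page *page)
{
	page->prepared = true;
}

/* Split [start, end) into the largest naturally aligned 2^order chunks. */
static void sim_free_range(size_t start, size_t end, struct sim_result *res)
{
	assert(start <= end);
	while (start < end) {
		unsigned int order = 0;

		/* Grow while the next order stays aligned and inside the range. */
		while (order < SIM_MAX_ORDER &&
		       (start & (((size_t)2 << order) - 1)) == 0 &&
		       start + ((size_t)2 << order) <= end)
			order++;
		if (res->nr_chunks < SIM_MAX_CHUNKS)
			res->chunks[res->nr_chunks++] =
				(struct sim_chunk){ start, order };
		start += (size_t)1 << order;
	}
}

/* Linear scan: prepare and free healthy pages, skip HWPoison ones. */
static void sim_free_has_hwpoisoned(struct sim_page *pages, size_t nr,
				    struct sim_result *res)
{
	size_t range_start = 0;
	bool in_range = false;

	res->nr_chunks = 0;
	for (size_t i = 0; i < nr; i++) {
		if (pages[i].hwpoison) {
			if (in_range)
				sim_free_range(range_start, i, res);
			in_range = false;
			continue;
		}
		sim_prepare_one(&pages[i]);
		if (!in_range) {
			range_start = i;
			in_range = true;
		}
	}
	if (in_range)
		sim_free_range(range_start, nr, res);
}
```

In this model the HWPoison page is never passed to sim_prepare_one(), so
nothing writes to it, while the scan stays linear in the number of pages.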

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
  2026-06-18 14:52           ` Vlastimil Babka (SUSE)
@ 2026-06-18 16:04             ` Zi Yan
  0 siblings, 0 replies; 20+ messages in thread
From: Zi Yan @ 2026-06-18 16:04 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Zi Yan, Jiaqi Yan, Miaohe Lin, harry.yoo
  Cc: osalvador, lorenzo.stoakes, jackmanb, hannes, nao.horiguchi,
	david, william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	muchun.song, liam, rientjes, duenwen, jthoughton, linux-mm,
	linux-kernel, vbabka, rppt, shuah, surenb, mhocko, boudewijn, ljs,
	osalvador, willy

On Thu Jun 18, 2026 at 10:52 AM EDT, Vlastimil Babka (SUSE) wrote:
> On 6/17/26 03:56, Zi Yan wrote:
>> On Mon Jun 15, 2026 at 11:23 PM EDT, Jiaqi Yan wrote:
>>> On Fri, Jun 12, 2026 at 11:34 AM Zi Yan <ziy@nvidia.com> wrote:
>>>>
>>>> On 8 Jun 2026, at 23:44, Miaohe Lin wrote:
>>>>
>>>> > On 2026/5/31 13:58, Jiaqi Yan wrote:
>>>> >> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>>>> >> becomes non-HugeTLB, and it is released to buddy allocator
>>>> >> as a high-order folio, e.g. a folio that contains 262144 pages
>>>> >> if the folio was a 1G HugeTLB hugepage.
>>>> >>
>>>> >> This is problematic if the HugeTLB hugepage contained HWPoison
>>>> >> subpages. In that case, since buddy allocator does not check
>>>> >> HWPoison for non-zero-order folio, the raw HWPoison page can
>>>> >> be given out with its buddy page and be re-used by either
>>>> >> kernel or userspace.
>>>> >>
>>>> >> Memory failure recovery (MFR) in kernel does attempt to take
>>>> >> raw HWPoison page off buddy allocator after
>>>> >> dissolve_free_hugetlb_folio(). However, there is always a time
>>>> >> window between dissolve_free_hugetlb_folio() frees a HWPoison
>>>> >> high-order folio to buddy allocator and MFR takes HWPoison
>>>> >> raw page off buddy allocator.
>>>> >>
>>>> >> Another similar situation is when a transparent huge page (THP)
>>>> >> runs into memory failure but splitting failed. Such THP will
>>>> >> eventually be released to buddy allocator when owning userspace
>>>> >> processes are gone, but with certain subpages having HWPoison.
>>>> >>
>>>> >> One obvious way to avoid both problems is to add page sanity
>>>> >> checks in page allocate or free path. However, it is against
>>>> >> the past efforts to reduce sanity check overhead [1,2,3].
>>>> >>
>>>> >> Introduce free_has_hwpoisoned() to only free the healthy pages
>>>> >> and to exclude the HWPoison ones in the high-order folio.
>>>> >> The idea is to iterate through the sub-pages of the folio to
>>>> >> identify contiguous ranges of healthy pages.
>>>> >>
>>>> >> free_has_hwpoisoned() is added in free_pages_prepare() as
>>>> >> a shortcut and is only invoked if PG_has_hwpoisoned indicates
>>>> >> HWPoison page exists and after checks and preparations in
>>>> >> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
>>>> >> can re-use free_prepared_contig_range() [4] to decompose healthy
>>>> >> ranges into the largest possible chunks of different orders.
>>>> >> Every chunk meets the requirements to be freed via free_one_page().
>>>> >>
>>>> >> free_has_hwpoisoned() has linear time complexity wrt the number
>>>> >> of pages in the folio. While the power-of-two decomposition
>>>> >> ensures that the number of calls to the buddy allocator is
>>>> >> logarithmic for each contiguous healthy range, the mandatory
>>>> >> linear scan of pages to identify PageHWPoison() defines the
>>>> >> overall time complexity. For a 1G hugepage having 8 HWPoison
>>>> >> pages, free_has_hwpoisoned() takes around 1ms on average on
>>>> >> a system having 56 Intel Skylake physical cores. This is
>>>> >> 15x the cost of freeing a folio with no HWPoison page. The cost
>>>> >> is far from triggering a soft lockup, and fair for handling
>>>> >> exceptional hardware memory errors.
>>>> >>
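
To make the scan-and-split described above concrete, here is a minimal
userspace sketch. It is not the patch's code: struct chunk, split_range(),
model_free_has_hwpoisoned() and MODEL_MAX_ORDER are hypothetical names, the
bool array stands in for PageHWPoison(), and the greedy "largest aligned
power-of-two chunk" split is only an assumption about how
free_prepared_contig_range() behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: stand-in for the buddy allocator's maximum order. */
#define MODEL_MAX_ORDER 10u

/* One chunk that would be handed to the buddy allocator. */
struct chunk {
	unsigned long pfn;
	unsigned int order;
};

/*
 * Split the healthy range [start, end) into the largest chunks that are
 * both naturally aligned and fully inside the range. Returns the number
 * of chunks; at most @cap of them are stored in @out.
 */
static size_t split_range(unsigned long start, unsigned long end,
			  struct chunk *out, size_t cap)
{
	size_t n = 0;

	assert(start <= end);
	while (start < end) {
		unsigned int order = MODEL_MAX_ORDER;

		/* Shrink until the chunk is aligned and fits in the range. */
		while (order > 0 &&
		       ((start & ((1UL << order) - 1)) ||
			start + (1UL << order) > end))
			order--;

		if (n < cap) {
			out[n].pfn = start;
			out[n].order = order;
		}
		n++;
		start += 1UL << order;
	}
	return n;
}

/*
 * Linear scan over all pages: skip the poisoned ones and split every
 * contiguous healthy range. Returns the total number of chunks.
 */
static size_t model_free_has_hwpoisoned(const bool *poisoned,
					unsigned long nr_pages,
					struct chunk *out, size_t cap)
{
	size_t n = 0;
	unsigned long i = 0;

	while (i < nr_pages) {
		unsigned long start;

		if (poisoned[i]) {
			i++;
			continue;
		}
		start = i;
		while (i < nr_pages && !poisoned[i])
			i++;
		n += split_range(start, i, n < cap ? out + n : NULL,
				 n < cap ? cap - n : 0);
	}
	return n;
}
```

With 16 pages and only page 5 poisoned, this model yields four chunks:
order 2 at pfn 0, order 0 at pfn 4, order 1 at pfn 6 and order 3 at pfn 8,
i.e. 15 healthy pages freed and one page left out. The scan touches every
page once, while each healthy range produces only a handful of chunks.
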
>>>> >> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>>>> >> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>>>> >> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>>> >> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
>>>> >>
>>>> >> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>>> >
>>>> > Thanks for your update. This patch looks good to me, with some comments below.
>>>> >
>>>> >> ---
>>>> >>  mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>  1 file changed, 85 insertions(+)
>>>> >>
>>>> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> >> index e47679e7a9db..03df929abca6 100644
>>>> >> --- a/mm/page_alloc.c
>>>> >> +++ b/mm/page_alloc.c
>>>> >> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>>>> >>  unsigned int pageblock_order __read_mostly;
>>>> >>  #endif
>>>> >>
>>>> >> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
>>>> >>  static void __free_pages_ok(struct page *page, unsigned int order,
>>>> >>                          fpi_t fpi_flags);
>>>> >>  static void reserve_highatomic_pageblock(struct page *page, int order,
>>>> >> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>>>> >>
>>>> >>  #endif /* CONFIG_MEM_ALLOC_PROFILING */
>>>> >>
>>>> >> +/*
>>>> >> + * Returns
>>>> >> + * - true: checks and preparations all good, caller can proceed freeing.
>>>> >> + * - false: do not proceed freeing for one of the following reasons:
>>>> >> + *   1. Some check failed so it is not safe to proceed freeing.
>>>> >> + *   2. A compound page has some HWPoison pages. The healthy pages
>>>> >> + *      are already safely freed, and the HWPoison ones isolated.
>>>> >> + */
>>>> >>  static __always_inline bool __free_pages_prepare(struct page *page,
>>>> >>              unsigned int order, fpi_t fpi_flags)
>>>> >>  {
>>>> >> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>> >>      bool init = want_init_on_free();
>>>> >>      bool compound = PageCompound(page);
>>>> >>      struct folio *folio = page_folio(page);
>>>> >> +    /*
>>>> >> +     * When dealing with compound page, PG_has_hwpoisoned is cleared
>>>> >> +     * with PAGE_FLAGS_SECOND. So the check must be done first.
>>>> >> +     *
>>>> >> +     * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
>>>> >> +     * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
>>>> >> +     * get confused and complain that the first tail page is still active.
>>>> >> +     */
>>>> >> +    bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>>>> >>
>>>> >>      if (fpi_flags & FPI_PREPARED)
>>>> >>              return true;
>>>> >> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>>>> >>
>>>> >>      debug_pagealloc_unmap_pages(page, 1 << order);
>>>> >>
>>>> >> +    /*
>>>> >> +     * After breaking down compound page and dealing with page metadata
>>>> >> +     * (e.g. page owner and page alloc tags), take a shortcut if this
>>>> >> +     * was a compound page containing certain HWPoison subpages.
>>>> >> +     */
>>>> >> +    if (should_fhh) {
>>>> >> +            free_has_hwpoisoned(page, order);
>>>> >> +            return false;
>>>> >> +    }
>>>> >
>>>> > When the code reaches here, the hwpoisoned pages have passed through kernel_poison_pages,
>>>> > kasan_poison_pages, kernel_init_pages, arch_free_page... These functions might write to
>>>> > the hwpoisoned pages. Is it safe to do so?
>>>>
>>>> At least, kernel_poison_pages() writes to the page. The shortcut should
>>>> probably be moved up, e.g. above kernel_poison_pages().
>>>
>>> Writing to HWPoison pages (locations having a memory error) is usually
>>> safe, in that it doesn't cause a machine check exception. The memory
>>> controller usually just fails the write op and waits for the next read
>>> to raise the MCE / exception to prevent silent data corruption.
>>>
>>>>
>>>> I do not like the shortcut method, since the pages are freed in
>>>> __free_pages_prepare(). This causes confusion. One alternative I can think
>>>
>>> What exactly is the confusion? Or why does freeing have to be done by
>>> __free_pages_prepare()'s caller?
>>>
>>> Harry suggested the shortcut method [1]. Although freeing inline might
>>> surprise the caller, it simplifies things for all callers.
>>>
>>> [1] https://lore.kernel.org/linux-mm/aVy7L-3pc4JUYBEn@hyeyoo
>> 
>> The function name is __free_pages_prepare(), but the code ends up with
>> freeing the pages if the folio has HWPoison.
>
> It might free some, or leak some (already before this patch). Seems to me
> really the only important part for the caller is if it can proceed freeing
> or not.

OK, at least add a comment to __free_pages_prepare() to document all
possible outcomes.

>
>>>
>>>> of is to make __free_pages_prepare() return an enum
>>>> { FREE_PAGE_PREPARE_SUCCESS, FREE_PAGE_PREPARE_FAIL, FREE_PAGE_PREPARE_HAS_HWPOISON}
>>>> and handle the return value in the caller.
>
> Until the callers need to distinguish that, it sounds like an unnecessary
> complication to me.
>

Sure.

>>> Are you worried that a caller of __free_pages_prepare() may see false
>>> returned in the has_hwpoison case, but mistakenly propagate some kind
>>> of error, or do something under the assumption that the folio is not
>>> freed? In this case the three enum values can be useful. But I checked
>>> current callers of __free_pages_prepare() and they don't have the
>>> above problem.
>> 
>> Right. Looking at compaction_free() code, if dst gets has_hwpoison (not
>> possible now, but if in the future migration code decides to mark folios
>> when copy fails with EHWPOISON), the next list_add() is going to cause
>
> I think we should fix compaction_free() (as a separate patch preceding this
> one) to not proceed if prepare returns false. It could in theory already
> happen before this patch due to a random memory corruption of struct page
> causing some of the existing checks to fail.

Something like below should do the job, plus
Fixes: 733aea0b3a7bb ("mm/compaction: add support for >0 order folio memory compaction.")

diff --git a/mm/compaction.c b/mm/compaction.c
index b776f35ad0200..b2104cbe63477 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1876,10 +1876,12 @@ static void compaction_free(struct folio *dst, unsigned long data)
 	struct page *page = &dst->page;
 
 	if (folio_put_testzero(dst)) {
-		free_pages_prepare(page, order);
+		if (!free_pages_prepare(page, order))
+			goto skip;
 		list_add(&dst->lru, &cc->freepages[order]);
 		cc->nr_freepages += 1 << order;
 	}
+skip:
 	cc->nr_migratepages += 1 << order;
 	/*
 	 * someone else has referenced the page, we cannot take it back to our
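
For reference, the control flow after this change can be modeled in
userspace roughly as below. This is only a sketch: struct model_cc,
model_prepare() and model_compaction_free() are hypothetical stand-ins, not
the kernel's structures or functions, and the && form is just an equivalent
way of writing the goto in the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counters kept in struct compact_control. */
struct model_cc {
	unsigned long nr_listed;	/* folios put on the private free list */
	unsigned long nr_freepages;
	unsigned long nr_migratepages;
};

/*
 * Hypothetical stand-in for free_pages_prepare(): false means the caller
 * must not proceed, either because a check failed or because the folio
 * had HWPoison subpages and was already dealt with.
 */
static bool model_prepare(bool checks_ok, bool has_hwpoisoned)
{
	return checks_ok && !has_hwpoisoned;
}

static void model_compaction_free(struct model_cc *cc, unsigned int order,
				  bool last_ref, bool checks_ok,
				  bool has_hwpoisoned)
{
	assert(cc);

	/* Keep the folio only if we dropped the last ref and prepare succeeded. */
	if (last_ref && model_prepare(checks_ok, has_hwpoisoned)) {
		cc->nr_listed++;
		cc->nr_freepages += 1UL << order;
	}

	/* Accounted in every case, as in the diff above. */
	cc->nr_migratepages += 1UL << order;
}
```

The point of the model is that a false return from prepare never reaches
the list_add() equivalent, so a folio that was rejected or already handled
is not put back on the compaction free list.
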

>
>> trouble. Or you can rename the function to
>> __free_pages_prepare_and_free_has_hwpoison()? At least, caller knows the
>> potential side effect.
>
> Uh that's long. All callers are from PAGE ALLOCATOR mm-subsystem, it's not a
> driver API so we'll know what we're doing (famous last words :)

The name above is a half joke. :)

BTW, I do not even trust myself sometimes. ;) Just look at the
compaction_free() issue I introduced myself. But I do not want to be too
pedantic here. So a comment above __free_pages_prepare() should be
sufficient.

-- 
Best Regards,
Yan, Zi




Thread overview: 20+ messages
2026-05-31  5:58 [PATCH v5 0/4] Only free healthy pages in high-order has_hwpoisoned folio Jiaqi Yan
2026-05-31  5:58 ` [PATCH v5 1/4] mm/page_alloc: only " Jiaqi Yan
2026-06-09  3:44   ` Miaohe Lin
2026-06-12 18:34     ` Zi Yan
2026-06-16  3:23       ` Jiaqi Yan
2026-06-16  6:53         ` Miaohe Lin
2026-06-18 15:02           ` Vlastimil Babka (SUSE)
2026-06-17  1:56         ` Zi Yan
2026-06-18 14:52           ` Vlastimil Babka (SUSE)
2026-06-18 16:04             ` Zi Yan
2026-06-15  2:03     ` Jiaqi Yan
2026-05-31  5:58 ` [PATCH v5 2/4] mm/memory-failure: set has_hwpoisoned flags on dissolved HugeTLB folio Jiaqi Yan
2026-06-09  6:34   ` Miaohe Lin
2026-05-31  5:58 ` [PATCH v5 3/4] mm/memory-failure: skip take_page_off_buddy after dissolving HWPoison HugeTLB page Jiaqi Yan
2026-06-09  7:21   ` Miaohe Lin
2026-06-15  0:16     ` Jiaqi Yan
2026-06-16  7:05       ` Miaohe Lin
2026-05-31  5:58 ` [PATCH v5 4/4] selftests/mm: add hard memory failure anonymous 1G HugeTLB page test Jiaqi Yan
2026-06-01 18:04   ` Jiaqi Yan
2026-06-17  7:38   ` Miaohe Lin
