From: Jiaqi Yan <jiaqiyan@google.com>
To: ljs@kernel.org, linmiaohe@huawei.com, osalvador@kernel.org,
ziy@nvidia.com, harry.yoo@oracle.com, willy@infradead.org
Cc: osalvador@suse.de, lorenzo.stoakes@oracle.com,
jackmanb@google.com, hannes@cmpxchg.org,
nao.horiguchi@gmail.com, david@kernel.org,
william.roche@oracle.com, tony.luck@intel.com,
wangkefeng.wang@huawei.com, jane.chu@oracle.com,
akpm@linux-foundation.org, muchun.song@linux.dev,
liam@infradead.org, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, vbabka@suse.cz, rppt@kernel.org,
shuah@kernel.org, surenb@google.com, mhocko@suse.com,
boudewijn@delta-utec.com, Jiaqi Yan <jiaqiyan@google.com>
Subject: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Date: Sun, 31 May 2026 05:58:26 +0000
Message-ID: <20260531055829.3636554-2-jiaqiyan@google.com>
In-Reply-To: <20260531055829.3636554-1-jiaqiyan@google.com>
At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
becomes non-HugeTLB and is released to the buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.
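
For reference, the 262144 figure follows from dividing the hugepage
size by the base page size. A minimal userspace sketch of that
arithmetic, assuming 4 KiB base pages (MODEL_PAGE_SHIFT and both helper
names are hypothetical and not part of this patch):

```c
#include <assert.h>

/* Assumption: 4 KiB base pages; other configurations differ. */
#define MODEL_PAGE_SHIFT 12

/* Number of base pages backing a hugepage of the given size in bytes. */
static unsigned long pages_in_hugepage(unsigned long long hugepage_bytes)
{
	assert(hugepage_bytes >= (1ULL << MODEL_PAGE_SHIFT));
	return (unsigned long)(hugepage_bytes >> MODEL_PAGE_SHIFT);
}

/* Buddy order of a power-of-two page count, i.e. log2(nr_pages). */
static unsigned int order_of(unsigned long nr_pages)
{
	unsigned int order = 0;

	/* Only power-of-two page counts have an order. */
	assert(nr_pages != 0 && (nr_pages & (nr_pages - 1)) == 0);
	while ((1UL << order) < nr_pages)
		order++;
	return order;
}
```

Under that assumption a 1G hugepage is an order-18 folio, so one
undetected HWPoison subpage travels with 262143 healthy ones.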
This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since the buddy allocator does not check
HWPoison for non-zero-order folios, a raw HWPoison page can
be handed out together with its buddy pages and be reused by
either the kernel or userspace.
Memory failure recovery (MFR) in the kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio(). However, there is always a time
window between dissolve_free_hugetlb_folio() freeing a HWPoison
high-order folio to the buddy allocator and MFR taking the raw
HWPoison page off the buddy allocator.
A similar situation arises when a transparent huge page (THP)
runs into a memory failure but splitting it fails. Such a THP
will eventually be released to the buddy allocator once the owning
userspace processes are gone, with some subpages still HWPoison.
One obvious way to avoid both problems is to add page sanity
checks in the page allocation or free path. However, that goes
against past efforts to reduce sanity check overhead [1,2,3].
Introduce free_has_hwpoisoned() to only free the healthy pages
and to exclude the HWPoison ones in the high-order folio.
The idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages.
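
The scan described above can be modelled in userspace roughly as
follows. This is only an illustrative sketch: the bool array stands in
for PageHWPoison(), and struct healthy_range and find_healthy_ranges()
are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one maximal run of healthy pages. */
struct healthy_range {
	size_t start;	/* index of the first healthy page in the run */
	size_t len;	/* number of healthy pages in the run */
};

/*
 * Walk poisoned[0..nr_pages) and record each maximal run of healthy
 * pages in out[]. Runs beyond max_out are dropped. Returns the number
 * of runs recorded; *nr_poisoned receives the number of poisoned
 * pages that were skipped.
 */
static size_t find_healthy_ranges(const bool *poisoned, size_t nr_pages,
				  struct healthy_range *out, size_t max_out,
				  size_t *nr_poisoned)
{
	size_t curr = 0, n = 0;

	assert(poisoned && out && nr_poisoned);
	*nr_poisoned = 0;

	while (curr < nr_pages) {
		size_t next = curr;

		/* Extend the run until a poisoned page or the end. */
		while (next < nr_pages && !poisoned[next])
			next++;

		/* Record non-empty runs only. */
		if (next > curr && n < max_out) {
			out[n].start = curr;
			out[n].len = next - curr;
			n++;
		}

		if (next == nr_pages)
			break;

		/* poisoned[next] is set: count it and step over it. */
		(*nr_poisoned)++;
		curr = next + 1;
	}

	return n;
}
```

For a folio whose pages 2, 4 and 5 are poisoned out of 8, the model
yields the healthy runs [0,2), [3,4) and [6,8). Adjacent poisoned
pages produce an empty run, which is why the sketch skips zero-length
ranges.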
free_has_hwpoisoned() is added in free_pages_prepare() as
a shortcut. It is only invoked if PG_has_hwpoisoned indicates
that a HWPoison page exists, and only after all checks and
preparations in free_pages_prepare() have succeeded.
free_has_hwpoisoned() can then re-use
free_prepared_contig_range() [4] to decompose healthy
ranges into the largest possible chunks of different orders.
Every chunk meets the requirements to be freed via free_one_page().
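
To illustrate what "largest possible chunks of different orders" means,
here is a userspace sketch of a greedy power-of-two decomposition. It is
not the code of free_prepared_contig_range(); struct chunk,
largest_order(), decompose_range() and the MODEL_MAX_ORDER cap are all
hypothetical and chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: cap on chunk order, chosen arbitrarily for this model. */
#define MODEL_MAX_ORDER 10

/* Hypothetical model: one naturally aligned chunk of 2^order pages. */
struct chunk {
	unsigned long pfn;
	unsigned int order;
};

/* Largest order such that pfn is aligned to it and 2^order <= nr. */
static unsigned int largest_order(unsigned long pfn, unsigned long nr)
{
	unsigned int order = 0;

	assert(nr > 0);
	while (order < MODEL_MAX_ORDER &&
	       (pfn & ((1UL << (order + 1)) - 1)) == 0 &&
	       (1UL << (order + 1)) <= nr)
		order++;
	return order;
}

/*
 * Split [pfn, pfn + nr_pages) into aligned power-of-two chunks, always
 * taking the largest chunk that fits at the current position. Stops
 * early if out[] is full. Returns the number of chunks written.
 */
static size_t decompose_range(unsigned long pfn, unsigned long nr_pages,
			      struct chunk *out, size_t max_out)
{
	size_t n = 0;

	while (nr_pages > 0 && n < max_out) {
		unsigned int order = largest_order(pfn, nr_pages);

		out[n].pfn = pfn;
		out[n].order = order;
		n++;

		pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}

	return n;
}
```

In this model a healthy range of 10 pages starting at pfn 3 splits
into chunks at pfn 3 (order 0), 4 (order 2), 8 (order 2) and 12
(order 0). The chunk count stays logarithmic in the range length.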
free_has_hwpoisoned() has linear time complexity with respect to
the number of pages in the folio. While the power-of-two
decomposition ensures that the number of calls to the buddy
allocator is logarithmic for each contiguous healthy range, the
mandatory linear scan of pages to check PageHWPoison() dominates
the overall time complexity. For a 1G hugepage having 8 HWPoison
pages, free_has_hwpoisoned() takes around 1ms on average on
a system having 56 Intel Skylake physical cores. This is about
15x the cost of freeing a folio with no HWPoison pages. The cost
is far from triggering a soft lockup, and is fair for handling
exceptional hardware memory errors.
[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@arm.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e47679e7a9db..03df929abca6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
unsigned int pageblock_order __read_mostly;
#endif
+static void free_has_hwpoisoned(struct page *page, unsigned int order);
static void __free_pages_ok(struct page *page, unsigned int order,
fpi_t fpi_flags);
static void reserve_highatomic_pageblock(struct page *page, int order,
@@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
#endif /* CONFIG_MEM_ALLOC_PROFILING */
+/*
+ * Returns
+ * - true: checks and preparations all good, caller can proceed freeing.
+ * - false: do not proceed freeing for one of the following reasons:
+ * 1. Some check failed so it is not safe to proceed freeing.
+ * 2. A compound page has some HWPoison pages. The healthy pages
+ * are already safely freed, and the HWPoison ones isolated.
+ */
static __always_inline bool __free_pages_prepare(struct page *page,
unsigned int order, fpi_t fpi_flags)
{
@@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
bool init = want_init_on_free();
bool compound = PageCompound(page);
struct folio *folio = page_folio(page);
+ /*
+ * When dealing with compound page, PG_has_hwpoisoned is cleared
+ * with PAGE_FLAGS_SECOND. So the check must be done first.
+ *
+ * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND:
+ * because PG_has_hwpoisoned == PG_active, free_page_is_bad() would
+ * get confused and complain that the first tail page is still active.
+ */
+ bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
if (fpi_flags & FPI_PREPARED)
return true;
@@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
debug_pagealloc_unmap_pages(page, 1 << order);
+ /*
+ * After breaking down compound page and dealing with page metadata
+ * (e.g. page owner and page alloc tags), take a shortcut if this
+ * was a compound page containing certain HWPoison subpages.
+ */
+ if (should_fhh) {
+ free_has_hwpoisoned(page, order);
+ return false;
+ }
+
return true;
}
@@ -6936,6 +6964,63 @@ void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
__free_contig_range_common(pfn, nr_pages, /* is_frozen= */ false);
}
+/*
+ * Given a high-order compound page containing some number of HWPoison
+ * pages, free only the healthy ones.
+ *
+ * Pages must have passed free_pages_prepare(). Even if having HWPoison
+ * pages, breaking down compound page and updating metadata (e.g. page
+ * owner, alloc tag) can be done together during free_pages_prepare(),
+ * which simplifies the splitting here: unlike __split_unmapped_folio(),
+ * there is no need to turn split pages into a compound page or to carry
+ * metadata.
+ *
+ * It scans every raw page of the compound page and causes nontrivial
+ * overhead, so only use this when the compound page contains HWPoison
+ * page(s).
+ *
+ * This implementation needs rework in memdesc world.
+ */
+static void free_has_hwpoisoned(struct page *page, unsigned int order)
+{
+ unsigned long curr = page_to_pfn(page);
+ unsigned long end_pfn = curr + (1 << order);
+ unsigned long next;
+ unsigned long total_freed = 0;
+ unsigned long total_hwp = 0;
+
+ VM_WARN_ON(order == 0);
+ VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
+
+ while (curr < end_pfn) {
+ next = curr;
+
+ while (next < end_pfn && !PageHWPoison(pfn_to_page(next)))
+ ++next;
+
+ if (next != end_pfn && PageHWPoison(pfn_to_page(next))) {
+ /*
+ * Avoid accounting error when the page is freed
+ * by unpoison_memory().
+ */
+ clear_page_tag_ref(pfn_to_page(next));
+ ++total_hwp;
+ }
+
+ free_prepared_contig_range(pfn_to_page(curr), next - curr);
+ total_freed += next - curr;
+
+ if (next == end_pfn)
+ break;
+
+ VM_WARN_ON(!PageHWPoison(pfn_to_page(next)));
+ curr = next + 1;
+ }
+
+ VM_WARN_ON(total_freed + total_hwp != (1 << order));
+ pr_info("Freed %#lx pages, excluded %lu HWPoison pages\n",
+ total_freed, total_hwp);
+}
+
#ifdef CONFIG_CONTIG_ALLOC
/* Usage: See admin-guide/dynamic-debug-howto.rst */
static void alloc_contig_dump_pages(struct list_head *page_list)
--
2.54.0.823.g6e5bcc1fc9-goog