* [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
2026-04-24 12:23 [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
@ 2026-04-24 12:23 ` Breno Leitao
2026-04-27 12:33 ` Lance Yang
2026-04-24 12:24 ` [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
` (4 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Breno Leitao @ 2026-04-24 12:23 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
kernel-team
When get_hwpoison_page() returns a negative value, distinguish
reserved pages from other failure cases by reporting MF_MSG_KERNEL
instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
and should be classified accordingly for proper handling.
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d43613097..7b67e43dafbd1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2432,7 +2432,16 @@ int memory_failure(unsigned long pfn, int flags)
}
goto unlock_mutex;
} else if (res < 0) {
- res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
+ /*
+ * PageReserved is stable here: reserved pages have
+ * PG_reserved set at boot or by drivers and are never
+ * freed through the page allocator.
+ */
+ if (PageReserved(p))
+ res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+ else
+ res = action_result(pfn, MF_MSG_GET_HWPOISON,
+ MF_IGNORED);
goto unlock_mutex;
}
--
2.52.0
* Re: [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
2026-04-24 12:23 ` [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
@ 2026-04-27 12:33 ` Lance Yang
2026-04-27 14:45 ` Breno Leitao
2026-04-27 15:57 ` Lance Yang
0 siblings, 2 replies; 18+ messages in thread
From: Lance Yang @ 2026-04-27 12:33 UTC (permalink / raw)
To: leitao
Cc: linmiaohe, nao.horiguchi, akpm, corbet, skhan, david, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, shuah, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kernel-team, Lance Yang
On Fri, Apr 24, 2026 at 05:23:59AM -0700, Breno Leitao wrote:
>When get_hwpoison_page() returns a negative value, distinguish
>reserved pages from other failure cases by reporting MF_MSG_KERNEL
>instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>and should be classified accordingly for proper handling.
>
>Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>Signed-off-by: Breno Leitao <leitao@debian.org>
>---
> mm/memory-failure.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>index ee42d43613097..7b67e43dafbd1 100644
>--- a/mm/memory-failure.c
>+++ b/mm/memory-failure.c
>@@ -2432,7 +2432,16 @@ int memory_failure(unsigned long pfn, int flags)
> }
> goto unlock_mutex;
> } else if (res < 0) {
>- res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>+ /*
>+ * PageReserved is stable here: reserved pages have
>+ * PG_reserved set at boot or by drivers and are never
>+ * freed through the page allocator.
>+ */
Not necessarily. PG_reserved is not a permanent lifetime property for
every page that has carried it.
page-flags.h says early reserved pages may later have PG_reserved
cleared and then be given to the page allocator :)
At least some drivers also clear PG_reserved when releasing pages they
marked reserved.
Would it be clearer to say that pages with PG_reserved set are not
currently managed by the page allocator, rather than saying reserved
pages are never freed through the page allocator?
Otherwise, LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
2026-04-27 12:33 ` Lance Yang
@ 2026-04-27 14:45 ` Breno Leitao
2026-04-27 15:14 ` Lance Yang
2026-04-27 15:57 ` Lance Yang
1 sibling, 1 reply; 18+ messages in thread
From: Breno Leitao @ 2026-04-27 14:45 UTC (permalink / raw)
To: Lance Yang
Cc: linmiaohe, nao.horiguchi, akpm, corbet, skhan, david, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, shuah, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kernel-team
On Mon, Apr 27, 2026 at 08:33:30PM +0800, Lance Yang wrote:
>
> On Fri, Apr 24, 2026 at 05:23:59AM -0700, Breno Leitao wrote:
> >When get_hwpoison_page() returns a negative value, distinguish
> >reserved pages from other failure cases by reporting MF_MSG_KERNEL
> >instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
> >and should be classified accordingly for proper handling.
> >
> >Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> >Signed-off-by: Breno Leitao <leitao@debian.org>
> >---
> > mm/memory-failure.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> >diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >index ee42d43613097..7b67e43dafbd1 100644
> >--- a/mm/memory-failure.c
> >+++ b/mm/memory-failure.c
> >@@ -2432,7 +2432,16 @@ int memory_failure(unsigned long pfn, int flags)
> > }
> > goto unlock_mutex;
> > } else if (res < 0) {
> >- res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> >+ /*
> >+ * PageReserved is stable here: reserved pages have
> >+ * PG_reserved set at boot or by drivers and are never
> >+ * freed through the page allocator.
> >+ */
>
> Not necessarily. PG_reserved is not a permanent lifetime property for
> every page that has carried it.
>
> page-flags.h says early reserved pages may later have PG_reserved
> cleared and then be given to the page allocator :)
>
> At least some drivers also clear PG_reserved when releasing pages they
> marked reserved.
>
> Would it be clearer to say that pages with PG_reserved set are not
> currently managed by the page allocator, rather than saying reserved
> pages are never freed through the page allocator?
Would a comment like the following look better?
/*
* Pages with PG_reserved set are not currently managed by the
* page allocator (memblock-reserved memory, driver reservations,
* etc.), so classify them as kernel-owned for reporting.
*/
if (PageReserved(p))
res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
Thanks for the review,
--breno
* Re: [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
2026-04-27 14:45 ` Breno Leitao
@ 2026-04-27 15:14 ` Lance Yang
0 siblings, 0 replies; 18+ messages in thread
From: Lance Yang @ 2026-04-27 15:14 UTC (permalink / raw)
To: Breno Leitao
Cc: linmiaohe, nao.horiguchi, akpm, corbet, skhan, david, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, shuah, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kernel-team
On 2026/4/27 22:45, Breno Leitao wrote:
> On Mon, Apr 27, 2026 at 08:33:30PM +0800, Lance Yang wrote:
>>
>> On Fri, Apr 24, 2026 at 05:23:59AM -0700, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 11 ++++++++++-
>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index ee42d43613097..7b67e43dafbd1 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2432,7 +2432,16 @@ int memory_failure(unsigned long pfn, int flags)
>>> }
>>> goto unlock_mutex;
>>> } else if (res < 0) {
>>> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> + /*
>>> + * PageReserved is stable here: reserved pages have
>>> + * PG_reserved set at boot or by drivers and are never
>>> + * freed through the page allocator.
>>> + */
>>
>> Not necessarily. PG_reserved is not a permanent lifetime property for
>> every page that has carried it.
>>
>> page-flags.h says early reserved pages may later have PG_reserved
>> cleared and then be given to the page allocator :)
>>
>> At least some drivers also clear PG_reserved when releasing pages they
>> marked reserved.
>>
>> Would it be clearer to say that pages with PG_reserved set are not
>> currently managed by the page allocator, rather than saying reserved
>> pages are never freed through the page allocator?
>
> Would a comment like the following look better?
>
> /*
> * Pages with PG_reserved set are not currently managed by the
> * page allocator (memblock-reserved memory, driver reservations,
> * etc.), so classify them as kernel-owned for reporting.
> */
> if (PageReserved(p))
> res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
Works for me, thanks.
* Re: [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
2026-04-27 12:33 ` Lance Yang
2026-04-27 14:45 ` Breno Leitao
@ 2026-04-27 15:57 ` Lance Yang
1 sibling, 0 replies; 18+ messages in thread
From: Lance Yang @ 2026-04-27 15:57 UTC (permalink / raw)
To: leitao
Cc: linmiaohe, nao.horiguchi, akpm, corbet, skhan, david, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, shuah, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kernel-team, Lance Yang
On Mon, Apr 27, 2026 at 08:33:30PM +0800, Lance Yang wrote:
>
>On Fri, Apr 24, 2026 at 05:23:59AM -0700, Breno Leitao wrote:
>>When get_hwpoison_page() returns a negative value, distinguish
>>reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>and should be classified accordingly for proper handling.
>>
>>Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>Signed-off-by: Breno Leitao <leitao@debian.org>
>>---
>> mm/memory-failure.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>>diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>index ee42d43613097..7b67e43dafbd1 100644
>>--- a/mm/memory-failure.c
>>+++ b/mm/memory-failure.c
>>@@ -2432,7 +2432,16 @@ int memory_failure(unsigned long pfn, int flags)
>> }
>> goto unlock_mutex;
>> } else if (res < 0) {
>>- res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>+ /*
>>+ * PageReserved is stable here: reserved pages have
>>+ * PG_reserved set at boot or by drivers and are never
>>+ * freed through the page allocator.
>>+ */
>
>Not necessarily. PG_reserved is not a permanent lifetime property for
>every page that has carried it.
>
>page-flags.h says early reserved pages may later have PG_reserved
>cleared and then be given to the page allocator :)
>
>At least some drivers also clear PG_reserved when releasing pages they
>marked reserved.
>
>Would it be clearer to say that pages with PG_reserved set are not
>currently managed by the page allocator, rather than saying reserved
>pages are never freed through the page allocator?
>
>Otherwise, LGTM.
Ouch, I missed one more thing ...
Sashiko pointed out [1]:
> + if (PageReserved(p))
"Can this introduce a use-after-free risk on the struct page?"
get_any_page() may put the page before returning -EIO. After that ref is
dropped, PageReserved(p) is not safe, IIUC :(
Maybe just cache it before the call?
is_reserved = PageReserved(p);
res = get_hwpoison_page(p, flags);
[1] https://sashiko.dev/#/patchset/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org
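The "cache it before the call" idea can be sketched in plain userspace C. This is only a toy model, not kernel code: struct fake_page, try_get_fake_page() and classify_failure() are hypothetical names, and the toy frees the object outright just to make the hazard visible. Whether and how that maps onto struct page lifetime is the question raised above, not something this sketch settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct page; not kernel code. */
struct fake_page {
	int refcount;
	bool reserved;
};

/*
 * Stand-in for a helper that may drop the last reference before
 * reporting failure. When it returns < 0 the object has been freed
 * and the caller's pointer has been cleared.
 */
static int try_get_fake_page(struct fake_page **pp)
{
	struct fake_page *p = *pp;

	if (--p->refcount == 0) {
		free(p);
		*pp = NULL;
		return -1;
	}
	return 0;
}

/* Returns 1 for "kernel-owned", 0 for "generic failure", -1 on success. */
static int classify_failure(struct fake_page **pp)
{
	/* Sample the flag while the object is still known to be alive. */
	bool is_reserved = (*pp)->reserved;

	if (try_get_fake_page(pp) < 0)
		return is_reserved ? 1 : 0;	/* only the cached value is used */
	return -1;
}
```

The point of the ordering is that the failure branch never dereferences the pointer again; it decides purely from the value sampled before the call.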
* [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-24 12:23 [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-04-24 12:23 ` [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
@ 2026-04-24 12:24 ` Breno Leitao
2026-04-27 15:49 ` David Hildenbrand (Arm)
2026-04-24 12:24 ` [PATCH v5 3/4] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
` (3 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Breno Leitao @ 2026-04-24 12:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
kernel-team
Add a sysctl panic_on_unrecoverable_memory_failure that triggers a
kernel panic when memory_failure() encounters pages that cannot be
recovered. This provides a clean crash with useful debug information
rather than allowing silent data corruption or a delayed crash at an
unrelated code path.
The panic is triggered for three categories of unrecoverable failures,
all requiring result == MF_IGNORED:
- MF_MSG_KERNEL: reserved pages identified via PageReserved.
- MF_MSG_KERNEL_HIGH_ORDER: pages that get_hwpoison_page() observed
with refcount 0 but that are not in the buddy allocator (e.g. tail
pages of a high-order kernel allocation). A buddy page being
concurrently allocated to userspace can briefly land on this branch
too — its refcount is 0 inside the allocator and it is no longer on
the buddy free list — and panicking on such a page would defeat the
standard SIGBUS recovery path. The page allocator cannot reject
hwpoisoned buddy pages reliably either: check_new_pages() is gated by
is_check_pages_enabled() and is a no-op when CONFIG_DEBUG_VM=n.
Rule out the race inside panic_on_unrecoverable_mf(): yield with
cpu_relax() so a concurrent allocator on another CPU can finish
prep_new_page() and have its writes become visible, then re-check.
A genuine high-order kernel tail page stays unowned (refcount 0,
no LRU, no mapping, not in buddy); an in-flight allocation will
have bumped the refcount, attached a mapping, or placed the page
on an LRU by then. Only panic if the recheck still observes a
fully unowned page. The window is narrowed, not eliminated, but
is far below any allocator path's cost.
- MF_MSG_UNKNOWN: pages that do not match any known recoverable state
in error_states[]. A theoretical false positive from concurrent LRU
isolation is mitigated by identify_page_state()'s two-pass design
which rechecks using saved page_flags.
MF_MSG_GET_HWPOISON is intentionally excluded: it covers both
non-reserved kernel memory (SLAB/SLUB, vmalloc, kernel stacks, page
tables) and transient refcount races, so panicking would risk false
positives.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 91 insertions(+)
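The panic policy described in the commit message can be modelled as a pure decision function. The sketch below is a userspace approximation under stated assumptions: the enum names are hypothetical mirrors of the MF_MSG_* and MF_* values, and the still_unowned parameter stands in for the outcome of the recheck that the patch performs for MF_MSG_KERNEL_HIGH_ORDER. It does not reproduce the recheck itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirrors of the message types and results named above. */
enum msg_type {
	MSG_KERNEL,
	MSG_KERNEL_HIGH_ORDER,
	MSG_UNKNOWN,
	MSG_GET_HWPOISON,
	MSG_OTHER,
};

enum result { RES_IGNORED, RES_FAILED, RES_DELAYED, RES_RECOVERED };

/*
 * Decide whether to panic. 'still_unowned' is an assumed input that
 * represents "the recheck still saw a fully unowned page"; it only
 * matters for MSG_KERNEL_HIGH_ORDER.
 */
static bool should_panic(bool sysctl_enabled, enum msg_type type,
			 enum result res, bool still_unowned)
{
	if (!sysctl_enabled || res != RES_IGNORED)
		return false;

	switch (type) {
	case MSG_KERNEL:
	case MSG_UNKNOWN:
		return true;
	case MSG_KERNEL_HIGH_ORDER:
		return still_unowned;
	default:
		/* Includes MSG_GET_HWPOISON, which is deliberately excluded. */
		return false;
	}
}
```

Keeping the policy separate from the page-state sampling makes it easy to see that every panic requires both the sysctl and an MF_IGNORED-style result.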
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7b67e43dafbd1..fd1aed1af94a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "panic_on_unrecoverable_memory_failure",
+ .data = &sysctl_panic_on_unrecoverable_mf,
+ .maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
}
};
@@ -1281,6 +1292,75 @@ static void update_per_node_mf_stats(unsigned long pfn,
++mf_stats->total;
}
+/*
+ * Determine whether to panic on an unrecoverable memory failure.
+ *
+ * Panics on three categories of failures (all requiring result == MF_IGNORED):
+ *
+ * - MF_MSG_KERNEL: Reserved pages (PageReserved) that belong to the kernel.
+ *
+ * - MF_MSG_KERNEL_HIGH_ORDER: Pages that get_hwpoison_page() observed with
+ * refcount 0 but that are not in the buddy allocator (e.g. tail pages of
+ * a high-order kernel allocation). A buddy page being concurrently
+ * allocated could also reach this branch — its refcount is briefly 0
+ * inside the allocator and it is no longer on the buddy free list — and
+ * such a page may be destined for userspace, where the standard hwpoison
+ * path would recover it via SIGBUS. The page allocator cannot reject
+ * hwpoisoned buddy pages reliably either: check_new_pages() is gated by
+ * is_check_pages_enabled() and is a no-op when CONFIG_DEBUG_VM=n. The
+ * recheck below rules out this race before panicking.
+ *
+ * - MF_MSG_UNKNOWN: Pages that reached identify_page_state() but matched no
+ * recoverable state in error_states[]. A theoretical false positive from
+ * concurrent LRU isolation is mitigated by identify_page_state()'s
+ * two-pass design which rechecks using saved page_flags.
+ *
+ * MF_MSG_GET_HWPOISON is intentionally excluded: it covers dynamically
+ * allocated kernel memory (SLAB/SLUB, vmalloc, kernel stacks, page tables)
+ * which shares the return path with transient refcount races, so panicking
+ * would risk false positives.
+ */
+static bool panic_on_unrecoverable_mf(unsigned long pfn,
+ enum mf_action_page_type type,
+ enum mf_result result)
+{
+ struct page *p;
+
+ if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
+ return false;
+
+ switch (type) {
+ case MF_MSG_KERNEL:
+ case MF_MSG_UNKNOWN:
+ return true;
+ case MF_MSG_KERNEL_HIGH_ORDER:
+ /*
+ * Rule out a concurrent buddy allocation: give the
+ * allocator a moment to finish prep_new_page() and
+ * re-check. A genuine high-order kernel tail page stays
+ * unowned; an in-flight allocation will have bumped the
+ * refcount, attached a mapping, or placed the page on
+ * an LRU by now.
+ */
+ p = pfn_to_online_page(pfn);
+ if (!p)
+ return true;
+ /*
+ * Yield so a concurrent allocator on another CPU can
+ * finish prep_new_page() and have its writes become
+ * visible before we resample the page state.
+ */
+ cpu_relax();
+ return page_count(p) == 0 &&
+ !PageLRU(p) &&
+ !page_mapped(p) &&
+ !page_folio(p)->mapping &&
+ !is_free_buddy_page(p);
+ default:
+ return false;
+ }
+}
+
/*
* "Dirty/Clean" indication is not 100% accurate due to the possibility of
* setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1298,6 +1378,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
pr_err("%#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
+ if (panic_on_unrecoverable_mf(pfn, type, result))
+ panic("Memory failure: %#lx: unrecoverable page", pfn);
+
return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
}
@@ -2428,6 +2511,14 @@ int memory_failure(unsigned long pfn, int flags)
}
res = action_result(pfn, MF_MSG_BUDDY, res);
} else {
+ /*
+ * The page has refcount 0 but is not in the buddy
+ * allocator — typically a tail page of a high-order
+ * kernel allocation. A buddy page being concurrently
+ * allocated to userspace can also briefly land here;
+ * panic_on_unrecoverable_mf() rechecks to rule that
+ * out before triggering a panic.
+ */
res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
}
goto unlock_mutex;
--
2.52.0
* Re: [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-24 12:24 ` [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
@ 2026-04-27 15:49 ` David Hildenbrand (Arm)
2026-04-28 3:07 ` Lance Yang
0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-27 15:49 UTC (permalink / raw)
To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, kernel-team
> + switch (type) {
> + case MF_MSG_KERNEL:
> + case MF_MSG_UNKNOWN:
> + return true;
> + case MF_MSG_KERNEL_HIGH_ORDER:
> + /*
> + * Rule out a concurrent buddy allocation: give the
> + * allocator a moment to finish prep_new_page() and
> + * re-check. A genuine high-order kernel tail page stays
> + * unowned; an in-flight allocation will have bumped the
> + * refcount, attached a mapping, or placed the page on
> + * an LRU by now.
> + */
> + p = pfn_to_online_page(pfn);
> + if (!p)
> + return true;
> + /*
> + * Yield so a concurrent allocator on another CPU can
> + * finish prep_new_page() and have its writes become
> + * visible before we resample the page state.
> + */
> + cpu_relax();
> + return page_count(p) == 0 &&
> + !PageLRU(p) &&
> + !page_mapped(p) &&
> + !page_folio(p)->mapping &&
> + !is_free_buddy_page(p);
I don't get what you are doing here. The right way to check for a tail page is
not by checking the refcount.
Further, you are not holding a folio reference? If so, calling
page_mapped/folio_mapped is shaky. On concurrent folio split you can trigger a
VM_WARN_ON_FOLIO().
Maybe folio_snapshot() is what you are looking for, if you are in fact not
holding a reference?
--
Cheers,
David
* Re: [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-27 15:49 ` David Hildenbrand (Arm)
@ 2026-04-28 3:07 ` Lance Yang
2026-05-06 16:18 ` Breno Leitao
0 siblings, 1 reply; 18+ messages in thread
From: Lance Yang @ 2026-04-28 3:07 UTC (permalink / raw)
To: leitao, david
Cc: linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs, Liam.Howlett,
vbabka, rppt, surenb, mhocko, shuah, linux-mm, linux-kernel,
linux-doc, linux-kselftest, kernel-team, Lance Yang
On Mon, Apr 27, 2026 at 05:49:28PM +0200, David Hildenbrand (Arm) wrote:
>> + switch (type) {
>> + case MF_MSG_KERNEL:
>> + case MF_MSG_UNKNOWN:
>> + return true;
>> + case MF_MSG_KERNEL_HIGH_ORDER:
>> + /*
>> + * Rule out a concurrent buddy allocation: give the
>> + * allocator a moment to finish prep_new_page() and
>> + * re-check. A genuine high-order kernel tail page stays
>> + * unowned; an in-flight allocation will have bumped the
>> + * refcount, attached a mapping, or placed the page on
>> + * an LRU by now.
>> + */
>> + p = pfn_to_online_page(pfn);
>> + if (!p)
>> + return true;
>> + /*
>> + * Yield so a concurrent allocator on another CPU can
>> + * finish prep_new_page() and have its writes become
>> + * visible before we resample the page state.
>> + */
>> + cpu_relax();
>> + return page_count(p) == 0 &&
>> + !PageLRU(p) &&
>> + !page_mapped(p) &&
>> + !page_folio(p)->mapping &&
>> + !is_free_buddy_page(p);
>
>I don't get what you are doing here. The right way to check for a tail page is
>not by checking the refcount.
>
>Further, you are not holding a folio reference? If so, calling
>page_mapped/folio_mapped is shaky. On concurrent folio split you can trigger a
>VM_WARN_ON_FOLIO().
>
>
>Maybe folio_snapshot() is what you are looking for, if you are in fact not
>holding a reference?
Right! Maybe we should not try to make this decision in
panic_on_unrecoverable_mf().
By the time we get here, we only know the final MF_MSG_* type. The
real reason why get_hwpoison_page() failed is already lost.
Wonder if it would be better to split that earlier, around
__get_unpoison_page()/get_any_page(). That code still knows why
grabbing the page failed, either an unsupported kernel page or
just a temporary race we cannot really trust :)
Then the later panic logic can be simple: panic for the stable
unsupported kernel page case, and not for the temporary race case.
That would also avoid trying to guess MF_MSG_KERNEL_HIGH_ORDER here:)
Cheers,
Lance
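The suggestion above, recording the failure reason where it is still known and keeping the later decision trivial, might look roughly like this in userspace C. All names here (grab_status, grab_page(), classify(), the report values) are hypothetical and chosen for illustration; the handlable and raced inputs stand in for whatever the real code observes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the kernel's definitions. */
enum grab_status {
	GRAB_OK = 0,
	GRAB_RACE,		/* transient: state changed under us */
	GRAB_UNSUPPORTED,	/* stable: a page type we cannot handle */
};

enum report {
	REPORT_NONE,
	REPORT_KERNEL,		/* stable failure, candidate for panic */
	REPORT_GET_FAILED,	/* transient failure, never panic */
};

/*
 * Toy grab routine. The reason is written through 'status' only when
 * the caller passed a non-NULL pointer, so callers that do not care
 * can pass NULL.
 */
static int grab_page(int handlable, int raced, enum grab_status *status)
{
	enum grab_status st = GRAB_OK;

	if (raced)
		st = GRAB_RACE;
	else if (!handlable)
		st = GRAB_UNSUPPORTED;

	if (status)
		*status = st;
	return st == GRAB_OK ? 0 : -1;
}

/* The caller no longer guesses: it reads the recorded reason. */
static enum report classify(int handlable, int raced)
{
	enum grab_status st = GRAB_OK;

	if (grab_page(handlable, raced, &st) == 0)
		return REPORT_NONE;
	return st == GRAB_UNSUPPORTED ? REPORT_KERNEL : REPORT_GET_FAILED;
}
```

With the reason carried out of the grab routine, the later policy reduces to a comparison against a single status value instead of resampling page state.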
* Re: [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-28 3:07 ` Lance Yang
@ 2026-05-06 16:18 ` Breno Leitao
0 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-05-06 16:18 UTC (permalink / raw)
To: Lance Yang
Cc: david, linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, shuah, linux-mm,
linux-kernel, linux-doc, linux-kselftest, kernel-team
On Tue, Apr 28, 2026 at 11:07:21AM +0800, Lance Yang wrote:
>
> On Mon, Apr 27, 2026 at 05:49:28PM +0200, David Hildenbrand (Arm) wrote:
> >> + switch (type) {
> >> + case MF_MSG_KERNEL:
> >> + case MF_MSG_UNKNOWN:
> >> + return true;
> >> + case MF_MSG_KERNEL_HIGH_ORDER:
> >> + /*
> >> + * Rule out a concurrent buddy allocation: give the
> >> + * allocator a moment to finish prep_new_page() and
> >> + * re-check. A genuine high-order kernel tail page stays
> >> + * unowned; an in-flight allocation will have bumped the
> >> + * refcount, attached a mapping, or placed the page on
> >> + * an LRU by now.
> >> + */
> >> + p = pfn_to_online_page(pfn);
> >> + if (!p)
> >> + return true;
> >> + /*
> >> + * Yield so a concurrent allocator on another CPU can
> >> + * finish prep_new_page() and have its writes become
> >> + * visible before we resample the page state.
> >> + */
> >> + cpu_relax();
> >> + return page_count(p) == 0 &&
> >> + !PageLRU(p) &&
> >> + !page_mapped(p) &&
> >> + !page_folio(p)->mapping &&
> >> + !is_free_buddy_page(p);
> >
> >I don't get what you are doing here. The right way to check for a tail page is
> >not by checking the refcount.
> >
> >Further, you are not holding a folio reference? If so, calling
> >page_mapped/folio_mapped is shaky. On concurrent folio split you can trigger a
> >VM_WARN_ON_FOLIO().
> >
> >
> >Maybe folio_snapshot() is what you are looking for, if you are in fact not
> >holding a reference?
>
> Right! Maybe we should not try to make this decision in
> panic_on_unrecoverable_mf().
>
> By the time we get here, we only know the final MF_MSG_* type. The
> real reason why get_hwpoison_page() failed is already lost.
>
> Wonder if it would be better to split that earlier, around
> __get_unpoison_page()/get_any_page(). That code still knows why
> grabbing the page failed, either an unsupported kernel page or
> just a temporary race we cannot really trust :)
>
> Then the later panic logic can be simple: panic for the stable
> unsupported kernel page case, and not for the temporary race case.
>
> That would also avoid trying to guess MF_MSG_KERNEL_HIGH_ORDER here:)
This is very good feedback, and it is exactly what I wanted to do but
failed to. Once we have the reason, we don't need this dance to guess
it.
I've hacked a patch based on this approach. How does it sound?
commit ae7a09c989afe7aaed7ac4b5090d993ef1de0b38
Author: Breno Leitao <leitao@debian.org>
Date: Wed May 6 07:41:30 2026 -0700
mm/memory-failure: classify get_any_page() failures by reason
When get_any_page() fails to grab a page reference, the *reason* it
failed is known at the call site but is not surfaced to callers: the
HWPoisonHandlable() rejection path (a stable kernel page hwpoison cannot
handle — slab, vmalloc, page tables, kernel stacks, ...) and the
page_count() / put_page race paths (a transient page-allocator lifecycle
race) all collapse to a single negative errno by the time
memory_failure() sees them. memory_failure() can only observe the
conflated result and reports both as MF_MSG_GET_HWPOISON.
Surface the diagnosis explicitly. Add an mf_get_page_status enum,
plumbed out through get_any_page() and get_hwpoison_page() (NULL is
accepted by callers that do not care — unpoison_memory() and
soft_offline_page() pass NULL). get_any_page() sets the status at the
moment it gives up:
MF_GET_PAGE_UNHANDLABLE — HWPoisonHandlable() rejected the page
after retries.
MF_GET_PAGE_RACE — exhausted retries on a refcount /
lifecycle race with the allocator.
memory_failure() then promotes the unhandlable case to MF_MSG_KERNEL
alongside the existing PageReserved branch, and leaves the
transient-race case as MF_MSG_GET_HWPOISON. The user-visible report
now distinguishes the two; this also forms the foundation a later
patch will rely on to decide whether an unrecoverable failure should
panic.
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f112fb27a8ff6..a83fabadbce99 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1389,7 +1389,32 @@ static int __get_hwpoison_page(struct page *page, unsigned long flags)
#define GET_PAGE_MAX_RETRY_NUM 3
-static int get_any_page(struct page *p, unsigned long flags)
+/*
+ * Diagnosis of why get_any_page() failed to grab a page reference.
+ *
+ * Set when ret < 0 so callers (notably memory_failure()) can tell apart
+ * a stable kernel page type that hwpoison cannot handle — slab, vmalloc,
+ * page tables, kernel stacks, etc. — from a transient race with the page
+ * allocator lifecycle (allocation/free in flight). The distinction
+ * matters for panic_on_unrecoverable_mf(): the former is a real
+ * unrecoverable kernel-owned poisoning, the latter must not panic since
+ * the page may be destined for userspace where SIGBUS recovery would
+ * otherwise apply.
+ */
+enum mf_get_page_status {
+ MF_GET_PAGE_OK = 0,
+ /*
+ * Transient lifecycle race with the page allocator. Recorded for
+ * symmetry and for future callers that may want to distinguish a
+ * race from an unhandlable kernel page; no in-tree caller acts on
+ * this value yet.
+ */
+ MF_GET_PAGE_RACE,
+ MF_GET_PAGE_UNHANDLABLE, /* stable kernel page hwpoison cannot handle */
+};
+
+static int get_any_page(struct page *p, unsigned long flags,
+ enum mf_get_page_status *status)
{
int ret = 0, pass = 0;
bool count_increased = false;
@@ -1406,11 +1431,15 @@ static int get_any_page(struct page *p, unsigned long flags)
if (pass++ < GET_PAGE_MAX_RETRY_NUM)
goto try_again;
ret = -EBUSY;
+ if (status)
+ *status = MF_GET_PAGE_RACE;
} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
/* We raced with put_page, retry. */
if (pass++ < GET_PAGE_MAX_RETRY_NUM)
goto try_again;
ret = -EIO;
+ if (status)
+ *status = MF_GET_PAGE_RACE;
}
goto out;
} else if (ret == -EBUSY) {
@@ -1423,6 +1452,8 @@ static int get_any_page(struct page *p, unsigned long flags)
goto try_again;
}
ret = -EIO;
+ if (status)
+ *status = MF_GET_PAGE_UNHANDLABLE;
goto out;
}
}
@@ -1442,6 +1473,8 @@ static int get_any_page(struct page *p, unsigned long flags)
}
put_page(p);
ret = -EIO;
+ if (status)
+ *status = MF_GET_PAGE_UNHANDLABLE;
}
out:
if (ret == -EIO)
@@ -1503,7 +1536,8 @@ static int __get_unpoison_page(struct page *page)
* operations like allocation and free,
* -EHWPOISON when the page is hwpoisoned and taken off from buddy.
*/
-static int get_hwpoison_page(struct page *p, unsigned long flags)
+static int get_hwpoison_page(struct page *p, unsigned long flags,
+ enum mf_get_page_status *status)
{
int ret;
@@ -1511,7 +1545,7 @@ static int get_hwpoison_page(struct page *p, unsigned long flags)
if (flags & MF_UNPOISON)
ret = __get_unpoison_page(p);
else
- ret = get_any_page(p, flags);
+ ret = get_any_page(p, flags, status);
zone_pcp_enable(page_zone(p));
return ret;
@@ -2349,6 +2383,7 @@ int memory_failure(unsigned long pfn, int flags)
bool retry = true;
int hugetlb = 0;
bool is_reserved;
+ enum mf_get_page_status gp_status = MF_GET_PAGE_OK;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -2424,7 +2459,7 @@ int memory_failure(unsigned long pfn, int flags)
*/
is_reserved = PageReserved(p);
- res = get_hwpoison_page(p, flags);
+ res = get_hwpoison_page(p, flags, &gp_status);
if (!res) {
if (is_free_buddy_page(p)) {
if (take_page_off_buddy(p)) {
@@ -2445,7 +2480,12 @@ int memory_failure(unsigned long pfn, int flags)
}
goto unlock_mutex;
} else if (res < 0) {
- if (is_reserved)
+ /*
+ * Promote a stable unhandlable kernel page diagnosed by
+ * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
+ * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
+ */
+ if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
else
res = action_result(pfn, MF_MSG_GET_HWPOISON,
@@ -2750,7 +2790,7 @@ int unpoison_memory(unsigned long pfn)
goto unlock_mutex;
}
- ghp = get_hwpoison_page(p, MF_UNPOISON);
+ ghp = get_hwpoison_page(p, MF_UNPOISON, NULL);
if (!ghp) {
if (folio_test_hugetlb(folio)) {
huge = true;
@@ -2957,7 +2997,7 @@ int soft_offline_page(unsigned long pfn, int flags)
retry:
get_online_mems();
- ret = get_hwpoison_page(page, flags | MF_SOFT_OFFLINE);
+ ret = get_hwpoison_page(page, flags | MF_SOFT_OFFLINE, NULL);
put_online_mems();
if (hwpoison_filter(page)) {
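
The optional status out-parameter threaded through get_any_page() and get_hwpoison_page() above can be modelled in userspace roughly as follows. This is only a sketch: the three MF_GET_PAGE_* names come from the patch, but their numeric values, the model_* helpers and the MODEL_MSG_* labels are hypothetical stand-ins and not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Names taken from the patch; the numeric values are an assumption. */
enum mf_get_page_status {
	MF_GET_PAGE_OK,
	MF_GET_PAGE_RACE,
	MF_GET_PAGE_UNHANDLABLE,
};

/* Hypothetical labels standing in for MF_MSG_KERNEL / MF_MSG_GET_HWPOISON. */
enum model_msg {
	MODEL_MSG_KERNEL,
	MODEL_MSG_GET_HWPOISON,
};

/*
 * Hypothetical stand-in for get_any_page(): it always fails here, and
 * records the reason only when the caller passed a non-NULL status.
 */
static int model_get_page(bool stable_kernel_page,
			  enum mf_get_page_status *status)
{
	if (status)
		*status = stable_kernel_page ? MF_GET_PAGE_UNHANDLABLE
					     : MF_GET_PAGE_RACE;
	return -EIO;
}

/* Mirrors the "res < 0" branch added to memory_failure() in the diff. */
static enum model_msg model_classify(bool is_reserved, bool stable_kernel_page)
{
	enum mf_get_page_status gp_status = MF_GET_PAGE_OK;
	int res = model_get_page(stable_kernel_page, &gp_status);

	assert(res < 0);
	if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
		return MODEL_MSG_KERNEL;
	return MODEL_MSG_GET_HWPOISON;
}
```

Callers that do not care about the reason, such as unpoison_memory() and soft_offline_page() in the diff, pass NULL, which is why every store to *status is guarded.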
* [PATCH v5 3/4] Documentation: document panic_on_unrecoverable_memory_failure sysctl
2026-04-24 12:23 [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-04-24 12:23 ` [PATCH v5 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
2026-04-24 12:24 ` [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
@ 2026-04-24 12:24 ` Breno Leitao
2026-04-24 12:48 ` Andrew Morton
2026-04-24 12:24 ` [PATCH v5 4/4] selftests/mm: regression test for panic_on_unrecoverable_memory_failure Breno Leitao
` (2 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Breno Leitao @ 2026-04-24 12:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
kernel-team
Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing the three categories of failures that trigger a
panic and noting which kernel page types are not yet covered.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Documentation/admin-guide/sysctl/vm.rst | 65 +++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c9..f118ec5cd1fad 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
- page-cluster
- page_lock_unfairness
- panic_on_oom
+- panic_on_unrecoverable_memory_failure
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
@@ -925,6 +926,70 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation. This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on three categories of
+unrecoverable failures: reserved kernel pages, non-buddy kernel pages
+with zero refcount (e.g. tail pages of high-order allocations), and
+pages whose state cannot be classified as recoverable.
+
+Note that some kernel page types — such as slab objects, vmalloc
+allocations, kernel stacks, and page tables — share a failure path
+with transient refcount races and are not currently covered by this
+option. I.e., the kernel does not panic when unsure of the page status.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+ regularly and post-mortem analysis of an unrelated downstream crash
+ (often seconds to minutes after the original error) consumes
+ significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+ hardware error produces a vmcore that still contains the faulting
+ address, the affected page state, and the originating MCE/GHES
+ record — context that is typically lost by the time a delayed crash
+ occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+ failure for failover, and prefer an immediate panic over silent data
+ corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+ tools such as ``mce-inject`` or error-injection debugfs interfaces,
+ where panicking on the unrecoverable path makes regressions
+ immediately visible instead of surfacing as later, unrelated
+ failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately. If the ``panic`` sysctl is also non-zero then the
+ machine will be rebooted.
+= =====================================================================
+
+Example::
+
+ echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
percpu_pagelist_high_fraction
=============================
--
2.52.0
* Re: [PATCH v5 3/4] Documentation: document panic_on_unrecoverable_memory_failure sysctl
2026-04-24 12:24 ` [PATCH v5 3/4] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
@ 2026-04-24 12:48 ` Andrew Morton
2026-05-06 15:38 ` Breno Leitao
0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2026-04-24 12:48 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kernel-team
On Fri, 24 Apr 2026 05:24:01 -0700 Breno Leitao <leitao@debian.org> wrote:
> Add documentation for the new vm.panic_on_unrecoverable_memory_failure
> sysctl, describing the three categories of failures that trigger a
> panic and noting which kernel page types are not yet covered.
>
>
> ...
>
> +When enabled, this sysctl triggers a panic on three categories of
> +unrecoverable failures: reserved kernel pages, non-buddy kernel pages
> +with zero refcount (e.g. tail pages of high-order allocations), and
> +pages whose state cannot be classified as recoverable.
Before someone asks, I wonder if we should make this a bitfield thing,
so people can select which of the above three should get the panic
treatment.
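
For illustration only, a bitfield variant of the sysctl might look like the sketch below. Nothing here is part of the posted series: the PANIC_MF_* flag names, their bit positions and the should_panic() helper are all hypothetical, chosen to map one bit to each of the three categories listed in the documentation patch.

```c
#include <stdbool.h>

/* Hypothetical flag names; one bit per documented failure category. */
enum panic_mf_category {
	PANIC_MF_RESERVED   = 1 << 0,	/* reserved kernel pages */
	PANIC_MF_HIGH_ORDER = 1 << 1,	/* non-buddy pages with zero refcount */
	PANIC_MF_UNKNOWN    = 1 << 2,	/* state not classifiable as recoverable */
};

#define PANIC_MF_ALL \
	(PANIC_MF_RESERVED | PANIC_MF_HIGH_ORDER | PANIC_MF_UNKNOWN)

/* Panic only if the administrator selected this category in the mask. */
static bool should_panic(unsigned int sysctl_mask, enum panic_mf_category cat)
{
	return (sysctl_mask & (unsigned int)cat) != 0;
}
```

Under this assumed layout, writing 7 would select all three categories and writing 0 would keep today's default of never panicking.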
* Re: [PATCH v5 3/4] Documentation: document panic_on_unrecoverable_memory_failure sysctl
2026-04-24 12:48 ` Andrew Morton
@ 2026-05-06 15:38 ` Breno Leitao
0 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-05-06 15:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Miaohe Lin, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kernel-team
On Fri, Apr 24, 2026 at 05:48:40AM -0700, Andrew Morton wrote:
> On Fri, 24 Apr 2026 05:24:01 -0700 Breno Leitao <leitao@debian.org> wrote:
>
> > Add documentation for the new vm.panic_on_unrecoverable_memory_failure
> > sysctl, describing the three categories of failures that trigger a
> > panic and noting which kernel page types are not yet covered.
> >
> >
> > ...
> >
> > +When enabled, this sysctl triggers a panic on three categories of
> > +unrecoverable failures: reserved kernel pages, non-buddy kernel pages
> > +with zero refcount (e.g. tail pages of high-order allocations), and
> > +pages whose state cannot be classified as recoverable.
>
> Before someone asks, I wonder if we should make this a bitfield thing,
> so people can select which of the above three should get the panic
> treatment.
That's an interesting idea, though I think the necessary infrastructure
doesn't exist yet. As discussed in this thread, even distinguishing
non-userspace pages from userspace pages presents non-trivial challenges.
Implementing a bitfield-based approach would require significant
groundwork. If we want to pursue that direction, it might make sense to
defer this patchset and focus on building that infrastructure first.
My preference would be to start with the coarse-grained approach
(current approach) and refine it incrementally based on actual needs.
* [PATCH v5 4/4] selftests/mm: regression test for panic_on_unrecoverable_memory_failure
2026-04-24 12:23 [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
` (2 preceding siblings ...)
2026-04-24 12:24 ` [PATCH v5 3/4] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
@ 2026-04-24 12:24 ` Breno Leitao
2026-04-28 2:22 ` Miaohe Lin
2026-04-24 13:19 ` [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Matthew Wilcox
2026-04-24 13:28 ` Andrew Morton
5 siblings, 1 reply; 18+ messages in thread
From: Breno Leitao @ 2026-04-24 12:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
kernel-team
Add a test that enables vm.panic_on_unrecoverable_memory_failure and
injects MADV_HWPOISON on a userspace anonymous page. The page must
still be recovered via SIGBUS — it must not trigger a kernel panic.
This is the regression test for the panic_on_unrecoverable_mf()
recheck: a buddy page being concurrently allocated to userspace can
briefly land on the MF_MSG_KERNEL_HIGH_ORDER branch (refcount 0, not in
buddy), and without the recheck the kernel would panic on what is
actually a recoverable userspace page.
Run in a forked child so the SIGBUS path is fully exercised; if the
kernel ever regresses and panics, the host VM dies and the harness
reports the binary as never returning, which is itself a clear
failure signal.
Skips when the sysctl is not present (feature not built in) or when
the test cannot write to it (insufficient privilege). Saves and
restores the original sysctl value.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/testing/selftests/mm/memory-failure.c | 84 +++++++++++++++++++++++++++++
1 file changed, 84 insertions(+)
diff --git a/tools/testing/selftests/mm/memory-failure.c b/tools/testing/selftests/mm/memory-failure.c
index 032ed952057c6..9cb8d694aee94 100644
--- a/tools/testing/selftests/mm/memory-failure.c
+++ b/tools/testing/selftests/mm/memory-failure.c
@@ -17,9 +17,13 @@
#include <sys/vfs.h>
#include <linux/magic.h>
#include <errno.h>
+#include <sys/wait.h>
+#include <stdlib.h>
#include "vm_util.h"
+#define PANIC_SYSCTL "/proc/sys/vm/panic_on_unrecoverable_memory_failure"
+
enum inject_type {
MADV_HARD,
MADV_SOFT,
@@ -355,4 +359,84 @@ TEST_F(memory_failure, dirty_pagecache)
ASSERT_EQ(close(fd), 0);
}
+static int read_sysctl_int(const char *path, int *out)
+{
+ char buf[16];
+ int fd, n;
+
+ fd = open(path, O_RDONLY);
+ if (fd < 0)
+ return -1;
+ n = read(fd, buf, sizeof(buf) - 1);
+ close(fd);
+ if (n <= 0)
+ return -1;
+ buf[n] = '\0';
+ *out = atoi(buf);
+ return 0;
+}
+
+static int write_sysctl_int(const char *path, int val)
+{
+ char buf[16];
+ int fd, len, ret = 0;
+
+ fd = open(path, O_WRONLY);
+ if (fd < 0)
+ return -1;
+ len = snprintf(buf, sizeof(buf), "%d\n", val);
+ if (write(fd, buf, len) != len)
+ ret = -1;
+ close(fd);
+ return ret;
+}
+
+/*
+ * Regression test for vm.panic_on_unrecoverable_memory_failure.
+ *
+ * With the sysctl on, hwpoison injection on a userspace anonymous page
+ * must still be recovered via SIGBUS — it must not trigger a kernel
+ * panic. This guards the panic_on_unrecoverable_mf() recheck that rules
+ * out concurrent buddy allocations being misclassified as unrecoverable
+ * kernel pages (MF_MSG_KERNEL_HIGH_ORDER).
+ *
+ * If the kernel regresses and panics, the host VM dies and the test
+ * harness will report the binary as never having returned — which is
+ * itself a clear failure signal.
+ */
+TEST(panic_on_unrecoverable_user_page)
+{
+ unsigned long page_size;
+ int saved, status;
+ void *addr;
+ pid_t pid;
+
+ if (read_sysctl_int(PANIC_SYSCTL, &saved))
+ SKIP(return, "%s not available\n", PANIC_SYSCTL);
+ if (write_sysctl_int(PANIC_SYSCTL, 1))
+ SKIP(return, "cannot enable %s (need root?)\n", PANIC_SYSCTL);
+
+ page_size = sysconf(_SC_PAGESIZE);
+
+ pid = fork();
+ ASSERT_NE(pid, -1);
+ if (pid == 0) {
+ addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (addr == MAP_FAILED)
+ _exit(1);
+ *(volatile char *)addr = 1;
+ if (madvise(addr, page_size, MADV_HWPOISON))
+ _exit(2);
+ FORCE_READ(*(volatile char *)addr);
+ _exit(0); /* unreachable: SIGBUS expected */
+ }
+
+ ASSERT_EQ(waitpid(pid, &status, 0), pid);
+ write_sysctl_int(PANIC_SYSCTL, saved);
+
+ ASSERT_TRUE(WIFSIGNALED(status));
+ ASSERT_EQ(WTERMSIG(status), SIGBUS);
+}
+
TEST_HARNESS_MAIN
--
2.52.0
* Re: [PATCH v5 4/4] selftests/mm: regression test for panic_on_unrecoverable_memory_failure
2026-04-24 12:24 ` [PATCH v5 4/4] selftests/mm: regression test for panic_on_unrecoverable_memory_failure Breno Leitao
@ 2026-04-28 2:22 ` Miaohe Lin
0 siblings, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2026-04-28 2:22 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, kernel-team,
Naoya Horiguchi, Andrew Morton, Jonathan Corbet, Shuah Khan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan
On 2026/4/24 20:24, Breno Leitao wrote:
> Add a test that enables vm.panic_on_unrecoverable_memory_failure and
> injects MADV_HWPOISON on a userspace anonymous page. The page must
> still be recovered via SIGBUS — it must not trigger a kernel panic.
>
> This is the regression test for the panic_on_unrecoverable_mf()
> recheck: a buddy page being concurrently allocated to userspace can
> briefly land on the MF_MSG_KERNEL_HIGH_ORDER branch (refcount 0, not in
> buddy), and without the recheck the kernel would panic on what is
> actually a recoverable userspace page.
>
> Run in a forked child so the SIGBUS path is fully exercised; if the
> kernel ever regresses and panics, the host VM dies and the harness
> reports the binary as never returning, which is itself a clear
> failure signal.
>
> Skips when the sysctl is not present (feature not built in) or when
> the test cannot write to it (insufficient privilege). Saves and
> restores the original sysctl value.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Thanks for adding a selftest. Some comments below.
> ---
> tools/testing/selftests/mm/memory-failure.c | 84 +++++++++++++++++++++++++++++
> 1 file changed, 84 insertions(+)
>
> diff --git a/tools/testing/selftests/mm/memory-failure.c b/tools/testing/selftests/mm/memory-failure.c
> index 032ed952057c6..9cb8d694aee94 100644
> --- a/tools/testing/selftests/mm/memory-failure.c
> +++ b/tools/testing/selftests/mm/memory-failure.c
> @@ -17,9 +17,13 @@
> #include <sys/vfs.h>
> #include <linux/magic.h>
> #include <errno.h>
> +#include <sys/wait.h>
> +#include <stdlib.h>
>
> #include "vm_util.h"
>
> +#define PANIC_SYSCTL "/proc/sys/vm/panic_on_unrecoverable_memory_failure"
> +
> enum inject_type {
> MADV_HARD,
> MADV_SOFT,
> @@ -355,4 +359,84 @@ TEST_F(memory_failure, dirty_pagecache)
> ASSERT_EQ(close(fd), 0);
> }
>
> +static int read_sysctl_int(const char *path, int *out)
> +{
> + char buf[16];
> + int fd, n;
> +
> + fd = open(path, O_RDONLY);
> + if (fd < 0)
> + return -1;
> + n = read(fd, buf, sizeof(buf) - 1);
> + close(fd);
> + if (n <= 0)
> + return -1;
> + buf[n] = '\0';
> + *out = atoi(buf);
> + return 0;
> +}
> +
> +static int write_sysctl_int(const char *path, int val)
> +{
> + char buf[16];
> + int fd, len, ret = 0;
> +
> + fd = open(path, O_WRONLY);
> + if (fd < 0)
> + return -1;
> + len = snprintf(buf, sizeof(buf), "%d\n", val);
> + if (write(fd, buf, len) != len)
> + ret = -1;
> + close(fd);
> + return ret;
> +}
There are write_sysfs and read_sysfs in vm_util.c. Can we reuse those?
> +
> +/*
> + * Regression test for vm.panic_on_unrecoverable_memory_failure.
> + *
> + * With the sysctl on, hwpoison injection on a userspace anonymous page
> + * must still be recovered via SIGBUS — it must not trigger a kernel
> + * panic. This guards the panic_on_unrecoverable_mf() recheck that rules
> + * out concurrent buddy allocations being misclassified as unrecoverable
> + * kernel pages (MF_MSG_KERNEL_HIGH_ORDER).
> + *
> + * If the kernel regresses and panics, the host VM dies and the test
> + * harness will report the binary as never having returned — which is
> + * itself a clear failure signal.
> + */
> +TEST(panic_on_unrecoverable_user_page)
> +{
> + unsigned long page_size;
> + int saved, status;
> + void *addr;
> + pid_t pid;
> +
> + if (read_sysctl_int(PANIC_SYSCTL, &saved))
> + SKIP(return, "%s not available\n", PANIC_SYSCTL);
> + if (write_sysctl_int(PANIC_SYSCTL, 1))
> + SKIP(return, "cannot enable %s (need root?)\n", PANIC_SYSCTL);
> +
> + page_size = sysconf(_SC_PAGESIZE);
> +
> + pid = fork();
> + ASSERT_NE(pid, -1);
> + if (pid == 0) {
> + addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
> + MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> + if (addr == MAP_FAILED)
> + _exit(1);
> + *(volatile char *)addr = 1;
> + if (madvise(addr, page_size, MADV_HWPOISON))
> + _exit(2);
> + FORCE_READ(*(volatile char *)addr);
> + _exit(0); /* unreachable: SIGBUS expected */
> + }
> +
> + ASSERT_EQ(waitpid(pid, &status, 0), pid);
> + write_sysctl_int(PANIC_SYSCTL, saved);
> +
> + ASSERT_TRUE(WIFSIGNALED(status));
> + ASSERT_EQ(WTERMSIG(status), SIGBUS);
> +}
Could you restructure this test to use a format similar to the other tests in this file,
e.g. TEST_F(memory_failure, anon)? It would be good to keep them in the same style.
Thanks.
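
The fork-and-wait pattern the test relies on can be sketched in isolation as below. This is a simplified illustration, not the selftest itself: run_in_child() and the two callbacks are hypothetical helpers, and raise(SIGBUS) stands in for the access to a poisoned page so that the sketch needs neither MADV_HWPOISON nor root.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run fn() in a forked child. Returns the number of the signal that
 * killed the child, 0 if it exited normally, or -1 if fork() or
 * waitpid() failed.
 */
static int run_in_child(void (*fn)(void))
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		fn();
		_exit(0);	/* reached only if fn() was not fatal */
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for touching a poisoned page: kill ourselves with SIGBUS. */
static void fake_poisoned_access(void)
{
	signal(SIGBUS, SIG_DFL);	/* make sure the default action applies */
	raise(SIGBUS);
}

/* Control case: a child that returns without any signal. */
static void do_nothing(void)
{
}
```

Keeping the fatal access in the child means the parent survives to restore the saved sysctl value and to report the result.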
* Re: [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-24 12:23 [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
` (3 preceding siblings ...)
2026-04-24 12:24 ` [PATCH v5 4/4] selftests/mm: regression test for panic_on_unrecoverable_memory_failure Breno Leitao
@ 2026-04-24 13:19 ` Matthew Wilcox
2026-04-24 14:39 ` Breno Leitao
2026-04-24 13:28 ` Andrew Morton
5 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2026-04-24 13:19 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kernel-team
On Fri, Apr 24, 2026 at 05:23:58AM -0700, Breno Leitao wrote:
> This is a common problem on large fleets. We frequently observe multi-bit ECC
> errors hitting kernel slab pages, where memory_failure() fails to recover them
> and the system crashes later at an unrelated code path, making root cause
> analysis unnecessarily difficult.
Who is "we"? Please attribute your patches to your employer by putting
their name in brackets after yours. My ~/.gitconfig has:
[user]
name = Matthew Wilcox (Oracle)
* Re: [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-24 13:19 ` [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Matthew Wilcox
@ 2026-04-24 14:39 ` Breno Leitao
0 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-04-24 14:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kernel-team
On Fri, Apr 24, 2026 at 02:19:13PM +0100, Matthew Wilcox wrote:
> On Fri, Apr 24, 2026 at 05:23:58AM -0700, Breno Leitao wrote:
> > This is a common problem on large fleets. We frequently observe multi-bit ECC
> > errors hitting kernel slab pages, where memory_failure() fails to recover them
> > and the system crashes later at an unrelated code path, making root cause
> > analysis unnecessarily difficult.
>
> Who is "we"? Please attribute your patches to your employer by putting
> their name in brackets after yours. My ~/.gitconfig has:
>
> [user]
> name = Matthew Wilcox (Oracle)
Thank you for the feedback. You're right: I should have avoided
"we" entirely, since it lacks proper attribution. This is something
I usually catch before submitting patches, but this one slipped through.
I'll update the commit message when I respin the series.
* Re: [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages
2026-04-24 12:23 [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
` (4 preceding siblings ...)
2026-04-24 13:19 ` [PATCH v5 0/4] mm/memory-failure: add panic option for unrecoverable pages Matthew Wilcox
@ 2026-04-24 13:28 ` Andrew Morton
5 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2026-04-24 13:28 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, linux-mm, linux-kernel, linux-doc, linux-kselftest,
kernel-team
On Fri, 24 Apr 2026 05:23:58 -0700 Breno Leitao <leitao@debian.org> wrote:
> When the memory failure handler encounters an in-use kernel page that it
> cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> currently logs the error as "Ignored" and continues operation.
>
> This leaves corrupted data accessible to the kernel, which will inevitably
> cause either silent data corruption or a delayed crash when the poisoned memory
> is next accessed.
>
> This is a common problem on large fleets. We frequently observe multi-bit ECC
> errors hitting kernel slab pages, where memory_failure() fails to recover them
> and the system crashes later at an unrelated code path, making root cause
> analysis unnecessarily difficult.
>
> Here is one specific example from production on an arm64 server: a multi-bit
> ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> (slab pages are not supported by the hwpoison recovery mechanism), and 67
> seconds later d_lookup() accessed the poisoned cache line causing
> a synchronous external abort:
>
> [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
> [88690.498473] Memory failure: 0x40272d: unhandlable page.
> [88690.498619] Memory failure: 0x40272d: recovery action for
> get hwpoison page: Ignored
> ...
> [88757.847126] Internal error: synchronous external abort:
> 0000000096000410 [#1] SMP
> [88758.061075] pc : d_lookup+0x5c/0x220
>
> This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> (default 0) that, when enabled, panics immediately on unrecoverable
> memory failures. This provides a clean crash dump at the time of the
> error, which is far more useful for diagnosis than a random crash later
> at an unrelated code path.
Sashiko is asking things:
https://sashiko.dev/#/patchset/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org