* [PATCH v2 0/3] mm/memory-failure: add panic option for unrecoverable pages
@ 2026-03-31 11:00 Breno Leitao
2026-03-31 11:00 ` [PATCH v2 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Breno Leitao @ 2026-03-31 11:00 UTC
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team
When the memory failure handler encounters an in-use kernel page that it
cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
currently logs the error as "Ignored" and continues operation.
This leaves corrupted data accessible to the kernel, which leads to either
silent data corruption or a delayed crash when the poisoned memory is next
accessed.
This is a common problem on large fleets. We frequently observe multi-bit ECC
errors hitting kernel slab pages, where memory_failure() fails to recover them
and the system crashes later at an unrelated code path, making root cause
analysis unnecessarily difficult.
Here is one specific example from production on an arm64 server: a multi-bit
ECC error hit a dentry cache slab page, memory_failure() failed to recover it
(slab pages are not supported by the hwpoison recovery mechanism), and 67
seconds later d_lookup() accessed the poisoned cache line, causing a
synchronous external abort:
    [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
    [88690.498473] Memory failure: 0x40272d: unhandlable page.
    [88690.498619] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
    ...
    [88757.847126] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
    [88758.061075] pc : d_lookup+0x5c/0x220
This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
(default 0) that, when enabled, panics immediately on unrecoverable
memory failures. This provides a clean crash dump at the time of the
error, which is far more useful for diagnosis than a random crash later
at an unrelated code path.
The series also reports reserved pages as MF_MSG_KERNEL and panics on unknown
page types (MF_MSG_UNKNOWN), so the panic covers every case reported as
MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER or MF_MSG_UNKNOWN.
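To make the covered cases concrete, the following user-space C sketch models
the decision this series adds. The enum names mirror those used in the
patches, but the definitions here are simplified stand-ins, and
classify_get_page_failure() and mf_should_panic() are hypothetical helper
names: the patches open-code both checks in memory_failure() and
action_result().

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel enums; only the names used here. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
};

/*
 * Patch 1: when get_hwpoison_page() fails, a reserved page is reported as a
 * kernel page. `page_reserved` stands in for PageReserved(p).
 */
enum mf_action_page_type classify_get_page_failure(bool page_reserved)
{
	return page_reserved ? MF_MSG_KERNEL : MF_MSG_GET_HWPOISON;
}

/*
 * Patch 2: panic only if the sysctl is set, the result is MF_IGNORED, and
 * the page type is one of the three types the series treats as
 * unrecoverable.
 */
bool mf_should_panic(bool sysctl_enabled, enum mf_action_page_type type,
		     enum mf_result result)
{
	if (!sysctl_enabled || result != MF_IGNORED)
		return false;

	return type == MF_MSG_KERNEL || type == MF_MSG_KERNEL_HIGH_ORDER ||
	       type == MF_MSG_UNKNOWN;
}
```

Under this model, a failed get_hwpoison_page() on a reserved page panics when
the sysctl is 1, while any other get_hwpoison_page() failure is still only
logged as "Ignored".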
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
---
Breno Leitao (3):
mm/memory-failure: report MF_MSG_KERNEL for reserved pages
mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl
Documentation: document panic_on_unrecoverable_memory_failure sysctl
Documentation/admin-guide/sysctl/vm.rst | 27 +++++++++++++++++++++++++++
mm/memory-failure.c | 22 +++++++++++++++++++++-
2 files changed, 48 insertions(+), 1 deletion(-)
---
base-commit: c369299895a591d96745d6492d4888259b004a9e
change-id: 20260323-ecc_panic-4e473b83087c
Best regards,
--
Breno Leitao <leitao@debian.org>
* [PATCH v2 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: Breno Leitao @ 2026-03-31 11:00 UTC
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team
When get_hwpoison_page() returns a negative value, distinguish
reserved pages from other failure cases by reporting MF_MSG_KERNEL
instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
and should be classified accordingly for proper handling by the
panic_on_unrecoverable_memory_failure mechanism.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..6ff80e01b91a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2432,7 +2432,11 @@ int memory_failure(unsigned long pfn, int flags)
 			}
 			goto unlock_mutex;
 		} else if (res < 0) {
-			res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
+			if (PageReserved(p))
+				res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+			else
+				res = action_result(pfn, MF_MSG_GET_HWPOISON,
+						    MF_IGNORED);
 			goto unlock_mutex;
 		}
--
2.52.0
* [PATCH v2 2/3] mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-03-31 11:00 UTC
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team
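Add the vm.panic_on_unrecoverable_memory_failure sysctl (default 0). When it
is set to 1, action_result() panics on an MF_IGNORED result for
MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER or MF_MSG_UNKNOWN, that is, for
kernel pages, high-order kernel pages, and unknown page types that cannot
be recovered.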
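For illustration, a user-space program could enable the option by writing to
the procfs file that this patch registers. The sketch below assumes the entry
appears under /proc/sys/vm, in line with the vm. prefix used in the cover
letter. write_sysctl_bool() is a hypothetical helper, not part of the series,
and writing the real file requires root on a kernel that carries this patch.

```c
#include <stdio.h>

/* Assumed location of the new entry; present only with this patch applied. */
#define PANIC_ON_UNRECOVERABLE_MF \
	"/proc/sys/vm/panic_on_unrecoverable_memory_failure"

/*
 * Write "0" or "1" to a sysctl file. The path is a parameter so the helper
 * can be exercised against an ordinary file. Returns 0 on success, -1 on
 * error.
 */
int write_sysctl_bool(const char *path, int enable)
{
	FILE *f;
	int ok;

	if (!path)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;

	/* fclose() flushes the buffer; a failed flush means the write failed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling write_sysctl_bool(PANIC_ON_UNRECOVERABLE_MF, 1) as root would have the
same effect as running "sysctl -w vm.panic_on_unrecoverable_memory_failure=1".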
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6ff80e01b91a..d0d911c54ff1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler = proc_dointvec_minmax,
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_ONE,
+	},
+	{
+		.procname = "panic_on_unrecoverable_memory_failure",
+		.data = &sysctl_panic_on_unrecoverable_mf,
+		.maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE,
 	}
 };
@@ -1298,6 +1309,11 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
+	if (sysctl_panic_on_unrecoverable_mf && result == MF_IGNORED &&
+	    (type == MF_MSG_KERNEL || type == MF_MSG_KERNEL_HIGH_ORDER ||
+	     type == MF_MSG_UNKNOWN))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
--
2.52.0
* [PATCH v2 3/3] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-03-31 11:00 UTC
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan
Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team
Document the new vm.panic_on_unrecoverable_memory_failure sysctl in the
admin guide, following the same format as panic_on_unrecovered_nmi.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Documentation/admin-guide/sysctl/vm.rst | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c..a811f503bca6 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
- page-cluster
- page_lock_unfairness
- panic_on_oom
+- panic_on_unrecoverable_memory_failure
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
@@ -925,6 +926,32 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits an in-use kernel
+page that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation. This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+Pages that reach this path include slab objects (dentry cache, inode
+cache, etc.), page tables, kernel stacks, and other kernel allocations
+that lack the reverse mapping needed to isolate all references.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately. If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+
percpu_pagelist_high_fraction
=============================
--
2.52.0