public inbox for linux-mm@kvack.org
* [PATCH 0/2] mm/memory-failure: add panic option for unrecoverable pages
@ 2026-03-23 15:29 Breno Leitao
  2026-03-23 15:29 ` [PATCH 1/2] mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl Breno Leitao
  2026-03-23 15:29 ` [PATCH 2/2] Documentation: document " Breno Leitao
  0 siblings, 2 replies; 7+ messages in thread
From: Breno Leitao @ 2026-03-23 15:29 UTC
  To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan
  Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team

When the memory failure handler encounters an in-use kernel page that it
cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
currently logs the error as "Ignored" and continues operation.

This leaves corrupted data accessible to the kernel, which will inevitably
cause either silent data corruption or a delayed crash when the poisoned memory
is next accessed.

This is a common problem on large fleets. We frequently observe multi-bit ECC
errors hitting kernel slab pages, where memory_failure() fails to recover them
and the system crashes later at an unrelated code path, making root cause
analysis unnecessarily difficult.

Here is one specific example from production on an arm64 server: a multi-bit
ECC error hit a dentry cache slab page, memory_failure() failed to recover it
(slab pages are not supported by the hwpoison recovery mechanism), and 67
seconds later d_lookup() accessed the poisoned cache line causing a synchronous
external abort:

    [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
    [88690.498473] Memory failure: 0x40272d: unhandlable page.
    [88690.498619] Memory failure: 0x40272d: recovery action for
                   get hwpoison page: Ignored
    ...
    [88757.847126] Internal error: synchronous external abort:
                   0000000096000410 [#1] SMP
    [88758.061075] pc : d_lookup+0x5c/0x220
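
The gap between the failed recovery and the abort can be checked from the
two dmesg timestamps in the log above. The helpers below are a userspace
sketch for that arithmetic only; the names dmesg_timestamp() and
dmesg_delay() are hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the leading "[seconds.microseconds]" timestamp of a dmesg line.
 * Returns 0 on success, -1 if the line does not start with a timestamp.
 */
static int dmesg_timestamp(const char *line, double *seconds)
{
	assert(line && seconds);
	return sscanf(line, " [%lf", seconds) == 1 ? 0 : -1;
}

/*
 * Store the seconds elapsed between two dmesg lines in *delay.
 * Returns 0 on success, -1 if either line lacks a timestamp.
 */
static int dmesg_delay(const char *first, const char *second, double *delay)
{
	double t_first, t_second;

	assert(delay);
	if (dmesg_timestamp(first, &t_first) ||
	    dmesg_timestamp(second, &t_second))
		return -1;

	*delay = t_second - t_first;
	return 0;
}
```

Fed the "unhandlable page" line and the "Internal error" line, the delay
comes out at about 67.3 seconds, consistent with the 67 seconds quoted.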

This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
(default 0) that, when enabled, panics immediately on unrecoverable
memory failures. This provides a clean crash dump at the time of the
error, which is far more useful for diagnosis than a random crash later
at an unrelated code path.
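
The intended behaviour can be modelled in userspace as follows. This is
only a sketch of the policy described in this cover letter, not the code
in patch 1/2: the enum names, page_is_unrecoverable() and decide_action()
are all hypothetical, and which page types the real handler can recover
is outside the scope of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page classification for this sketch; not kernel types. */
enum page_kind {
	PAGE_SLAB,
	PAGE_PAGE_TABLE,
	PAGE_KERNEL_STACK,
	PAGE_VMALLOC,
	PAGE_RECOVERABLE,	/* stand-in for any page the handler can recover */
};

/* Hypothetical outcomes of handling one memory failure. */
enum failure_action {
	ACTION_RECOVER,		/* handler recovers the page */
	ACTION_IGNORE,		/* current behaviour: log "Ignored", carry on */
	ACTION_PANIC,		/* proposed behaviour when the sysctl is set */
};

/* In-use kernel pages listed in the cover letter count as unrecoverable. */
static bool page_is_unrecoverable(enum page_kind kind)
{
	assert(kind <= PAGE_RECOVERABLE);	/* reject values outside the enum */
	return kind != PAGE_RECOVERABLE;
}

/*
 * Model of the policy: panic_sysctl stands in for the value of
 * vm.panic_on_unrecoverable_memory_failure (default 0).
 */
static enum failure_action decide_action(enum page_kind kind, int panic_sysctl)
{
	if (!page_is_unrecoverable(kind))
		return ACTION_RECOVER;

	return panic_sysctl ? ACTION_PANIC : ACTION_IGNORE;
}
```

In this model a slab page with the sysctl at 0 yields ACTION_IGNORE, which
matches the "Ignored" line in the log above, and the same page with the
sysctl enabled yields ACTION_PANIC at the time of the error.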

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
      mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl
      Documentation: document panic_on_unrecoverable_memory_failure sysctl

 Documentation/admin-guide/sysctl/vm.rst | 27 +++++++++++++++++++++++++++
 mm/memory-failure.c                     | 15 +++++++++++++++
 2 files changed, 42 insertions(+)
---
base-commit: 63f5f5ffdf63d9c75a438c92be58177744b4c69c
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
--  
Breno Leitao <leitao@debian.org>





Thread overview: 7+ messages
2026-03-23 15:29 [PATCH 0/2] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-03-23 15:29 ` [PATCH 1/2] mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl Breno Leitao
2026-03-23 15:29 ` [PATCH 2/2] Documentation: document " Breno Leitao
2026-03-23 16:51   ` Randy Dunlap
2026-03-24 10:09     ` Breno Leitao
2026-03-24 11:48       ` Akira Yokosawa
2026-03-24 16:27         ` Randy Dunlap
