From: Breno Leitao <leitao@debian.org>
To: Miaohe Lin <linmiaohe@huawei.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Shuah Khan <shuah@kernel.org>,
Naoya Horiguchi <nao.horiguchi@gmail.com>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
"Liam R. Howlett" <liam@infradead.org>,
"Liam R. Howlett" <liam@infradead.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
Breno Leitao <leitao@debian.org>,
linux-trace-kernel@vger.kernel.org, kernel-team@meta.com,
Lance Yang <lance.yang@linux.dev>
Subject: [PATCH v7 0/6] mm/memory-failure: add panic option for unrecoverable pages
Date: Wed, 13 May 2026 08:39:31 -0700 [thread overview]
Message-ID: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org> (raw)
A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running. The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash. In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.
This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context. The default is disabled so existing workloads see no
change.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
get_any_page() and surface it via -ENOTRECOVERABLE, per David
Hildenbrand's and Lance Yang's review of v6. This drops the
is_reserved snapshot in memory_failure() and the mf_get_page_status
enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
matching tracepoint string; the enum now covers both PG_reserved
pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
and 2/4 ("classify get_any_page() failures by reason") into a
single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
(David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org
Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org
Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
unrecoverable kernel page hwpoison events (reserved pages, refcount-0
non-buddy pages, unknown state), with a recheck to avoid racing with
concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
and add a selftest verifying SIGBUS recovery on userspace pages still
works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org
Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
get_hwpoison_page() and is_free_buddy_page()) cannot cause false
positives: PG_hwpoison is set beforehand and check_new_page() in the
page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
panic conditions (shared path with transient races and non-reserved
kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org
Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
how the retry mechanism mitigates false positives from buddy allocator
races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org
Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Breno Leitao (6):
mm/memory-failure: drop dead error_states[] entry for reserved pages
mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
mm/memory-failure: add panic option for unrecoverable pages
Documentation: document panic_on_unrecoverable_memory_failure sysctl
Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++
mm/memory-failure.c | 85 +++++++++++++++++++++++++--------
2 files changed, 146 insertions(+), 19 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260323-ecc_panic-4e473b83087c
Best regards,
--
Breno Leitao <leitao@debian.org>
next reply other threads:[~2026-05-13 15:39 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-13 15:39 Breno Leitao [this message]
2026-05-13 15:39 ` [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages Breno Leitao
2026-05-13 20:10 ` David Hildenbrand (Arm)
2026-05-14 10:55 ` Breno Leitao
2026-05-14 9:12 ` Lance Yang
2026-05-15 2:48 ` Miaohe Lin
2026-05-13 15:39 ` [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE Breno Leitao
2026-05-14 13:28 ` Lance Yang
2026-05-14 14:37 ` Breno Leitao
2026-05-15 7:03 ` Lance Yang
2026-05-15 3:04 ` Miaohe Lin
2026-05-13 15:39 ` [PATCH v7 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages Breno Leitao
2026-05-13 15:39 ` [PATCH v7 4/6] mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page() Breno Leitao
2026-05-13 19:49 ` David Hildenbrand (Arm)
2026-05-14 11:06 ` Breno Leitao
2026-05-13 15:39 ` [PATCH v7 5/6] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-05-13 15:39 ` [PATCH v7 6/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260513-ecc_panic-v7-0-be2e578e61da@debian.org \
--to=leitao@debian.org \
--cc=akpm@linux-foundation.org \
--cc=corbet@lwn.net \
--cc=david@kernel.org \
--cc=kernel-team@meta.com \
--cc=lance.yang@linux.dev \
--cc=liam@infradead.org \
--cc=linmiaohe@huawei.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=ljs@kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=mhocko@suse.com \
--cc=nao.horiguchi@gmail.com \
--cc=rostedt@goodmis.org \
--cc=rppt@kernel.org \
--cc=shuah@kernel.org \
--cc=skhan@linuxfoundation.org \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.