From: Breno Leitao <leitao@debian.org>
To: Miaohe Lin <linmiaohe@huawei.com>,
	 Naoya Horiguchi <nao.horiguchi@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,  Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	 Breno Leitao <leitao@debian.org>,
	kernel-team@meta.com
Subject: [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages
Date: Fri, 24 Apr 2026 05:24:00 -0700
Message-ID: <20260424-ecc_panic-v5-2-a35f4b50425c@debian.org>
In-Reply-To: <20260424-ecc_panic-v5-0-a35f4b50425c@debian.org>

Add a sysctl panic_on_unrecoverable_memory_failure that triggers a
kernel panic when memory_failure() encounters pages that cannot be
recovered. This provides a clean crash with useful debug information
rather than allowing silent data corruption or a delayed crash at an
unrelated code path.

The panic is triggered for three categories of unrecoverable failures,
all requiring result == MF_IGNORED:

- MF_MSG_KERNEL: reserved pages identified via PageReserved.

- MF_MSG_KERNEL_HIGH_ORDER: pages that get_hwpoison_page() observed
  with refcount 0 but that are not in the buddy allocator (e.g. tail
  pages of a high-order kernel allocation). A buddy page that is being
  concurrently allocated to userspace can briefly land on this branch
  too: its refcount is still 0 inside the allocator, yet it is no
  longer on the buddy free list. Panicking on such a page would defeat
  the standard SIGBUS recovery path. The page allocator cannot reliably
  reject hwpoisoned buddy pages either: check_new_pages() is gated by
  is_check_pages_enabled() and is a no-op when CONFIG_DEBUG_VM=n.

  Rule out the race inside panic_on_unrecoverable_mf(): yield with
  cpu_relax() so a concurrent allocator on another CPU can finish
  prep_new_page() and have its writes become visible, then re-check.
  A genuine high-order kernel tail page stays unowned (refcount 0,
  no LRU, no mapping, not in buddy); an in-flight allocation will
  have bumped the refcount, attached a mapping, or placed the page
  on an LRU by then. Only panic if the recheck still observes a
  fully unowned page. The window is narrowed, not eliminated, but
  is far below any allocator path's cost.

- MF_MSG_UNKNOWN: pages that do not match any known recoverable state
  in error_states[]. A theoretical false positive from concurrent LRU
  isolation is mitigated by identify_page_state()'s two-pass design,
  which rechecks using the saved page_flags.

MF_MSG_GET_HWPOISON is intentionally excluded: it covers both
non-reserved kernel memory (SLAB/SLUB, vmalloc, kernel stacks, page
tables) and transient refcount races, so panicking would risk false
positives.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
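
Editorial note, not part of the patch: the decision described in the
commit message can be modelled in userspace roughly as below. This is a
minimal sketch under stated assumptions, not the kernel implementation.
The names struct page_snapshot, fully_unowned(), should_panic() and
should_panic_examples() are hypothetical, the enums are simplified
stand-ins for the kernel ones, and the snapshot replaces the live
page_count()/PageLRU()/page_mapped()/mapping/is_free_buddy_page() reads.
It does not model cpu_relax() or the race itself, only the
classification applied after the recheck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel enums; values are illustrative. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_BUDDY,
	MF_MSG_UNKNOWN,
};

/* Hypothetical snapshot of the page state sampled by the recheck. */
struct page_snapshot {
	int refcount;
	bool on_lru;
	bool mapped;
	bool has_mapping;
	bool in_buddy;
};

/* True only if nothing has claimed the page and it is not a free buddy page. */
static bool fully_unowned(const struct page_snapshot *s)
{
	return s->refcount == 0 && !s->on_lru && !s->mapped &&
	       !s->has_mapping && !s->in_buddy;
}

/* Model of the panic decision; recheck may be NULL (no online page). */
bool should_panic(bool sysctl_enabled, enum mf_action_page_type type,
		  enum mf_result result, const struct page_snapshot *recheck)
{
	if (!sysctl_enabled || result != MF_IGNORED)
		return false;

	switch (type) {
	case MF_MSG_KERNEL:
	case MF_MSG_UNKNOWN:
		return true;
	case MF_MSG_KERNEL_HIGH_ORDER:
		/* Mirrors the patch: without an online page, panic directly. */
		if (!recheck)
			return true;
		return fully_unowned(recheck);
	default:
		/* Includes MF_MSG_GET_HWPOISON, which is excluded on purpose. */
		return false;
	}
}

/* Usage examples for the model above. */
void should_panic_examples(void)
{
	struct page_snapshot tail = { 0 };                  /* still unowned */
	struct page_snapshot allocated = { .refcount = 1 }; /* allocation won */

	assert(should_panic(true, MF_MSG_KERNEL, MF_IGNORED, NULL));
	assert(should_panic(true, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED, &tail));
	assert(!should_panic(true, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED,
			     &allocated));
	assert(!should_panic(true, MF_MSG_GET_HWPOISON, MF_IGNORED, NULL));
	assert(!should_panic(false, MF_MSG_KERNEL, MF_IGNORED, NULL));
	assert(!should_panic(true, MF_MSG_UNKNOWN, MF_FAILED, NULL));
}
```

In this model a page that gained a reference between the two samples is
left to the normal recovery path, while a page that is still fully
unowned leads to a panic, which matches the behaviour the commit
message describes for MF_MSG_KERNEL_HIGH_ORDER.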

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7b67e43dafbd1..fd1aed1af94a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1281,6 +1292,75 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+/*
+ * Determine whether to panic on an unrecoverable memory failure.
+ *
+ * Panics on three categories of failures (all requiring result == MF_IGNORED):
+ *
+ * - MF_MSG_KERNEL: Reserved pages (PageReserved) that belong to the kernel.
+ *
+ * - MF_MSG_KERNEL_HIGH_ORDER: Pages that get_hwpoison_page() observed with
+ *   refcount 0 but that are not in the buddy allocator (e.g. tail pages of
+ *   a high-order kernel allocation). A buddy page being concurrently
+ *   allocated could also reach this branch — its refcount is briefly 0
+ *   inside the allocator and it is no longer on the buddy free list — and
+ *   such a page may be destined for userspace, where the standard hwpoison
+ *   path would recover it via SIGBUS. The page allocator cannot reject
+ *   hwpoisoned buddy pages reliably either: check_new_pages() is gated by
+ *   is_check_pages_enabled() and is a no-op when CONFIG_DEBUG_VM=n. The
+ *   recheck below rules out this race before panicking.
+ *
+ * - MF_MSG_UNKNOWN: Pages that reached identify_page_state() but matched no
+ *   recoverable state in error_states[]. A theoretical false positive from
+ *   concurrent LRU isolation is mitigated by identify_page_state()'s
+ *   two-pass design which rechecks using saved page_flags.
+ *
+ * MF_MSG_GET_HWPOISON is intentionally excluded: it covers dynamically
+ * allocated kernel memory (SLAB/SLUB, vmalloc, kernel stacks, page tables)
+ * which shares the return path with transient refcount races, so panicking
+ * would risk false positives.
+ */
+static bool panic_on_unrecoverable_mf(unsigned long pfn,
+				      enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	struct page *p;
+
+	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
+		return false;
+
+	switch (type) {
+	case MF_MSG_KERNEL:
+	case MF_MSG_UNKNOWN:
+		return true;
+	case MF_MSG_KERNEL_HIGH_ORDER:
+		/*
+		 * Rule out a concurrent buddy allocation: give the
+		 * allocator a moment to finish prep_new_page() and
+		 * re-check. A genuine high-order kernel tail page stays
+		 * unowned; an in-flight allocation will have bumped the
+		 * refcount, attached a mapping, or placed the page on
+		 * an LRU by now.
+		 */
+		p = pfn_to_online_page(pfn);
+		if (!p)
+			return true;
+		/*
+		 * Yield so a concurrent allocator on another CPU can
+		 * finish prep_new_page() and have its writes become
+		 * visible before we resample the page state.
+		 */
+		cpu_relax();
+		return page_count(p) == 0 &&
+		       !PageLRU(p) &&
+		       !page_mapped(p) &&
+		       !page_folio(p)->mapping &&
+		       !is_free_buddy_page(p);
+	default:
+		return false;
+	}
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1298,6 +1378,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(pfn, type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 
@@ -2428,6 +2511,14 @@ int memory_failure(unsigned long pfn, int flags)
 			}
 			res = action_result(pfn, MF_MSG_BUDDY, res);
 		} else {
+			/*
+			 * The page has refcount 0 but is not in the buddy
+			 * allocator — typically a tail page of a high-order
+			 * kernel allocation. A buddy page being concurrently
+			 * allocated to userspace can also briefly land here;
+			 * panic_on_unrecoverable_mf() rechecks to rule that
+			 * out before triggering a panic.
+			 */
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;

-- 
2.52.0

