Re: [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages

From: Lance Yang <lance.yang@linux.dev>
To: leitao@debian.org
Cc: david@kernel.org, linmiaohe@huawei.com, nao.horiguchi@gmail.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	skhan@linuxfoundation.org, ljs@kernel.org,
	Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, shuah@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	kernel-team@meta.com, Lance Yang <lance.yang@linux.dev>
Subject: Re: [PATCH v5 2/4] mm/memory-failure: add panic option for unrecoverable pages
Date: Sun, 10 May 2026 22:42:20 +0800
Message-ID: <20260510144220.92522-1-lance.yang@linux.dev>
In-Reply-To: <aftnSfb15G92JON5@gmail.com>


On Wed, May 06, 2026 at 09:18:12AM -0700, Breno Leitao wrote:
>On Tue, Apr 28, 2026 at 11:07:21AM +0800, Lance Yang wrote:
>> 
>> On Mon, Apr 27, 2026 at 05:49:28PM +0200, David Hildenbrand (Arm) wrote:
>> >> +	switch (type) {
>> >> +	case MF_MSG_KERNEL:
>> >> +	case MF_MSG_UNKNOWN:
>> >> +		return true;
>> >> +	case MF_MSG_KERNEL_HIGH_ORDER:
>> >> +		/*
>> >> +		 * Rule out a concurrent buddy allocation: give the
>> >> +		 * allocator a moment to finish prep_new_page() and
>> >> +		 * re-check. A genuine high-order kernel tail page stays
>> >> +		 * unowned; an in-flight allocation will have bumped the
>> >> +		 * refcount, attached a mapping, or placed the page on
>> >> +		 * an LRU by now.
>> >> +		 */
>> >> +		p = pfn_to_online_page(pfn);
>> >> +		if (!p)
>> >> +			return true;
>> >> +		/*
>> >> +		 * Yield so a concurrent allocator on another CPU can
>> >> +		 * finish prep_new_page() and have its writes become
>> >> +		 * visible before we resample the page state.
>> >> +		 */
>> >> +		cpu_relax();
>> >> +		return page_count(p) == 0 &&
>> >> +		       !PageLRU(p) &&
>> >> +		       !page_mapped(p) &&
>> >> +		       !page_folio(p)->mapping &&
>> >> +		       !is_free_buddy_page(p);
>> >
>> >I don't get what you are doing here. The right way to check for a tail page is
>> >not by checking the refcount.
>> >
>> >Further, you are not holding a folio reference? If so, calling
>> >page_mapped/folio_mapped is shaky. On concurrent folio split you can trigger a
>> >VM_WARN_ON_FOLIO().
>> >
>> >
>> >Maybe folio_snapshot() is what you are looking for, if you are in fact not
>> >holding a reference?
>> 
>> Right! Maybe we should not try to make this decision in
>> panic_on_unrecoverable_mf().
>> 
>> By the time we get here, we only know the final MF_MSG_* type. The
>> real reason why get_hwpoison_page() failed is already lost.
>> 
>> Wonder if it would be better to split that earlier, around
>> __get_unpoison_page()/get_any_page(). That code still knows why
>> grabbing the page failed, either an unsupported kernel page or
>> just a temporary race we cannot really trust :)
>> 
>> Then the later panic logic can be simple: panic for the stable
>> unsupported kernel page case, and not for the temporary race case.
>> 
>> That would also avoid trying to guess MF_MSG_KERNEL_HIGH_ORDER here:)
>
>This is a very good feedback, and definitely what I wanted to do, but,
>failed. Once we have the reason, we don't need this dance to guess the
>reason.
>
>I've hacked a patch based on this approach. How does it sound?

Yes, this direction makes sense to me, though I'm not an expert :D

I played with something similar (untested) on top of patch #01:

---8<---
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 432d5f996c64..a2799f063913 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;

 static int sysctl_enable_soft_offline __read_mostly = 1;

+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);

 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };

@@ -1281,6 +1292,18 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }

+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
+		return false;
+
+	if (type == MF_MSG_KERNEL)
+		return true;
+
+	return false;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1298,6 +1321,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);

+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }

@@ -1389,11 +1415,27 @@ static int __get_hwpoison_page(struct page *page, unsigned long flags)

 #define GET_PAGE_MAX_RETRY_NUM 3

-static int get_any_page(struct page *p, unsigned long flags)
+enum mf_get_page_status {
+	MF_GET_PAGE_OK = 0,
+	MF_GET_PAGE_RACE,
+	MF_GET_PAGE_UNHANDLABLE,
+};
+
+static void set_mf_get_page_status(enum mf_get_page_status *gp_status,
+				   enum mf_get_page_status value)
+{
+	if (gp_status)
+		*gp_status = value;
+}
+
+static int get_any_page(struct page *p, unsigned long flags,
+			enum mf_get_page_status *gp_status)
 {
 	int ret = 0, pass = 0;
 	bool count_increased = false;

+	set_mf_get_page_status(gp_status, MF_GET_PAGE_OK);
+
 	if (flags & MF_COUNT_INCREASED)
 		count_increased = true;

@@ -1406,11 +1448,13 @@ static int get_any_page(struct page *p, unsigned long flags)
 				if (pass++ < GET_PAGE_MAX_RETRY_NUM)
 					goto try_again;
 				ret = -EBUSY;
+				set_mf_get_page_status(gp_status, MF_GET_PAGE_RACE);
 			} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
 				/* We raced with put_page, retry. */
 				if (pass++ < GET_PAGE_MAX_RETRY_NUM)
 					goto try_again;
 				ret = -EIO;
+				set_mf_get_page_status(gp_status, MF_GET_PAGE_RACE);
 			}
 			goto out;
 		} else if (ret == -EBUSY) {
@@ -1423,6 +1467,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 				goto try_again;
 			}
 			ret = -EIO;
+			set_mf_get_page_status(gp_status, MF_GET_PAGE_UNHANDLABLE);
 			goto out;
 		}
 	}
@@ -1442,6 +1487,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 		}
 		put_page(p);
 		ret = -EIO;
+		set_mf_get_page_status(gp_status, MF_GET_PAGE_UNHANDLABLE);
 	}
 out:
 	if (ret == -EIO)
@@ -1480,6 +1526,7 @@ static int __get_unpoison_page(struct page *page)
  * get_hwpoison_page() - Get refcount for memory error handling
  * @p:		Raw error page (hit by memory error)
  * @flags:	Flags controlling behavior of error handling
+ * @gp_status:	Optional output for the reason get_any_page() failed
  *
  * get_hwpoison_page() takes a page refcount of an error page to handle memory
  * error on it, after checking that the error page is in a well-defined state
@@ -1503,7 +1550,8 @@ static int __get_unpoison_page(struct page *page)
  *         operations like allocation and free,
  *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
  */
-static int get_hwpoison_page(struct page *p, unsigned long flags)
+static int get_hwpoison_page(struct page *p, unsigned long flags,
+			     enum mf_get_page_status *gp_status)
 {
 	int ret;

@@ -1511,7 +1559,7 @@ static int get_hwpoison_page(struct page *p, unsigned long flags)
 	if (flags & MF_UNPOISON)
 		ret = __get_unpoison_page(p);
 	else
-		ret = get_any_page(p, flags);
+		ret = get_any_page(p, flags, gp_status);
 	zone_pcp_enable(page_zone(p));

 	return ret;
@@ -2341,6 +2389,7 @@ static int memory_failure_pfn(unsigned long pfn, int flags)
  */
 int memory_failure(unsigned long pfn, int flags)
 {
+	enum mf_get_page_status gp_status = MF_GET_PAGE_OK;
 	struct page *p;
 	struct folio *folio;
 	struct dev_pagemap *pgmap;
@@ -2413,7 +2462,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
 	is_reserved = PageReserved(p);
-	res = get_hwpoison_page(p, flags);
+	res = get_hwpoison_page(p, flags, &gp_status);
 	if (!res) {
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
@@ -2437,9 +2486,13 @@ int memory_failure(unsigned long pfn, int flags)
 		/*
 		 * Pages with PG_reserved set are not currently managed by the
 		 * page allocator (memblock-reserved memory, driver reservations,
-		 * etc.), so classify them as kernel-owned for reporting.
+		 * etc.), so classify them as kernel-owned for reporting. Do the
+		 * same for pages that get_any_page() still cannot handle after
+		 * retries: likely non-LRU/non-buddy pages such as slab, kernel
+		 * stack, page table or vmalloc-backed pages. Transient lifecycle
+		 * races stay as MF_MSG_GET_HWPOISON.
 		 */
-		if (is_reserved)
+		if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
 			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
 		else
 			res = action_result(pfn, MF_MSG_GET_HWPOISON,
@@ -2744,7 +2797,7 @@ int unpoison_memory(unsigned long pfn)
 		goto unlock_mutex;
 	}

-	ghp = get_hwpoison_page(p, MF_UNPOISON);
+	ghp = get_hwpoison_page(p, MF_UNPOISON, NULL);
 	if (!ghp) {
 		if (folio_test_hugetlb(folio)) {
 			huge = true;
@@ -2951,7 +3004,7 @@ int soft_offline_page(unsigned long pfn, int flags)

 retry:
 	get_online_mems();
-	ret = get_hwpoison_page(page, flags | MF_SOFT_OFFLINE);
+	ret = get_hwpoison_page(page, flags | MF_SOFT_OFFLINE, NULL);
 	put_online_mems();

 	if (hwpoison_filter(page)) {
---
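
For readers skimming the diff, the optional-output convention behind
set_mf_get_page_status() can be seen in isolation in the user-space sketch
below. It is only an illustration: every demo_* name is hypothetical and
nothing here is kernel code. The status is reset on entry and written only
when the caller passed a non-NULL pointer.

```c
#include <stddef.h>	/* NULL, for callers that do not want the status */

/* Hypothetical mirror of enum mf_get_page_status from the diff above. */
enum demo_status {
	DEMO_OK = 0,
	DEMO_RACE,
	DEMO_UNHANDLABLE,
};

/* Write through the out-parameter only when the caller supplied one. */
static void demo_set_status(enum demo_status *out, enum demo_status value)
{
	if (out)
		*out = value;
}

/*
 * Hypothetical worker: returns 0 on success, -1 on failure. The status is
 * reset on entry so a stale value from an earlier call cannot leak through.
 */
static int demo_get_page(int handlable, enum demo_status *out)
{
	demo_set_status(out, DEMO_OK);
	if (!handlable) {
		demo_set_status(out, DEMO_UNHANDLABLE);
		return -1;
	}
	return 0;
}
```

Callers that do not care about the reason, like unpoison_memory() and
soft_offline_page() in the diff, simply pass NULL.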

I would leave MF_MSG_KERNEL_HIGH_ORDER out for now. That path still
has the allocator race David pointed out, unless there is an easy way
to rule that out ...

I would also leave MF_MSG_UNKNOWN out. We don't really know what it is,
no? So it's not a good basis for a panic decision :)

Maybe it is better to keep panic_on_unrecoverable_mf() simple: classify
the get_any_page() failure reason earlier, but only panic on MF_MSG_KERNEL.

IMHO, making the knob too complicated for memory failures that should be
rare does not seem worth it. Just covering MF_MSG_KERNEL should already
help crash analysis a lot :)
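
The two-step idea above (classify early, then panic only on the kernel-page
case) can be modelled in plain user-space C. This is a rough sketch, not
the kernel implementation: the model_* names are hypothetical stand-ins
for the MF_* enums, and the enum values are illustrative only.

```c
#include <stdbool.h>

/* Hypothetical stand-in for enum mf_get_page_status. */
enum model_get_page_status {
	MODEL_GET_PAGE_OK = 0,
	MODEL_GET_PAGE_RACE,
	MODEL_GET_PAGE_UNHANDLABLE,
};

/* Hypothetical subset of the MF_MSG_* action page types. */
enum model_msg_type {
	MODEL_MSG_KERNEL,
	MODEL_MSG_KERNEL_HIGH_ORDER,
	MODEL_MSG_GET_HWPOISON,
	MODEL_MSG_UNKNOWN,
};

/* Hypothetical subset of enum mf_result. */
enum model_result {
	MODEL_IGNORED,
	MODEL_FAILED,
	MODEL_DELAYED,
	MODEL_RECOVERED,
};

/*
 * Step 1: classify where the failure reason is still known. Reserved
 * pages and pages that stayed unhandlable after retries count as kernel
 * pages; transient races keep the generic "get hwpoison page" type.
 */
static enum model_msg_type model_classify(bool is_reserved,
					  enum model_get_page_status status)
{
	if (is_reserved || status == MODEL_GET_PAGE_UNHANDLABLE)
		return MODEL_MSG_KERNEL;
	return MODEL_MSG_GET_HWPOISON;
}

/*
 * Step 2: the panic decision stays trivial. It fires only when the knob
 * is set, the result is "ignored", and the type is the kernel-page type.
 */
static bool model_should_panic(int knob, enum model_msg_type type,
			       enum model_result result)
{
	if (!knob || result != MODEL_IGNORED)
		return false;
	return type == MODEL_MSG_KERNEL;
}
```

With this split, a transient race is classified as MODEL_MSG_GET_HWPOISON
and never reaches the panic, while the high-order and unknown types are
deliberately left out of the decision.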

Feel free to pick up any bits that look useful :)

Cheers, Lance

[...]
