Linux Kernel Selftest development
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Miaohe Lin <linmiaohe@huawei.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	 Lorenzo Stoakes <ljs@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	 Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	 Michal Hocko <mhocko@suse.com>, Shuah Khan <shuah@kernel.org>,
	 Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	lance.yang@linux.dev,  Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	 linux-trace-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH v9 0/6] mm/memory-failure: add panic option for unrecoverable pages
Date: Wed, 17 Jun 2026 02:40:06 -0700	[thread overview]
Message-ID: <ajJq-uVRBH5SBkwK@gmail.com> (raw)
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

On Tue, Jun 09, 2026 at 03:56:54AM -0700, Breno Leitao wrote:
> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running.  The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash.  In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
> 
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context.  The default is disabled so existing workloads see no
> change.
> 
> There is a selftest that test different cases, and I tested it using
> the following variants:
> 
>   ┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
>   │ Variant │   PFN    │                          Result                           │
>   ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
>   │ rodata  │ 0x2600   │ Panic with "Memory failure: 0x2600: unrecoverable page"   │
>   ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
>   │ slab    │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
>   ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
>   │ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
>   └─────────┴──────────┴───────────────────────────────────────────────────────────┘
> 
> Each one shows the same call trace, exactly the path the series builds:
> 
>   hard_offline_page_store
>     → memory_failure
>       → action_result
>         → panic("Memory failure: %#lx: unrecoverable page")

Debugging another issue earlier today, just found a kernel crash that is
hitting a ignored page later in the day, and randomly misbehaving/crashing.

 Memory failure: 0x140ae: unhandlable page.
 Memory failure: 0x140ae: recovery action for get hwpoison page: Ignored                     <-- Ignored 
 loop0: detected capacity change from 0 to 15241056
 EDAC MC0: 1 UE multi-bit ECC on LP5x_0 LP5x_0 (node:0 card:0 module:0 rank:0 bank:2 device:28 row:42700 column:96
 {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 308
 {3}[Hardware Error]: event severity: recoverable
 {3}[Hardware Error]:  imprecise tstamp: 2026-06-16 02:50:03
 {3}[Hardware Error]:  Error 0, type: recoverable
 {3}[Hardware Error]:   section_type: memory error
 {3}[Hardware Error]:   physical_address: 0x0000000aeccde180
 {3}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
 {3}[Hardware Error]:   node:0 card:0 module:0 rank:0 bank:2 device:28 row:42700 column:960 requestor_id:0x0000000
 {3}[Hardware Error]:   error_type: 3, multi-bit ECC
 {3}[Hardware Error]:   DIMM location: LP5x_0 LP5x_0
 Memory failure: 0xaeccd: recovery action for dirty LRU page: Recovered

 Internal error: synchronous external abort: 0000000096000410 [#1]  SMP
 Modules linked in: ghes_edac(E) squashfs(E) act_gact(E) sch_fq(E) tcp_diag(E) inet_diag(E) cls_bpf(E) evdev(E) sm
 CPU: 51 UID: 0 PID: 1 Comm: systemd Kdump: loaded Tainted: G   M       OE K     6.16.1-0_fbk2_0_gf40efc324cc8 #1
 Tainted: [M]=MACHINE_CHECK, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [K]=LIVEPATCH
 pstate: 834010c9 (Nzcv daIF +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
 pc : clear_inode+0x34/0x108
 lr : proc_evict_inode.llvm.1771226604092943895+0x28/0x68
 sp : ffff800083f6f8d0
 x29: ffff800083f6f8e0 x28: 0000000000000011 x27: ffff0000c1378788
 x26: ffffffffffffffff x25: ffff800082747de0 x24: ffff0000c0ae9898
 x23: ffff8000819155f8 x22: ffff0000c0ae9888 x21: ffff0000c0ae9808
 x20: ffff0000c0ae9818 x19: ffff0000c0ae9788 x18: 000000000000001c
 x17: 0000000000000018 x16: 0000000000000040 x15: 0000000000000000
 x14: 0000000000000001 x13: 0000000000000000 x12: 0000000000002710
 x11: ffff0000c0ae9898 x10: ffff0000c1299b58 x9 : 0000000000000001
 x8 : ffff0000c0ae9900 x7 : ffff8000828db000 x6 : 0000000000005040
 x5 : ffffffffffffffff x4 : ffffffdfc05c8aa0 x3 : ffff000126470000
 x2 : ffffffffffffffff x1 : 0000000000000000 x0 : ffff0000c0ae9788
 Call trace:
  clear_inode+0x34/0x108 (P)
  proc_evict_inode.llvm.1771226604092943895+0x28/0x68
  evict+0xec/0x328
  iput+0xa8/0x310
  dentry_unlink_inode+0xa4/0x188
  __dentry_kill+0x74/0x358
  shrink_dentry_list+0xc8/0x198
 ....

      parent reply	other threads:[~2026-06-17  9:40 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-09 10:56 [PATCH v9 0/6] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-06-09 10:56 ` [PATCH v9 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages Breno Leitao
2026-06-09 10:56 ` [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE Breno Leitao
2026-06-09 14:41   ` David Hildenbrand (Arm)
2026-06-09 16:15     ` Breno Leitao
2026-06-09 18:41       ` David Hildenbrand (Arm)
2026-06-10  9:53         ` Breno Leitao
2026-06-09 10:56 ` [PATCH v9 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages Breno Leitao
2026-06-09 10:56 ` [PATCH v9 4/6] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-06-09 10:56 ` [PATCH v9 5/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
2026-06-09 10:57 ` [PATCH v9 6/6] selftests/mm: add hwpoison-panic destructive test Breno Leitao
2026-06-17  9:40 ` Breno Leitao [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajJq-uVRBH5SBkwK@gmail.com \
    --to=leitao@debian.org \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=david@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=lance.yang@linux.dev \
    --cc=liam@infradead.org \
    --cc=linmiaohe@huawei.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=ljs@kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=mhocko@suse.com \
    --cc=nao.horiguchi@gmail.com \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=skhan@linuxfoundation.org \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox