From: Breno Leitao <leitao@debian.org>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com, david@kernel.org,
lance.yang@linux.dev, akpm@linux-foundation.org,
baoquan.he@linux.dev, rppt@kernel.org, pratyush@kernel.org
Cc: kexec@lists.infradead.org, linux-mm@kvack.org, rneu@meta.com,
riel@surriel.com, caggio@meta.com, kas@kernel.org
Subject: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
Date: Wed, 24 Jun 2026 03:39:38 -0700
Message-ID: <ajut_LDQGYCShApx@gmail.com>
TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
next kernel doesn't hand out known-bad RAM.
The problem
===========
When a page is hard-offlined due to an uncorrectable memory error (multi-bit
ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
that lives in the running kernel's data structures, not in the hardware.
A kexec replaces the kernel image but not the physical DRAM. The next kernel
rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
ordinary system RAM, and its buddy allocator hands the frame back out. The
known-bad cell is silently back in circulation, and the next
access faults again - potentially in a context that is harder to recover than
the original (e.g. a kernel allocation rather than a killable user page).
This matters most where kexec is frequent and machines are long-lived: live
kernel update on large fleets. Poison knowledge accumulates over uptime and is
thrown away on every update.
This is the case at Meta and at many other hyperscalers.
Possible solutions
==================
1. Do nothing (status quo). The next kernel hands out known-bad
RAM, and we hope for the best.
2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
frame would simply never become RAM (no allocator race at all).
But: it is x86-only (arm64 has no equivalent of this mechanism,
and this series is tested on arm64).
3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
device poison lists). This is the correct layer for *cross power
cycle* persistence and is complementary to this work. But it is
per-platform, out of OS control, not universally available, and
cannot carry OS-discovered or software-simulated poison
(MADV_HWPOISON, the injector). kexec can also happen long before
firmware retirement takes effect.
4. reserve_mem= / memmap= on the command line. Automatically pass
reserve_mem= entries on the command line of the next kexec kernel.
5. A bespoke kexec segment / setup_data blob. This reinvents what
KHO already provides - preserved memory plus an FDT handoff to
the next kernel - which is the upstream-blessed generic mechanism
for exactly this kind of state.
This PoC
========
* Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
HandOver) to carry the poison list between kernels.
* Producer: hooks num_poisoned_pages_inc()/_sub() - the single
chokepoint for every poison/unpoison event - and records each
poisoned PFN into a vmalloc array that KHO preserves across the
kexec, described by a small versioned "hwpoison" subtree.
* Consumer: early in the next boot (fs_initcall_sync, before the
buddy allocator has handed anything out) it restores that array
and re-runs memory_failure() on each PFN, re-offlining the frame
and rebuilding the full hwpoison state (PG_hwpoison, counters,
HardwareCorrupted).
* The replay feeds back through the producer, so the list
re-publishes itself and survives an arbitrary chain of kexecs.
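To make the producer/consumer loop above concrete, here is a small userspace
model of it. This is only a sketch under stated assumptions: every name in it
(hwp_list, kernel_model, model_boot(), model_memory_failure(), ...) is
hypothetical, the fixed-size array stands in for the KHO-preserved vmalloc
array, and nothing here is the PoC's actual code or a real KHO API. It only
shows why a replay that goes through the normal accounting path republishes
the list for the following kexec.

```c
/* Userspace model only: none of these names are kernel or KHO APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HWP_MAX     64	/* hypothetical capacity of the preserved array */
#define HWP_VERSION 1	/* hypothetical format version of the subtree */

struct hwp_list {		/* stands in for the preserved state */
	uint32_t version;
	uint32_t count;
	uint64_t pfns[HWP_MAX];
};

struct kernel_model {		/* state owned by one running "kernel" */
	unsigned long num_poisoned;
	struct hwp_list published;
};

static bool hwp_contains(const struct hwp_list *l, uint64_t pfn)
{
	for (uint32_t i = 0; i < l->count; i++)
		if (l->pfns[i] == pfn)
			return true;
	return false;
}

/* Stand-in for memory_failure(): account the frame, then publish it. */
static bool model_memory_failure(struct kernel_model *k, uint64_t pfn)
{
	struct hwp_list *l = &k->published;

	if (hwp_contains(l, pfn))
		return true;		/* already poisoned, nothing to do */
	if (l->count == HWP_MAX)
		return false;		/* the sketch simply refuses to grow */
	k->num_poisoned++;		/* models num_poisoned_pages_inc() */
	l->pfns[l->count++] = pfn;	/* producer hook */
	return true;
}

/* Stand-in for unpoisoning: drop the accounting and the published entry. */
static void model_unpoison(struct kernel_model *k, uint64_t pfn)
{
	struct hwp_list *l = &k->published;

	for (uint32_t i = 0; i < l->count; i++) {
		if (l->pfns[i] == pfn) {
			l->pfns[i] = l->pfns[--l->count];
			k->num_poisoned--;
			return;
		}
	}
}

/*
 * Consumer: boot a fresh kernel and replay the inherited list, if any.
 * 'inherited' must not point into 'k' itself.
 */
static void model_boot(struct kernel_model *k, const struct hwp_list *inherited)
{
	assert(!inherited || inherited != &k->published);
	memset(k, 0, sizeof(*k));
	k->published.version = HWP_VERSION;

	if (!inherited || inherited->version != HWP_VERSION ||
	    inherited->count > HWP_MAX)
		return;			/* nothing usable was handed over */
	for (uint32_t i = 0; i < inherited->count; i++)
		model_memory_failure(k, inherited->pfns[i]);
}
```

Booting a second model kernel with the first one's published list, and a
third with the second's, leaves the same PFNs poisoned in each, which is the
"arbitrary chain of kexecs" property described above.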
Open questions
==============
* Is there any alternative I am not seeing?
* Is a dedicated "hwpoison" subtree the right granularity, or
should this live under a broader RAS/KHO umbrella?
* Trusting the inherited list: should the next kernel bound the count /
validate PFNs against its own memory map before replaying?
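For the last question, one possible shape of such a check is sketched below
in plain userspace C. It is an illustration, not something the PoC does: the
pfn_range type, hwp_validate() and the max_count limit are all hypothetical,
and the range table stands in for whatever view of its own memory map the
next kernel would consult. The idea is to reject an implausibly long list
outright and to silently drop any PFN that does not fall inside known RAM.

```c
/* Userspace sketch: all names here are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pfn_range {		/* half-open interval [start, end) */
	uint64_t start;
	uint64_t end;
};

static bool pfn_in_ranges(uint64_t pfn, const struct pfn_range *r, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (pfn >= r[i].start && pfn < r[i].end)
			return true;
	return false;
}

/*
 * Copy the inherited PFNs that lie inside the next kernel's RAM ranges
 * into out[] (which must hold at least 'count' entries).
 * Returns the number kept, or -1 if the list exceeds max_count and is
 * rejected as a whole.
 */
static long hwp_validate(const uint64_t *pfns, size_t count, size_t max_count,
			 const struct pfn_range *ram, size_t nr_ram,
			 uint64_t *out)
{
	size_t kept = 0;

	assert(pfns && ram && out);
	if (count > max_count)
		return -1;
	for (size_t i = 0; i < count; i++)
		if (pfn_in_ranges(pfns[i], ram, nr_ram))
			out[kept++] = pfns[i];
	return (long)kept;
}
```

Whether an out-of-range entry should be dropped quietly, logged, or treated
as a reason to distrust the whole list is exactly the policy question raised
above; the sketch picks the first option only to stay short.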
Limitations
===========
* Poison events before KHO init (fs_initcall) cannot be published;
academic in practice as MCEs do not fire that early.
* Per-page only. Cross-power-cycle retirement of a whole DIMM
is not covered.
I have a working PoC, available here in case you are interested in the
details:
https://github.com/leitao/linux/tree/b4/hwpoison