* mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
@ 2026-06-24 10:39 Breno Leitao
2026-06-24 12:04 ` Kiryl Shutsemau
0 siblings, 1 reply; 2+ messages in thread
From: Breno Leitao @ 2026-06-24 10:39 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
rppt, pratyush
Cc: kexec, linux-mm, rneu, riel, caggio, kas
TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
next kernel doesn't hand out known-bad RAM.
The problem
===========
When a page is hard-offlined due to an uncorrectable memory error (multi-bit
ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
that lives in the running kernel's data structures, not in the hardware.
A kexec replaces the kernel image but not the physical DRAM. The next kernel
rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
ordinary system RAM, and its buddy allocator hands the frame back out. The
known-bad cell is silently back in circulation, and the next access faults
again - potentially in a context that is harder to recover from than the
original (e.g. a kernel allocation rather than a killable user page).
This matters most where kexec is frequent and machines are long-lived: live
kernel update on large fleets. Poison knowledge accumulates over uptime and is
thrown away on every update.
This is the case at Meta and at many other hyperscalers.
Possible solutions
==================
1. Do nothing (status quo). The next kernel hands out known-bad
RAM, and we hope for the best.
2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
frame would simply never become RAM (no allocator race at all).
But: it is x86-only (no arm64 equivalent in the same mechanism;
this series is tested on arm64);
3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
device poison lists). This is the correct layer for *cross power
cycle* persistence and is complementary to this work. But it is
per-platform, out of OS control, not universally available, and
cannot carry OS-discovered or software-simulated poison
(MADV_HWPOISON, the injector). kexec can also happen long before
firmware retirement takes effect.
4. reserve_mem= / memmap= on the command line. The corresponding entries
would be appended automatically to the next kernel's cmdline at kexec time.
5. A bespoke kexec segment / setup_data blob. This reinvents what
KHO already provides - preserved memory plus an FDT handoff to
the next kernel - which is the upstream-blessed generic mechanism
for exactly this kind of state.
This PoC
========
* Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
HandOver) to carry the poison list between kernels.
* Producer: hooks num_poisoned_pages_inc()/_sub() - the single
chokepoint for every poison/unpoison event - and records each
poisoned PFN into a vmalloc array that KHO preserves across the
kexec, described by a small versioned "hwpoison" subtree.
* Consumer: early in the next boot (fs_initcall_sync, before the
buddy allocator has handed anything out) it restores that array
and re-runs memory_failure() on each PFN, re-offlining the frame
and rebuilding the full hwpoison state (PG_hwpoison, counters,
HardwareCorrupted).
* The replay feeds back through the producer, so the list
re-publishes itself and survives an arbitrary chain of kexecs.
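
The producer/consumer loop described above can be modelled in userspace.
This is only a sketch, not the PoC code: the struct layout, the fixed
capacity, and every hwp_* name are assumptions made for illustration, and
the memory_failure() replay is stood in for by a plain hwp_record() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HWP_VERSION  1u   /* hypothetical format version */
#define HWP_CAPACITY 64u  /* hypothetical fixed capacity for this sketch */

/* Hypothetical handoff layout: a versioned header followed by PFNs. */
struct hwp_list {
	uint32_t version;
	uint32_t count;
	uint64_t pfns[HWP_CAPACITY];
};

static void hwp_init(struct hwp_list *l)
{
	assert(l != NULL);
	*l = (struct hwp_list){ .version = HWP_VERSION };
}

/* Producer: models the hook on the poisoned-page counter increment. */
static int hwp_record(struct hwp_list *l, uint64_t pfn)
{
	assert(l != NULL);
	for (uint32_t i = 0; i < l->count; i++)
		if (l->pfns[i] == pfn)
			return 0;	/* already recorded */
	if (l->count == HWP_CAPACITY)
		return -1;		/* list full */
	l->pfns[l->count++] = pfn;
	return 0;
}

/* Producer: models the unpoison path; list order is not preserved. */
static int hwp_forget(struct hwp_list *l, uint64_t pfn)
{
	assert(l != NULL);
	for (uint32_t i = 0; i < l->count; i++) {
		if (l->pfns[i] == pfn) {
			l->pfns[i] = l->pfns[--l->count];
			return 0;
		}
	}
	return -1;	/* PFN was not on the list */
}

/*
 * Consumer: replay an inherited list into the new kernel's own list.
 * Because the replay goes through hwp_record(), the fresh list is
 * ready to be handed to the kernel after that. Returns the number of
 * entries replayed, or -1 if the inherited list is rejected.
 */
static int hwp_replay(const struct hwp_list *inherited, struct hwp_list *fresh)
{
	assert(inherited != NULL && fresh != NULL);
	if (inherited->version != HWP_VERSION)
		return -1;	/* unknown format: ignore the whole list */
	if (inherited->count > HWP_CAPACITY)
		return -1;	/* implausible count */
	for (uint32_t i = 0; i < inherited->count; i++)
		if (hwp_record(fresh, inherited->pfns[i]) != 0)
			return -1;
	return (int)inherited->count;
}
```

Chaining hwp_replay() from one list into the next models a sequence of
kexecs: each generation ends up with the same set of PFNs without any
extra bookkeeping.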
Open questions
==============
* Is there any alternative I am not seeing?
* Is a dedicated "hwpoison" subtree the right granularity, or
should this live under a broader RAS/KHO umbrella?
* Trusting the inherited list: should the next kernel bound the count /
validate PFNs against its own memory map before replaying?
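
To make the last question concrete, here is one possible shape for such a
check, as a userspace sketch. The bound HWP_MAX_ENTRIES, the ram_range
type, and the function names are all hypothetical; a real implementation
would consult the kernel's own view of memory rather than a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HWP_MAX_ENTRIES 4096u	/* hypothetical upper bound on list length */

/* Hypothetical description of one RAM range; end_pfn is exclusive. */
struct ram_range {
	uint64_t start_pfn;
	uint64_t end_pfn;
};

static bool pfn_is_ram(const struct ram_range *map, size_t nr, uint64_t pfn)
{
	for (size_t i = 0; i < nr; i++)
		if (pfn >= map[i].start_pfn && pfn < map[i].end_pfn)
			return true;
	return false;
}

/*
 * Filter an inherited PFN array in place, keeping only PFNs that fall
 * inside the new kernel's RAM ranges. Returns the number of entries
 * kept, or -1 if the count exceeds the bound and the list is rejected.
 */
static long hwp_sanitize(uint64_t *pfns, size_t count,
			 const struct ram_range *map, size_t nr)
{
	assert(pfns != NULL || count == 0);
	if (count > HWP_MAX_ENTRIES)
		return -1;

	size_t kept = 0;
	for (size_t i = 0; i < count; i++)
		if (pfn_is_ram(map, nr, pfns[i]))
			pfns[kept++] = pfns[i];
	return (long)kept;
}
```

Rejecting an oversized list outright, instead of truncating it, is a
design choice of this sketch; either policy could be argued for.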
Limitations
===========
* Poison events before KHO init (fs_initcall) cannot be published;
academic in practice as MCEs do not fire that early.
* Per-page only. Cross-power-cycle retirement of a whole DIMM
is not covered.
I have a working PoC. In case you are interested in the details, it is
available here:
https://github.com/leitao/linux/tree/b4/hwpoison
* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
2026-06-24 10:39 mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC] Breno Leitao
@ 2026-06-24 12:04 ` Kiryl Shutsemau
0 siblings, 0 replies; 2+ messages in thread
From: Kiryl Shutsemau @ 2026-06-24 12:04 UTC (permalink / raw)
To: Breno Leitao, Ard Biesheuvel
Cc: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
rppt, pratyush, kexec, linux-mm, rneu, riel, caggio
On Wed, Jun 24, 2026 at 03:39:38AM -0700, Breno Leitao wrote:
> * Consumer: early in the next boot (fs_initcall_sync, before the
> buddy allocator has handed anything out) it restores that array
> and re-runs memory_failure() on each PFN, re-offlining the frame
> and rebuilding the full hwpoison state (PG_hwpoison, counters,
> HardwareCorrupted).
fs_initcall_sync is not before buddy hands anything out - buddy has been
live since memblock_free_all() in start_kernel(), and every initcall before
this one has allocated freely. So this is recovery, not prevention: you may
be running memory_failure() against a frame already in use, possibly by a
kernel allocation.
Two windows are missed entirely:
- memblock allocations between setup_arch() and memblock_free_all()
(page tables, mem_map[], percpu) can land on the bad frame.
- The kernel image itself: KASLR picks its location in the
decompressor/stub, long before any initcall. The next kernel can end
up running *on* the bad frame.
So I don't think this should be a memory_failure() replay. The frames need
to leave the next kernel's view at the memory-map level, before memblock
and KASLR.
> Possible solutions
> ==================
...
>
> 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
> frame would simply never become RAM (no allocator race at all).
> But: it is x86-only (no arm64 equivalent in the same mechanism;
> this series is tested on arm64);
(+Ard. I might get some details around EFI wrong.)
This isn't accurate, and I think it's the right direction for EFI
platforms. EFI_UNUSABLE_MEMORY is honored on both arches today, no new
consumer code:
- arm64: reserve_regions() marks non-usable memory nomap.
- x86: do_add_efi_memmap() maps it to E820_TYPE_UNUSABLE.
And it closes the KASLR window for free, because the image is only placed in
EFI_CONVENTIONAL_MEMORY on both (x86 process_efi_entries(), arm64
randomalloc.c). So the bad frame is invisible to both the allocator and
KASLR, which is exactly what fs_initcall_sync can't give you.
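
As a rough illustration of what "leaving the memory map" could look like,
the userspace sketch below carves a single bad page out of a conventional
memory descriptor and retypes it as unusable. The struct md type and
md_carve_bad_page() are hypothetical stand-ins, not kernel or firmware
code; only the two type values are taken from the UEFI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory type values as defined by the UEFI specification. */
#define EFI_CONVENTIONAL_MEMORY 7u
#define EFI_UNUSABLE_MEMORY     8u

/* Simplified stand-in for an EFI memory descriptor, in 4 KiB pages. */
struct md {
	uint32_t type;
	uint64_t first_page;
	uint64_t num_pages;
};

/*
 * Carve one bad page out of a conventional-memory descriptor.
 * Writes 1 to 3 descriptors to out[] and returns how many were written.
 * If src is not conventional memory, or the page is outside src,
 * src is copied through unchanged.
 */
static size_t md_carve_bad_page(const struct md *src, uint64_t bad_page,
				struct md out[3])
{
	size_t n = 0;

	assert(src != NULL && out != NULL);

	if (src->type != EFI_CONVENTIONAL_MEMORY ||
	    bad_page < src->first_page ||
	    bad_page - src->first_page >= src->num_pages) {
		out[n++] = *src;
		return n;
	}

	uint64_t before = bad_page - src->first_page;	/* pages below */
	uint64_t after = src->num_pages - before - 1;	/* pages above */

	if (before)
		out[n++] = (struct md){ EFI_CONVENTIONAL_MEMORY,
					src->first_page, before };
	out[n++] = (struct md){ EFI_UNUSABLE_MEMORY, bad_page, 1 };
	if (after)
		out[n++] = (struct md){ EFI_CONVENTIONAL_MEMORY,
					bad_page + 1, after };
	return n;
}
```

With a map rewritten this way, anything that only looks at
EFI_CONVENTIONAL_MEMORY entries never sees the bad page.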
There's also LINUX_EFI_MEMRESERVE (efi_mem_reserve_persistent()). It is
cross-arch and reserved pre-buddy in efi_init(), and it looks otherwise fine,
but it's parsed too late to keep KASLR off the frame.
--
Kiryl Shutsemau / Kirill A. Shutemov