mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]

* mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
@ 2026-06-24 10:39 Breno Leitao
  2026-06-24 12:04 ` Kiryl Shutsemau
  0 siblings, 1 reply; 2+ messages in thread
From: Breno Leitao @ 2026-06-24 10:39 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, pratyush
  Cc: kexec, linux-mm, rneu, riel, caggio, kas

TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
next kernel doesn't hand out known-bad RAM.

The problem
===========

When a page is hard-offlined due to an uncorrectable memory error (multi-bit
ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
that lives in the running kernel's data structures, not in the hardware.

A kexec replaces the kernel image but not the physical DRAM. The next kernel
rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
ordinary system RAM, and its buddy allocator hands the frame back out. The
known-bad cell is silently back in circulation, and the next access faults
again - potentially in a context that is harder to recover from than the
original (e.g. a kernel allocation rather than a killable user page).

This matters most where kexec is frequent and machines are long-lived: live
kernel update on large fleets. Poison knowledge accumulates over uptime and is
thrown away on every update.

This is the case at Meta and at many other hyperscalers.

Possible solutions
==================

1. Do nothing (status quo). The next kernel hands out known-bad
   RAM, and we hope for the best.

2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
   frame would simply never become RAM (no allocator race at all).
   But: it is x86-only (no arm64 equivalent in the same mechanism;
   this series is tested on arm64).

3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
   device poison lists). This is the correct layer for *cross power
   cycle* persistence and is complementary to this work. But it is
   per-platform, out of OS control, not universally available, and
   cannot carry OS-discovered or software-simulated poison
   (MADV_HWPOISON, the injector). kexec can also happen long before
   firmware retirement takes effect.

4. reserve_mem= / memmap= on the command line. Automatically set
   reserve_mem= / memmap= on the next kexec kernel's cmdline.

5. A bespoke kexec segment / setup_data blob. This reinvents what
   KHO already provides - preserved memory plus an FDT handoff to
   the next kernel - which is the upstream-blessed generic mechanism
   for exactly this kind of state.
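
To make option 4 concrete, here is a minimal userspace sketch of generating
memmap=nn$ss entries (the "mark as reserved" form of the parameter) from a
PFN list. The function name, the 4 KiB page size and the buffer handling are
assumptions for illustration, not code from the PoC, and architecture support
for memmap= varies. Each poisoned page costs one entry, so the command line
grows with the poison list.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT_ASSUMED 12	/* assumption: 4 KiB pages */

/*
 * Hypothetical helper: write one "memmap=4K$0x<addr>" entry per PFN into
 * buf, separated by spaces. Returns the string length, or 0 if the entries
 * do not fit (buf may then hold a truncated string and must be discarded).
 */
size_t format_memmap(char *buf, size_t len, const uint64_t *pfns, size_t count)
{
	size_t used = 0;

	if (len > 0)
		buf[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int n = snprintf(buf + used, len - used,
				 "%smemmap=4K$0x%" PRIx64,
				 i ? " " : "",
				 pfns[i] << PAGE_SHIFT_ASSUMED);

		if (n < 0 || (size_t)n >= len - used)
			return 0;	/* out of command-line space */
		used += (size_t)n;
	}
	return used;
}
```

For PFNs 0x1234 and 0x5 this produces
"memmap=4K$0x1234000 memmap=4K$0x5000".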

This PoC
========

  * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
    HandOver) to carry the poison list between kernels.

  * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
    chokepoint for every poison/unpoison event - and records each
    poisoned PFN into a vmalloc array that KHO preserves across the
    kexec, described by a small versioned "hwpoison" subtree.

  * Consumer: early in the next boot (fs_initcall_sync, before the
    buddy allocator has handed anything out) it restores that array
    and re-runs memory_failure() on each PFN, re-offlining the frame
    and rebuilding the full hwpoison state (PG_hwpoison, counters,
    HardwareCorrupted).

  * The replay feeds back through the producer, so the list
    re-publishes itself and survives an arbitrary chain of kexecs.
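
The producer/consumer flow above can be modelled in plain userspace C. This
is only a sketch under stated assumptions: the struct layout, HWP_VERSION,
HWP_MAX and the hwp_*() names are hypothetical and are not taken from the
PoC, and hwp_record() stands in for the whole memory_failure() ->
num_poisoned_pages_inc() path. It shows why a replay through the producer
hook re-publishes the list for the next kexec.

```c
#include <stdbool.h>
#include <stdint.h>

#define HWP_VERSION 1u		/* hypothetical format version */
#define HWP_MAX     1024u	/* hypothetical capacity */

struct hwp_list {
	uint32_t version;	/* checked by the consumer before replay */
	uint32_t count;		/* number of valid entries in pfn[] */
	uint64_t pfn[HWP_MAX];	/* poisoned page frame numbers */
};

/* Producer: what a hook on num_poisoned_pages_inc() would do. */
bool hwp_record(struct hwp_list *l, uint64_t pfn)
{
	for (uint32_t i = 0; i < l->count; i++)
		if (l->pfn[i] == pfn)
			return true;	/* already recorded */
	if (l->count == HWP_MAX)
		return false;		/* full: drop rather than overflow */
	l->pfn[l->count++] = pfn;
	return true;
}

/* Producer: what a hook on num_poisoned_pages_sub() would do. */
void hwp_forget(struct hwp_list *l, uint64_t pfn)
{
	for (uint32_t i = 0; i < l->count; i++) {
		if (l->pfn[i] == pfn) {
			l->pfn[i] = l->pfn[--l->count];	/* swap-remove */
			return;
		}
	}
}

/*
 * Consumer: replay the inherited list. Every replayed PFN goes through
 * hwp_record() again, so the new kernel's list is ready for the next kexec.
 * Returns the number of PFNs accepted.
 */
uint32_t hwp_replay(const struct hwp_list *old, struct hwp_list *cur)
{
	uint32_t replayed = 0;

	if (old->version != HWP_VERSION)
		return 0;	/* unknown layout: ignore the handoff */
	for (uint32_t i = 0; i < old->count && i < HWP_MAX; i++)
		if (hwp_record(cur, old->pfn[i]))
			replayed++;
	return replayed;
}
```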

Open questions
==============

  * Is there any alternative I am not seeing?

  * Is a dedicated "hwpoison" subtree the right granularity, or
    should this live under a broader RAS/KHO umbrella?

  * Trusting the inherited list: should the next kernel bound the count /
    validate PFNs against its own memory map before replaying?
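
For the trust question, one possible shape of the check is sketched below in
userspace C: cap the number of entries and keep only PFNs that fall inside a
RAM range the new kernel knows about. struct pfn_range, pfn_is_ram() and
hwp_sanitize() are hypothetical names used for illustration; a real
implementation would consult the kernel's own memory map instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pfn_range {
	uint64_t start;	/* first PFN in the range */
	uint64_t end;	/* one past the last PFN */
};

/* True if pfn lies inside any of the nr RAM ranges. */
bool pfn_is_ram(const struct pfn_range *ram, size_t nr, uint64_t pfn)
{
	for (size_t i = 0; i < nr; i++)
		if (pfn >= ram[i].start && pfn < ram[i].end)
			return true;
	return false;
}

/*
 * Filter the inherited PFNs in place: drop anything outside RAM and stop
 * once max_keep entries have been accepted. Returns the number kept.
 */
size_t hwp_sanitize(uint64_t *pfns, size_t count, size_t max_keep,
		    const struct pfn_range *ram, size_t nr)
{
	size_t kept = 0;

	for (size_t i = 0; i < count && kept < max_keep; i++)
		if (pfn_is_ram(ram, nr, pfns[i]))
			pfns[kept++] = pfns[i];
	return kept;
}
```

The cap bounds the replay cost if the inherited count is corrupt, and the
range check keeps the consumer from acting on a PFN that does not exist in
the new kernel's view of memory.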

Limitations
===========

  * Poison events before KHO init (fs_initcall) cannot be published;
    academic in practice as MCEs do not fire that early.

  * Per-page only. Cross-power-cycle retirement of a whole DIMM
    is not covered.

I have a working PoC. It is available here, in case you are interested in the
details:

  https://github.com/leitao/linux/tree/b4/hwpoison


* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
From: Kiryl Shutsemau @ 2026-06-24 12:04 UTC (permalink / raw)
  To: Breno Leitao, Ard Biesheuvel
  Cc: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, pratyush, kexec, linux-mm, rneu, riel, caggio

On Wed, Jun 24, 2026 at 03:39:38AM -0700, Breno Leitao wrote:
>   * Consumer: early in the next boot (fs_initcall_sync, before the
>     buddy allocator has handed anything out) it restores that array
>     and re-runs memory_failure() on each PFN, re-offlining the frame
>     and rebuilding the full hwpoison state (PG_hwpoison, counters,
>     HardwareCorrupted).

fs_initcall_sync is not before buddy hands anything out - buddy has been
live since memblock_free_all() in start_kernel(), and every initcall before
this one has allocated freely. So this is recovery, not prevention: you may
be running memory_failure() against a frame already in use, possibly by a
kernel allocation.

Two windows are missed entirely:

  - memblock allocations between setup_arch() and memblock_free_all()
    (page tables, mem_map[], percpu) can land on the bad frame.

  - The kernel image itself: KASLR picks its location in the
    decompressor/stub, long before any initcall. The next kernel can end
    up running *on* the bad frame.

So I don't think this should be a memory_failure() replay. The frames need
to leave the next kernel's view at the memory-map level, before memblock
and KASLR.
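
To illustrate what "leaving the view at the memory-map level" means, here is
a userspace sketch that carves one bad PFN out of a usable region, so that
anything consuming the map afterwards never sees the frame as RAM. The
struct, the enum and carve_bad_pfn() are hypothetical and deliberately
simpler than real e820 or EFI descriptors.

```c
#include <stddef.h>
#include <stdint.h>

enum mem_type { MEM_USABLE, MEM_UNUSABLE };

struct mem_region {
	uint64_t start;		/* first PFN */
	uint64_t end;		/* one past the last PFN */
	enum mem_type type;
};

/*
 * Split region `in` around bad_pfn and write 1..3 regions to out[].
 * A region that is not usable, or does not contain bad_pfn, is copied
 * unchanged. Returns the number of regions written.
 */
size_t carve_bad_pfn(struct mem_region in, uint64_t bad_pfn,
		     struct mem_region out[3])
{
	size_t n = 0;

	if (in.type != MEM_USABLE || bad_pfn < in.start || bad_pfn >= in.end) {
		out[n++] = in;
		return n;
	}
	if (bad_pfn > in.start)		/* usable part below the bad frame */
		out[n++] = (struct mem_region){ in.start, bad_pfn, MEM_USABLE };
	out[n++] = (struct mem_region){ bad_pfn, bad_pfn + 1, MEM_UNUSABLE };
	if (bad_pfn + 1 < in.end)	/* usable part above the bad frame */
		out[n++] = (struct mem_region){ bad_pfn + 1, in.end, MEM_USABLE };
	return n;
}
```

Because the split happens in the map itself, early allocators and image
placement that only pick from usable regions skip the frame without any
hwpoison-specific code.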

> Possible solutions
> ==================
...
> 
> 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
>    frame would simply never become RAM (no allocator race at all).
>    But: it is x86-only (no arm64 equivalent in the same mechanism;
>    this series is tested on arm64);

(+Ard. I might get some details around EFI wrong.)

This isn't accurate, and I think it's the right direction for EFI
platforms. EFI_UNUSABLE_MEMORY is honored on both arches today, no new
consumer code:

  - arm64: reserve_regions() marks non-usable memory nomap.
  - x86: do_add_efi_memmap() maps it to E820_TYPE_UNUSABLE.

And it closes the KASLR window for free, because the image is only placed in
EFI_CONVENTIONAL_MEMORY on both (x86 process_efi_entries(), arm64
randomalloc.c). So the bad frame is invisible to both the allocator and
KASLR, which is exactly what fs_initcall_sync can't give you.

There's also LINUX_EFI_MEMRESERVE (efi_mem_reserve_persistent()) -
cross-arch, reserved pre-buddy in efi_init(). It looks otherwise fine, but
it's parsed too late to keep KASLR off the frame.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
