Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
@ 2026-06-24 10:39 Breno Leitao
  2026-06-24 12:04 ` Kiryl Shutsemau
  2026-06-24 13:40 ` Pratyush Yadav
  0 siblings, 2 replies; 5+ messages in thread
From: Breno Leitao @ 2026-06-24 10:39 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, pratyush
  Cc: kexec, linux-mm, rneu, riel, caggio, kas

TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
next kernel doesn't hand out known-bad RAM.

The problem
===========

When a page is hard-offlined due to an uncorrectable memory error (multi bit
ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
that lives in the running kernel's data structures, not in the hardware.

A kexec replaces the kernel image but not the physical DRAM. The next kernel
rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
ordinary system RAM, and the buddy allocator hands it back out on the next
kernel. The known-bad cell is silently back in circulation, and the next
access faults again - potentially in a context that is harder to recover than
the original (e.g. a kernel allocation rather than a killable user page).

This matters most where kexec is frequent and machines are long-lived: live
kernel update on large fleets. Poison knowledge accumulates over uptime and is
thrown away on every update.

This is the case at Meta and in many hyperscalers.


Possible solutions
==================

1. Do nothing (status quo). The next kernel hands out known-bad
   RAM, and hope for the best. 

2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
   frame would simply never become RAM (no allocator race at all).
   But: it is x86-only (no arm64 equivalent in the same mechanism;
   this series is tested on arm64);

3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
   device poison lists). This is the correct layer for *cross power
   cycle* persistence and is complementary to this work. But it is
   per-platform, out of OS control, not universally available, and
   cannot carry OS-discovered or software-simulated poison
   (MADV_HWPOISON, the injector). kexec can also happen long before
   firmware retirement takes effect.

4. reserve_mem= / memmap= on the command line. Automatically sent reserved_mem=
   for the next kexec kernel cmdline.

5. A bespoke kexec segment / setup_data blob. This reinvents what
   KHO already provides - preserved memory plus an FDT handoff to
   the next kernel - which is the upstream-blessed generic mechanism
   for exactly this kind of state.

This PoC
========

  * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
    HandOver) to carry the poison list between kernels.

  * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
    chokepoint for every poison/unpoison event - and records each
    poisoned PFN into a vmalloc array that KHO preserves across the
    kexec, described by a small versioned "hwpoison" subtree.

  * Consumer: early in the next boot (fs_initcall_sync, before the
    buddy allocator has handed anything out) it restores that array
    and re-runs memory_failure() on each PFN, re-offlining the frame
    and rebuilding the full hwpoison state (PG_hwpoison, counters,
    HardwareCorrupted).

  * The replay feeds back through the producer, so the list
    re-publishes itself and survives an arbitrary chain of kexecs.


Open questions
==============

  * Is there any alternative I am not seeing?

  * Is a dedicated "hwpoison" subtree the right granularity, or
    should this live under a broader RAS/KHO umbrella?

  * Trusting the inherited list: should the next kernel bound the count /
    validate PFNs against its own memory map before replaying?

Limitations
===========

  * Poison events before KHO init (fs_initcall) cannot be published;
    academic in practice as MCEs do not fire that early.

  * Per-page only. Cross-power-cycle retirement of a whole DIMM
    is not covered.

I've got a PoC working, and it is available in here, in case you are interested
in the details I am playing with

  https://github.com/leitao/linux/tree/b4/hwpoison


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-24 14:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 10:39 mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC] Breno Leitao
2026-06-24 12:04 ` Kiryl Shutsemau
2026-06-24 13:46   ` Pratyush Yadav
2026-06-24 13:40 ` Pratyush Yadav
2026-06-24 14:44   ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox