mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]

Kexec Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Breno Leitao <leitao@debian.org>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com, david@kernel.org,
	 lance.yang@linux.dev, akpm@linux-foundation.org,
	baoquan.he@linux.dev, rppt@kernel.org,  pratyush@kernel.org
Cc: kexec@lists.infradead.org, linux-mm@kvack.org, rneu@meta.com,
	 riel@surriel.com, caggio@meta.com, kas@kernel.org
Subject: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
Date: Wed, 24 Jun 2026 03:39:38 -0700	[thread overview]
Message-ID: <ajut_LDQGYCShApx@gmail.com> (raw)

TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
next kernel doesn't hand out known-bad RAM.

The problem
===========

When a page is hard-offlined due to an uncorrectable memory error (multi bit
ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
that lives in the running kernel's data structures, not in the hardware.

A kexec replaces the kernel image but not the physical DRAM. The next kernel
rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
ordinary system RAM, and the buddy allocator hands it back out on the next
kernel. The known-bad cell is silently back in circulation, and the next
access faults again - potentially in a context that is harder to recover than
the original (e.g. a kernel allocation rather than a killable user page).

This matters most where kexec is frequent and machines are long-lived: live
kernel update on large fleets. Poison knowledge accumulates over uptime and is
thrown away on every update.

This is the case at Meta and in many hyperscalers.

Possible solutions
==================

1. Do nothing (status quo). The next kernel hands out known-bad
   RAM, and hope for the best. 

2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
   frame would simply never become RAM (no allocator race at all).
   But: it is x86-only (no arm64 equivalent in the same mechanism;
   this series is tested on arm64);

3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
   device poison lists). This is the correct layer for *cross power
   cycle* persistence and is complementary to this work. But it is
   per-platform, out of OS control, not universally available, and
   cannot carry OS-discovered or software-simulated poison
   (MADV_HWPOISON, the injector). kexec can also happen long before
   firmware retirement takes effect.

4. reserve_mem= / memmap= on the command line. Automatically sent reserved_mem=
   for the next kexec kernel cmdline.

5. A bespoke kexec segment / setup_data blob. This reinvents what
   KHO already provides - preserved memory plus an FDT handoff to
   the next kernel - which is the upstream-blessed generic mechanism
   for exactly this kind of state.

This PoC
========

  * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
    HandOver) to carry the poison list between kernels.

  * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
    chokepoint for every poison/unpoison event - and records each
    poisoned PFN into a vmalloc array that KHO preserves across the
    kexec, described by a small versioned "hwpoison" subtree.

  * Consumer: early in the next boot (fs_initcall_sync, before the
    buddy allocator has handed anything out) it restores that array
    and re-runs memory_failure() on each PFN, re-offlining the frame
    and rebuilding the full hwpoison state (PG_hwpoison, counters,
    HardwareCorrupted).

  * The replay feeds back through the producer, so the list
    re-publishes itself and survives an arbitrary chain of kexecs.

Open questions
==============

  * Is there any alternative I am not seeing?

  * Is a dedicated "hwpoison" subtree the right granularity, or
    should this live under a broader RAS/KHO umbrella?

  * Trusting the inherited list: should the next kernel bound the count /
    validate PFNs against its own memory map before replaying?

Limitations
===========

  * Poison events before KHO init (fs_initcall) cannot be published;
    academic in practice as MCEs do not fire that early.

  * Per-page only. Cross-power-cycle retirement of a whole DIMM
    is not covered.

I've got a PoC working, and it is available in here, in case you are interested
in the details I am playing with

  https://github.com/leitao/linux/tree/b4/hwpoison

next             reply	other threads:[~2026-06-24 10:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-24 10:39 Breno Leitao [this message]
2026-06-24 12:04 ` mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC] Kiryl Shutsemau
2026-06-24 13:46   ` Pratyush Yadav
2026-06-24 15:21   ` Breno Leitao
2026-06-24 15:34     ` Kiryl Shutsemau
2026-06-24 13:40 ` Pratyush Yadav
2026-06-24 14:44   ` Rik van Riel
2026-06-24 15:17     ` Pratyush Yadav

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajut_LDQGYCShApx@gmail.com \
    --to=leitao@debian.org \
    --cc=akpm@linux-foundation.org \
    --cc=baoquan.he@linux.dev \
    --cc=caggio@meta.com \
    --cc=david@kernel.org \
    --cc=kas@kernel.org \
    --cc=kexec@lists.infradead.org \
    --cc=lance.yang@linux.dev \
    --cc=linmiaohe@huawei.com \
    --cc=linux-mm@kvack.org \
    --cc=nao.horiguchi@gmail.com \
    --cc=pratyush@kernel.org \
    --cc=riel@surriel.com \
    --cc=rneu@meta.com \
    --cc=rppt@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox