Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]

From: Pratyush Yadav <pratyush@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: nao.horiguchi@gmail.com,  linmiaohe@huawei.com,
	 david@kernel.org, lance.yang@linux.dev,
	 akpm@linux-foundation.org,  baoquan.he@linux.dev,
	rppt@kernel.org,  pratyush@kernel.org,
	 kexec@lists.infradead.org, linux-mm@kvack.org,  rneu@meta.com,
	 riel@surriel.com,  caggio@meta.com, kas@kernel.org
Subject: Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
Date: Wed, 24 Jun 2026 15:40:03 +0200
Message-ID: <2vxzse6ckqfg.fsf@kernel.org>
In-Reply-To: <ajut_LDQGYCShApx@gmail.com> (Breno Leitao's message of "Wed, 24 Jun 2026 03:39:38 -0700")

On Wed, Jun 24 2026, Breno Leitao wrote:

> TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
> next kernel doesn't hand out known-bad RAM.
>
> The problem
> ===========
>
> When a page is hard-offlined due to an uncorrectable memory error (multi-bit
> ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
> allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
> that lives in the running kernel's data structures, not in the hardware.
>
> A kexec replaces the kernel image but not the physical DRAM. The next kernel
> rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
> ordinary system RAM, and the buddy allocator hands it back out on the next
> kernel. The known-bad cell is silently back in circulation, and the next
> access faults again - potentially in a context that is harder to recover than
> the original (e.g. a kernel allocation rather than a killable user page).
>
> This matters most where kexec is frequent and machines are long-lived: live
> kernel update on large fleets. Poison knowledge accumulates over uptime and is
> thrown away on every update.

I don't know much about hardware page poisoning. From what I understand
of your description, it seems like the hardware fires off an event that
the kernel receives, and then the kernel stores the state, and hardware
"forgets" it. So what happens when something accesses the poisoned
memory anyway? Do we get another hwpoison event?

Also, what happens on cold reboot? If the HW does not remember bad
pages, won't the kernel be in the same position? How does it know the
bad pages on a cold boot?

>
> This is the case at Meta and in many hyperscalers.
>
>
> Possible solutions
> ==================
>
> 1. Do nothing (status quo). The next kernel hands out known-bad
>    RAM, and we hope for the best.
>
> 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
>    frame would simply never become RAM (no allocator race at all).
>    But it is x86-only (arm64 has no equivalent mechanism, and
>    this series is tested on arm64).
>
> 3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
>    device poison lists). This is the correct layer for *cross power
>    cycle* persistence and is complementary to this work. But it is
>    per-platform, out of OS control, not universally available, and
>    cannot carry OS-discovered or software-simulated poison
>    (MADV_HWPOISON, the injector). kexec can also happen long before
>    firmware retirement takes effect.

I don't know enough about these two to say whether they are a good idea
or not. I'll let more competent people comment on that.

>
> 4. reserve_mem= / memmap= on the command line. Automatically set reserve_mem=
>    on the next kexec kernel's command line.

Does this scale at all? How many poisoned pages do you expect to see on
a big machine with, say, a few terabytes of RAM? Won't the command line
end up being way too big?
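
To put a rough number on that, here is a back-of-the-envelope sketch in
userspace C. It assumes one "memmap=4K$0x<addr> " entry per poisoned page
(worst case, no adjacent pages to merge) and a 2048-byte command line
limit, which I believe is COMMAND_LINE_SIZE on x86 and arm64 but should be
checked per architecture. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: 2048 bytes, believed to match COMMAND_LINE_SIZE on x86/arm64. */
#define CMDLINE_LIMIT 2048

/* Length in bytes of one "memmap=4K$0x<addr> " entry for the given address. */
static size_t cmdline_entry_len(unsigned long long addr)
{
	/* snprintf(NULL, 0, ...) only computes the formatted length. */
	int len = snprintf(NULL, 0, "memmap=4K$0x%llx ", addr);

	assert(len > 0);
	return (size_t)len;
}

/* Bytes needed to list nr_pages isolated poisoned pages near addr. */
static size_t cmdline_bytes_needed(size_t nr_pages, unsigned long long addr)
{
	return nr_pages * cmdline_entry_len(addr);
}

/* Entries that fit if the whole command line were available for them. */
static size_t cmdline_max_entries(unsigned long long addr)
{
	return CMDLINE_LIMIT / cmdline_entry_len(addr);
}
```

For an address around 2 TiB (0x20000000000) each entry is 24 bytes, so
fewer than a hundred entries fit even with an otherwise empty command
line. The real budget is smaller because the existing parameters share the
same buffer.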

>
> 5. A bespoke kexec segment / setup_data blob. This reinvents what
>    KHO already provides - preserved memory plus an FDT handoff to
>    the next kernel - which is the upstream-blessed generic mechanism
>    for exactly this kind of state.

Yeah, at that point you'd be better off using KHO directly. You won't
have to muck about with architecture specific bits.

>
> This PoC
> ========
>
>   * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
>     HandOver) to carry the poison list between kernels.
>
>   * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
>     chokepoint for every poison/unpoison event - and records each
>     poisoned PFN into a vmalloc array that KHO preserves across the
>     kexec, described by a small versioned "hwpoison" subtree.

More of an implementation detail, but with a vmalloc array, what happens
if there are more poisoned pages than the array can hold?

We have two alternative data structures in KHO: the radix tree [0] and
KHO block [1]. I think the radix tree will be inefficient unless you
have a _lot_ of poisoned pages, but KHO block is likely going to work a
lot better because you don't have to define the buffer size up front.
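
To illustrate what I mean by not sizing the buffer up front, here is a
userspace sketch of a chained-block PFN list. This is _not_ the KHO block
API; the struct layout, names and block size are all hypothetical, and a
real version would link blocks by physical address and preserve each block
with KHO instead of using calloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: header plus 510 PFNs is 4096 bytes on 64-bit. */
#define PFNS_PER_BLOCK 510

struct pfn_block {
	struct pfn_block *next;	/* would be a physical address across kexec */
	uint64_t count;		/* used slots in pfns[] */
	uint64_t pfns[PFNS_PER_BLOCK];
};

struct pfn_list {
	struct pfn_block *head;
	struct pfn_block *tail;
	uint64_t total;
};

/* Append one PFN, allocating a new block only when the tail is full. */
static int pfn_list_add(struct pfn_list *l, uint64_t pfn)
{
	if (!l->tail || l->tail->count == PFNS_PER_BLOCK) {
		struct pfn_block *b = calloc(1, sizeof(*b));

		if (!b)
			return -1;
		if (l->tail)
			l->tail->next = b;
		else
			l->head = b;
		l->tail = b;
	}
	assert(l->tail->count < PFNS_PER_BLOCK);
	l->tail->pfns[l->tail->count++] = pfn;
	l->total++;
	return 0;
}

/* Count blocks by walking the chain. */
static uint64_t pfn_list_nr_blocks(const struct pfn_list *l)
{
	uint64_t n = 0;

	for (const struct pfn_block *b = l->head; b; b = b->next)
		n++;
	return n;
}

/* Release every block and reset the list. */
static void pfn_list_free(struct pfn_list *l)
{
	struct pfn_block *b = l->head;

	while (b) {
		struct pfn_block *next = b->next;

		free(b);
		b = next;
	}
	l->head = l->tail = NULL;
	l->total = 0;
}
```

The list costs one page while there are few poisoned pages and grows one
page at a time after that, so there is no limit to pick in advance.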

>
>   * Consumer: early in the next boot (fs_initcall_sync, before the
>     buddy allocator has handed anything out) it restores that array

Why do you say that nothing has been handed out by then? Buddy is up and
running after memblock_free_all(), which happens from mm_core_init(), and
memblock goes away entirely in page_alloc_init_late(), which runs right
after early initcalls but before any other levels. See
kernel_init_freeable().

Concrete example: hugetlb is initialized at subsys_initcall(), and
allocates all its 2M hugepages from buddy.

So I think you probably need to do this somewhere in
page_alloc_init_late(), after all the deferred struct pages are
initialized.
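
As a toy model of the ordering I am describing (plain userspace C, not
kernel code), the boot stages can be written as an enum in the order they
run. The stage names are labels I made up for this sketch; the ordering is
my reading of start_kernel() and kernel_init_freeable(), so please
double-check it against the tree you are working on.

```c
#include <assert.h>
#include <stdbool.h>

/* Boot stages in the order they run; names are labels for this sketch. */
enum boot_stage {
	STAGE_MM_CORE_INIT,		/* memblock_free_all(): buddy is live */
	STAGE_EARLY_INITCALL,
	STAGE_PAGE_ALLOC_INIT_LATE,	/* deferred struct pages initialized */
	STAGE_PURE_INITCALL,
	STAGE_CORE_INITCALL,
	STAGE_POSTCORE_INITCALL,
	STAGE_ARCH_INITCALL,
	STAGE_SUBSYS_INITCALL,		/* hugetlb allocates from buddy */
	STAGE_FS_INITCALL,
	STAGE_FS_INITCALL_SYNC,		/* where the PoC consumer runs */
	STAGE_ROOTFS_INITCALL,
	STAGE_DEVICE_INITCALL,
	STAGE_LATE_INITCALL,
};

/* True if stage a runs strictly before stage b. */
static bool runs_before(enum boot_stage a, enum boot_stage b)
{
	return a < b;
}

/*
 * True if some other code may already have allocated from buddy by the
 * time a consumer at the given stage runs.
 */
static bool buddy_used_before(enum boot_stage consumer)
{
	assert(consumer <= STAGE_LATE_INITCALL);
	return runs_before(STAGE_MM_CORE_INIT, consumer);
}
```

In this model subsys_initcall() runs before fs_initcall_sync(), so the
hugetlb allocations land before the consumer gets a chance to replay the
list.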

>     and re-runs memory_failure() on each PFN, re-offlining the frame
>     and rebuilding the full hwpoison state (PG_hwpoison, counters,
>     HardwareCorrupted).
>
>   * The replay feeds back through the producer, so the list
>     re-publishes itself and survives an arbitrary chain of kexecs.
>
>
> Open questions
> ==============
>
>   * Is there any alternative I am not seeing?
>
>   * Is a dedicated "hwpoison" subtree the right granularity, or
>     should this live under a broader RAS/KHO umbrella?

What's "RAS"?

A dedicated subtree sounds best to me. An alternative is to put it into
kexec-metadata, but I don't know what we'd gain from doing so.

I _don't_ think it belongs in the base KHO ABI.

>
>   * Trusting the inherited list: should the next kernel bound the count /
>     validate PFNs against its own memory map before replaying?

Normally with KHO you assume the memory map does not change and the
previous kernel is trustworthy. You can try to do basic validation of
the memory map to ensure correctness, but I think in the end you just
_have_ to trust the previous kernel.
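
If you do want the basic validation, something along these lines might be
enough. It is a userspace sketch: struct pfn_range, the cap and the
function names are hypothetical, and in the kernel the ranges would come
from the new kernel's own view of RAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on how many inherited PFNs we are willing to replay. */
#define MAX_INHERITED_PFNS 65536

/* One RAM range as seen by the new kernel; end_pfn is exclusive. */
struct pfn_range {
	uint64_t start_pfn;
	uint64_t end_pfn;
};

/* True if pfn falls inside any of the nr ranges. */
static bool pfn_in_ranges(uint64_t pfn, const struct pfn_range *r, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		assert(r[i].start_pfn <= r[i].end_pfn);
		if (pfn >= r[i].start_pfn && pfn < r[i].end_pfn)
			return true;
	}
	return false;
}

/*
 * Drop PFNs that are not backed by RAM, compacting the survivors to the
 * front of pfns[]. Returns the number kept, or -1 if count exceeds the
 * cap, in which case the list is rejected without being read.
 */
static long filter_inherited_pfns(uint64_t *pfns, size_t count,
				  const struct pfn_range *ram, size_t nr_ranges)
{
	size_t kept = 0;

	if (count > MAX_INHERITED_PFNS)
		return -1;

	for (size_t i = 0; i < count; i++) {
		if (pfn_in_ranges(pfns[i], ram, nr_ranges))
			pfns[kept++] = pfns[i];
	}
	return (long)kept;
}
```

This only catches a list that is obviously inconsistent with the memory
map. It does nothing against a previous kernel that hands over a
well-formed but wrong list.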

>
> Limitations
> ===========
>
>   * Poison events before KHO init (fs_initcall) cannot be published;
>     academic in practice as MCEs do not fire that early.
>
>   * Per-page only. Cross-power-cycle retirement of a whole DIMM
>     is not covered.
>
> I've got a PoC working. It is available here, in case you are interested in
> the details:
>
>   https://github.com/leitao/linux/tree/b4/hwpoison

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kho_radix_tree.h
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kho_block.h

-- 
Regards,
Pratyush Yadav
