mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]

* mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
@ 2026-06-24 10:39 Breno Leitao
  2026-06-24 12:04 ` Kiryl Shutsemau
  2026-06-24 13:40 ` Pratyush Yadav
  0 siblings, 2 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-24 10:39 UTC
  To: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, pratyush
  Cc: kexec, linux-mm, rneu, riel, caggio, kas

TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
next kernel doesn't hand out known-bad RAM.

The problem
===========

When a page is hard-offlined due to an uncorrectable memory error (multi-bit
ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
that lives in the running kernel's data structures, not in the hardware.

A kexec replaces the kernel image but not the physical DRAM. The next kernel
rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
ordinary system RAM, and lets the buddy allocator hand it back out. The
known-bad cell is silently back in circulation, and the next access faults
again - potentially in a context that is harder to recover from than the
original (e.g. a kernel allocation rather than a killable user page).

This matters most where kexec is frequent and machines are long-lived: live
kernel update on large fleets. Poison knowledge accumulates over uptime and is
thrown away on every update.

This is the case at Meta and at many other hyperscalers.

Possible solutions
==================

1. Do nothing (status quo). The next kernel hands out known-bad
   RAM, and we hope for the best.

2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
   frame would simply never become RAM (no allocator race at all).
   But it is x86-only: arm64 has no equivalent in the same mechanism,
   and this series is tested on arm64.

3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
   device poison lists). This is the correct layer for *cross power
   cycle* persistence and is complementary to this work. But it is
   per-platform, out of OS control, not universally available, and
   cannot carry OS-discovered or software-simulated poison
   (MADV_HWPOISON, the injector). kexec can also happen long before
   firmware retirement takes effect.

4. reserve_mem= / memmap= on the command line. Automatically set
   reserve_mem= on the command line of the next kexec kernel.

5. A bespoke kexec segment / setup_data blob. This reinvents what
   KHO already provides - preserved memory plus an FDT handoff to
   the next kernel - which is the upstream-blessed generic mechanism
   for exactly this kind of state.

This PoC
========

  * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
    HandOver) to carry the poison list between kernels.

  * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
    chokepoint for every poison/unpoison event - and records each
    poisoned PFN into a vmalloc array that KHO preserves across the
    kexec, described by a small versioned "hwpoison" subtree.

  * Consumer: early in the next boot (fs_initcall_sync, before the
    buddy allocator has handed anything out) it restores that array
    and re-runs memory_failure() on each PFN, re-offlining the frame
    and rebuilding the full hwpoison state (PG_hwpoison, counters,
    HardwareCorrupted).

  * The replay feeds back through the producer, so the list
    re-publishes itself and survives an arbitrary chain of kexecs.
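
As a rough userspace model of the producer/consumer flow above (this is
not the PoC's code; the class and function names are hypothetical), the
list can be thought of as a set of PFNs that is updated on every
poison/unpoison event, published at handover, and replayed through the
same hook in the next kernel:

```python
class PoisonLog:
    """Toy stand-in for the preserved PFN array (hypothetical name)."""

    def __init__(self):
        self.pfns = set()

    def poison(self, pfn):
        # Models the hook in num_poisoned_pages_inc().
        self.pfns.add(pfn)

    def unpoison(self, pfn):
        # Models the hook in num_poisoned_pages_sub().
        self.pfns.discard(pfn)

    def handover(self):
        # Models what the outgoing kernel publishes for the next one.
        return sorted(self.pfns)


def boot_next_kernel(inherited):
    """Replay the inherited PFNs through the producer hook."""
    log = PoisonLog()
    for pfn in inherited:
        # Stands in for memory_failure(pfn), which feeds the producer again.
        log.poison(pfn)
    return log
```

Because the replay goes through poison(), the next kernel's own log
already contains everything it inherited, which is what lets the list
survive a chain of kexecs.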

Open questions
==============

  * Is there any alternative I am not seeing?

  * Is a dedicated "hwpoison" subtree the right granularity, or
    should this live under a broader RAS/KHO umbrella?

  * Trusting the inherited list: should the next kernel bound the count /
    validate PFNs against its own memory map before replaying?

Limitations
===========

  * Poison events before KHO init (fs_initcall) cannot be published;
    academic in practice as MCEs do not fire that early.

  * Per-page only. Cross-power-cycle retirement of a whole DIMM
    is not covered.

I have a working PoC, available here in case you are interested in the
details:

  https://github.com/leitao/linux/tree/b4/hwpoison


* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
  2026-06-24 10:39 mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC] Breno Leitao
@ 2026-06-24 12:04 ` Kiryl Shutsemau
  2026-06-24 13:46   ` Pratyush Yadav
  2026-06-24 13:40 ` Pratyush Yadav
  1 sibling, 1 reply; 6+ messages in thread
From: Kiryl Shutsemau @ 2026-06-24 12:04 UTC
  To: Breno Leitao, Ard Biesheuvel
  Cc: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, pratyush, kexec, linux-mm, rneu, riel, caggio

On Wed, Jun 24, 2026 at 03:39:38AM -0700, Breno Leitao wrote:
>   * Consumer: early in the next boot (fs_initcall_sync, before the
>     buddy allocator has handed anything out) it restores that array
>     and re-runs memory_failure() on each PFN, re-offlining the frame
>     and rebuilding the full hwpoison state (PG_hwpoison, counters,
>     HardwareCorrupted).

fs_initcall_sync is not before buddy hands anything out - buddy has been
live since memblock_free_all() in start_kernel(), and every initcall before
this one has allocated freely. So this is recovery, not prevention: you may
be running memory_failure() against a frame already in use, possibly by a
kernel allocation.

Two windows are missed entirely:

  - memblock allocations between setup_arch() and memblock_free_all()
    (page tables, mem_map[], percpu) can land on the bad frame.

  - The kernel image itself: KASLR picks its location in the
    decompressor/stub, long before any initcall. The next kernel can end
    up running *on* the bad frame.

So I don't think this should be a memory_failure() replay. The frames need
to leave the next kernel's view at the memory-map level, before memblock
and KASLR.

> Possible solutions
> ==================
...
> 
> 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
>    frame would simply never become RAM (no allocator race at all).
>    But it is x86-only: arm64 has no equivalent in the same mechanism,
>    and this series is tested on arm64.

(+Ard. I might get some details around EFI wrong.)

This isn't accurate, and I think it's the right direction for EFI
platforms. EFI_UNUSABLE_MEMORY is honored on both arches today, with no new
consumer code needed:

  - arm64: reserve_regions() marks non-usable memory nomap.
  - x86: do_add_efi_memmap() maps it to E820_TYPE_UNUSABLE.

And it closes the KASLR window for free, because the image is only placed in
EFI_CONVENTIONAL_MEMORY on both (x86 process_efi_entries(), arm64
randomalloc.c). So the bad frame is invisible to both the allocator and
KASLR, which is exactly what fs_initcall_sync can't give you.

There's also LINUX_EFI_MEMRESERVE (efi_mem_reserve_persistent()), which is
cross-arch and reserved pre-buddy in efi_init(). It looks otherwise fine, but
it's parsed too late to keep KASLR off the frame.
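
To illustrate the memory-map idea, here is a toy sketch of carving
poisoned frames out of a conventional-memory range so they show up as
unusable. How the outgoing kernel would actually edit the EFI memory map
is not specified in this thread; the function name and the tuple layout
are made up for the example.

```python
def carve_unusable(start, end, bad_pfns):
    """Split the PFN range [start, end) into typed sub-ranges.

    Returns a list of (start, end, type) tuples, where type is
    'conventional' or 'unusable'. Adjacent bad PFNs are merged.
    """
    out = []
    cur = start
    for pfn in sorted({p for p in bad_pfns if start <= p < end}):
        if pfn > cur:
            # Good memory between the previous cut and this bad frame.
            out.append((cur, pfn, "conventional"))
        if out and out[-1][2] == "unusable" and out[-1][1] == pfn:
            # Extend the previous unusable range by one frame.
            out[-1] = (out[-1][0], pfn + 1, "unusable")
        else:
            out.append((pfn, pfn + 1, "unusable"))
        cur = pfn + 1
    if cur < end:
        out.append((cur, end, "conventional"))
    return out
```

Anything that only allocates from the 'conventional' ranges, including
image placement, never sees the carved-out frames.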

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
  2026-06-24 10:39 mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC] Breno Leitao
  2026-06-24 12:04 ` Kiryl Shutsemau
@ 2026-06-24 13:40 ` Pratyush Yadav
  2026-06-24 14:44   ` Rik van Riel
  1 sibling, 1 reply; 6+ messages in thread
From: Pratyush Yadav @ 2026-06-24 13:40 UTC
  To: Breno Leitao
  Cc: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, pratyush, kexec, linux-mm, rneu, riel, caggio, kas

On Wed, Jun 24 2026, Breno Leitao wrote:

> TL;DR: carry the hardware-poisoned page list across kexec using KHO, so the
> next kernel doesn't hand out known-bad RAM.
>
> The problem
> ===========
>
> When a page is hard-offlined due to an uncorrectable memory error (multi bit
> ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it from the buddy
> allocator and accounts it in num_poisoned_pages / HardwareCorrupted. All of
> that lives in the running kernel's data structures, not in the hardware.
>
> A kexec replaces the kernel image but not the physical DRAM. The next kernel
> rebuilds mem_map[] from the firmware-provided memory map, sees the bad frame as
> ordinary system RAM, and the buddy allocator hands it back out on the next
> kernel. The known-bad cell is silently back in circulation, and the next
> access faults again - potentially in a context that is harder to recover than
> the original (e.g. a kernel allocation rather than a killable user page).
>
> This matters most where kexec is frequent and machines are long-lived: live
> kernel update on large fleets. Poison knowledge accumulates over uptime and is
> thrown away on every update.

I don't know much about hardware page poisoning. From what I understand
of your description, it seems like the hardware fires off an event that
the kernel receives, and then the kernel stores the state, and hardware
"forgets" it. So what happens when something accesses the poisoned
memory anyway? Do we get another hwpoison event?

Also, what happens on cold reboot? If the HW does not remember bad
pages, won't the kernel be in the same position? How does it know the
bad pages on a cold boot?

>
> This is the case at Meta and in many hyperscalers.
>
>
> Possible solutions
> ==================
>
> 1. Do nothing (status quo). The next kernel hands out known-bad
>    RAM, and we hope for the best.
>
> 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
>    frame would simply never become RAM (no allocator race at all).
>    But it is x86-only: arm64 has no equivalent in the same mechanism,
>    and this series is tested on arm64.
>
> 3. Firmware / platform page retirement (PPR, BMC page-offline, CXL
>    device poison lists). This is the correct layer for *cross power
>    cycle* persistence and is complementary to this work. But it is
>    per-platform, out of OS control, not universally available, and
>    cannot carry OS-discovered or software-simulated poison
>    (MADV_HWPOISON, the injector). kexec can also happen long before
>    firmware retirement takes effect.

I don't know enough about these two to say whether they are a good idea
or not. I'll let more competent people comment on that.

>
> 4. reserve_mem= / memmap= on the command line. Automatically set
>    reserve_mem= on the command line of the next kexec kernel.

Does this scale at all? How many poisoned pages do you expect to see on
a big machine with say a few terabytes of RAM? Won't the commandline end
up being way too big?
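
A back-of-the-envelope sketch of the scaling concern, assuming one
memmap=4K$<addr> entry per poisoned page and a 2048-byte command-line
limit. Both are assumptions for illustration: the actual limit is
per-architecture, and a real implementation could merge adjacent frames.

```python
def max_memmap_entries(cmdline_limit, highest_addr):
    """Upper bound on single-page memmap= entries that fit on a cmdline.

    Ignores every other parameter on the command line, so the real
    number would be lower.
    """
    entry = f"memmap=4K${highest_addr:#x}"
    # +1 for the space separating entries.
    return cmdline_limit // (len(entry) + 1)


# Last 4 KiB page of a 2 TiB machine, used as the longest address.
LAST_PAGE_2TIB = (2 << 40) - 4096
```

With these numbers the command line holds only tens of entries, far
fewer than a list of poisoned pages might need.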

>
> 5. A bespoke kexec segment / setup_data blob. This reinvents what
>    KHO already provides - preserved memory plus an FDT handoff to
>    the next kernel - which is the upstream-blessed generic mechanism
>    for exactly this kind of state.

Yeah, at that point you'd be better off using KHO directly. You won't
have to muck about with architecture specific bits.

>
> This PoC
> ========
>
>   * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
>     HandOver) to carry the poison list between kernels.
>
>   * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
>     chokepoint for every poison/unpoison event - and records each
>     poisoned PFN into a vmalloc array that KHO preserves across the
>     kexec, described by a small versioned "hwpoison" subtree.

More of an implementation detail, but with a vmalloc array, what if you
have too many poisoned pages?

We have two alternative data structures in KHO: the radix tree [0] and
KHO block [1]. I think the radix tree will be inefficient unless you
have a _lot_ of poisoned pages, but KHO block is likely going to work a
lot better because you don't have to define the buffer size up front.

>
>   * Consumer: early in the next boot (fs_initcall_sync, before the
>     buddy allocator has handed anything out) it restores that array

Why do you say so? Buddy is up and running after memblock_free_all(),
which happens from mm_core_init(), and memblock goes away entirely in
page_alloc_init_late(), which runs right after early initcalls but
before any other levels. See kernel_init_freeable().

Concrete example: hugetlb is initialized at subsys_initcall(), and
allocates all its 2M hugepages from buddy.

So I think you probably need to do this somewhere in
page_alloc_init_late(), after all the deferred struct pages are
initialized.

>     and re-runs memory_failure() on each PFN, re-offlining the frame
>     and rebuilding the full hwpoison state (PG_hwpoison, counters,
>     HardwareCorrupted).
>
>   * The replay feeds back through the producer, so the list
>     re-publishes itself and survives an arbitrary chain of kexecs.
>
>
> Open questions
> ==============
>
>   * Is there any alternative I am not seeing?
>
>   * Is a dedicated "hwpoison" subtree the right granularity, or
>     should this live under a broader RAS/KHO umbrella?

What's "RAS"?

A dedicated subtree sounds the best to me, but I think an alternate
option is to stick it into kexec-metadata. But I don't know what we'd
gain from doing so.

I _don't_ think it belongs in the base KHO ABI.

>
>   * Trusting the inherited list: should the next kernel bound the count /
>     validate PFNs against its own memory map before replaying?

Normally with KHO you assume the memory map does not change and the
previous kernel is trustworthy. You can try to do basic validation of
the memory map to ensure correctness, but I think in the end you just
_have_ to trust the previous kernel.
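
For what "basic validation" could look like, a minimal sketch: bound the
count and drop PFNs that fall outside the next kernel's RAM ranges. The
function name, the range representation and the policy of dropping
rather than failing are all assumptions, not something the PoC is known
to do.

```python
def validate_inherited(pfns, ram_ranges, max_count):
    """Sanity-check an inherited PFN list before replaying it.

    ram_ranges is a list of (start_pfn, end_pfn) half-open ranges.
    Raises ValueError if the list is implausibly long; otherwise
    returns the sorted, de-duplicated PFNs that lie inside RAM.
    """
    if len(pfns) > max_count:
        raise ValueError("inherited poison list exceeds bound")
    in_ram = {
        pfn for pfn in pfns
        if any(start <= pfn < end for start, end in ram_ranges)
    }
    return sorted(in_ram)
```

This only catches obviously inconsistent input; it does not replace
trusting the previous kernel.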

>
> Limitations
> ===========
>
>   * Poison events before KHO init (fs_initcall) cannot be published;
>     academic in practice as MCEs do not fire that early.
>
>   * Per-page only. Cross-power-cycle retirement of a whole DIMM
>     is not covered.
>
> I've got a PoC working, and it is available in here, in case you are interested
> in the details I am playing with
>
>   https://github.com/leitao/linux/tree/b4/hwpoison

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kho_radix_tree.h
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kho_block.h

-- 
Regards,
Pratyush Yadav


* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
  2026-06-24 12:04 ` Kiryl Shutsemau
@ 2026-06-24 13:46   ` Pratyush Yadav
  0 siblings, 0 replies; 6+ messages in thread
From: Pratyush Yadav @ 2026-06-24 13:46 UTC
  To: Kiryl Shutsemau
  Cc: Breno Leitao, Ard Biesheuvel, nao.horiguchi, linmiaohe, david,
	lance.yang, akpm, baoquan.he, rppt, pratyush, kexec, linux-mm,
	rneu, riel, caggio

On Wed, Jun 24 2026, Kiryl Shutsemau wrote:

> On Wed, Jun 24, 2026 at 03:39:38AM -0700, Breno Leitao wrote:
>>   * Consumer: early in the next boot (fs_initcall_sync, before the
>>     buddy allocator has handed anything out) it restores that array
>>     and re-runs memory_failure() on each PFN, re-offlining the frame
>>     and rebuilding the full hwpoison state (PG_hwpoison, counters,
>>     HardwareCorrupted).
>
> fs_initcall_sync is not before buddy hands anything out - buddy has been
> live since memblock_free_all() in start_kernel(), and every initcall before
> this one has allocated freely. So this is recovery, not prevention: you may
> be running memory_failure() against a frame already in use, possibly by a
> kernel allocation.
>
> Two windows are missed entirely:
>
>   - memblock allocations between setup_arch() and memblock_free_all()
>     (page tables, mem_map[], percpu) can land on the bad frame.
>
>   - The kernel image itself: KASLR picks its location in the
>     decompressor/stub, long before any initcall. The next kernel can end
>     up running *on* the bad frame.

With KHO, you have "scratch memory", a pre-reserved area of memory on
cold boot. The kernel image is always in this area when KHO is used. I
think it would be a fair idea to deny kexec if any of the pages in this
scratch area are poisoned. Because at that point you can't reliably boot
anyway.
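
The check could be as simple as the following sketch, which models the
scratch area as half-open PFN ranges. The function name and data layout
are hypothetical; where such a check would hook into the kexec path is
not decided here.

```python
def scratch_is_clean(scratch_ranges, poisoned_pfns):
    """Return True if no poisoned PFN lies inside any scratch range.

    scratch_ranges is a list of (start_pfn, end_pfn) half-open ranges.
    A caller would refuse the kexec when this returns False.
    """
    return not any(
        start <= pfn < end
        for pfn in poisoned_pfns
        for start, end in scratch_ranges
    )
```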

Normally, all allocations between setup_arch() and memblock_free_all()
_also_ happen from scratch memory, so this check would solve the first
problem too... but I recently added patches [0] to change this. So I
think we do need to identify the poisoned pages early in boot.

[0] https://lore.kernel.org/kexec/20260605183501.3884950-16-pratyush@kernel.org/

[...]

-- 
Regards,
Pratyush Yadav



* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
  2026-06-24 13:40 ` Pratyush Yadav
@ 2026-06-24 14:44   ` Rik van Riel
  2026-06-24 15:17     ` Pratyush Yadav
  0 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 2026-06-24 14:44 UTC
  To: Pratyush Yadav, Breno Leitao
  Cc: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, kexec, linux-mm, rneu, caggio, kas

On Wed, 2026-06-24 at 15:40 +0200, Pratyush Yadav wrote:
> 
> Also, what happens on cold reboot? If the HW does not remember bad
> pages, won't the kernel be in the same position? How does it know the
> bad pages on a cold boot?

Some modern server hardware will simply unmap known
bad pages from the physical page map, so they will
not be exposed to the OS after a cold reboot.

The hardware keeps a log of uncorrectable memory
errors somewhere persistent, for example in the SEL
(System Event Log).

> 
> 
> > 
> > This PoC
> > ========
> > 
> >   * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
> >     HandOver) to carry the poison list between kernels.
> > 
> >   * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
> >     chokepoint for every poison/unpoison event - and records each
> >     poisoned PFN into a vmalloc array that KHO preserves across the
> >     kexec, described by a small versioned "hwpoison" subtree.
> 
> More of an implementation detail, but with vmalloc array, what if you
> have too many poisoned pages?
> > 

If a very large amount of memory is broken, you
should probably just repair the hardware.

Page poisoning is good for localized memory
failures, but not for failures that extend across
much of a memory chip.

> 
> 
> 
-- 
All Rights Reversed.



* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
  2026-06-24 14:44   ` Rik van Riel
@ 2026-06-24 15:17     ` Pratyush Yadav
  0 siblings, 0 replies; 6+ messages in thread
From: Pratyush Yadav @ 2026-06-24 15:17 UTC
  To: Rik van Riel
  Cc: Pratyush Yadav, Breno Leitao, nao.horiguchi, linmiaohe, david,
	lance.yang, akpm, baoquan.he, rppt, kexec, linux-mm, rneu, caggio,
	kas

On Wed, Jun 24 2026, Rik van Riel wrote:

> On Wed, 2026-06-24 at 15:40 +0200, Pratyush Yadav wrote:
>> 
>> Also, what happens on cold reboot? If the HW does not remember bad
>> pages, won't the kernel be in the same position? How does it know the
>> bad pages on a cold boot?
>
> Some modern server hardware will simply unmap known
> bad pages from the physical page map, so they will
> not be exposed to the OS after a cold reboot.
>
> The hardware keeps a log of uncorrectable memory
> errors somewhere in memory, for example in the SEL.
>
>> 
>> 
>> > 
>> > This PoC
>> > ========
>> > 
>> >   * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
>> >     HandOver) to carry the poison list between kernels.
>> > 
>> >   * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
>> >     chokepoint for every poison/unpoison event - and records each
>> >     poisoned PFN into a vmalloc array that KHO preserves across the
>> >     kexec, described by a small versioned "hwpoison" subtree.
>> 
>> More of an implementation detail, but with vmalloc array, what if you
>> have too many poisoned pages?
>> > 
>
> If a very large amount of memory is broken, you
> should probably just repair the hardware.

"large" is relative. On a 2 TiB system, if you have 0.5% of pages
poisoned (I have no idea if that number is realistic), you have 10 GiB
of memory poisoned, or around 2.6 million pages. To store all their
PFNs, you need around 20 MiB of memory.

While not too large, it isn't trivial either.
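
The estimate can be reproduced with a few lines, assuming 4 KiB pages
and one 8-byte entry per PFN (neither is stated explicitly above, so
treat both as assumptions):

```python
def pfn_array_size(total_bytes, poisoned_per_mille, page_size=4096,
                   entry_size=8):
    """Return (poisoned_pages, array_bytes) for a flat PFN array.

    poisoned_per_mille is the poisoned share in thousandths, so 0.5%
    is passed as 5. Integer math avoids floating-point rounding.
    """
    total_pages = total_bytes // page_size
    poisoned_pages = total_pages * poisoned_per_mille // 1000
    return poisoned_pages, poisoned_pages * entry_size


# 2 TiB of RAM with 0.5% of pages poisoned.
pages, array_bytes = pfn_array_size(2 << 40, 5)
```

Without rounding 10.24 GiB down to 10 GiB, this gives slightly under
2.7 million pages and a little over 20 MiB.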

I think a statically sized structure like a vmalloc array is likely not the
way to go here, especially when we have better things like KHO block or the
KHO radix tree.

Between those two, what is more efficient largely depends on how many
pages you'd typically see poisoned and what their locations tend to be.
That I think we can dive deeper into when we take a closer look at the
patches.

>
> Page poisoning is good for localized memory
> failures, but not for failures that extend across
> much of a memory chip.


-- 
Regards,
Pratyush Yadav

