* Re: [PATCH v7 0/9] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-26 14:53 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Nick Desaulniers, Bill Wendling, Justin Stitt, Jonathan Corbet,
Shuah Khan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, linux-kernel,
linux-trace-kernel, linux-kbuild, bpf, llvm, linux-doc,
kernel-team, Nicolas Schier
In-Reply-To: <20260626233327.b5c9c8de494acdde4ddf5c02@kernel.org>
On Fri, Jun 26, 2026 at 11:33:27PM +0900, Masami Hiramatsu wrote:
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> >
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> >
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> >
>
> Thanks for update!! This looks good to me.
> Let me pick it and test it.
This is great. Thanks for it and for the support so far.
--breno
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-26 15:28 UTC (permalink / raw)
To: Yan Zhao
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <aj3TGLGWT1kMFIVH@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> >> Sean Christopherson <seanjc@google.com> writes:
>> >>
>> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> >> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> >> > > > > return -EIO;
>> >> >> > > > >
>> >> >> > > > > - if (!src_page)
>> >> >> > > > > - return -EOPNOTSUPP;
>> >> >> > > > > + if (!src_page) {
>> >> >> > > > > + if (!gmem_in_place_conversion)
>> >> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> >> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> >> >> > > > error.
>> >> >> > >
>> >> >> > > Why MMAP?
>> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >> >> >
>> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
>> >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> >> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
>> >> >> > > initialize the memory.
>> >> >> > Do you mean using up-to-date flag as below?
>> >> >
>> >> > Yes? I didn't actually look at the implementation details.
>> >> >
>> >> >> > if (!src_page) {
>> >> >> > src_page = pfn_to_page(pfn);
>> >> >> > if (!folio_test_uptodate(page_folio(src_page)))
>> >> >> > return -EOPNOTSUPP;
>> >> >> > }
>> >>
>> >> Yan is right that with the earlier patch "Zero page while getting pfn",
>> >> folio_test_uptodate() here will always return true.
>> >>
>> >> Actually, this is an alternative fix for the issue Sashiko pointed out
>> >> on v7 where userspace can do a populate() (either TDX or SNP) without
>> >> first allocating the page, with src_address == NULL, and leak
>> >> uninitialized memory into the guest.
>> >>
>> >> Advantage of using the uptodate check in populate: if the host never
>> >> allocates the page, populate doesn't incur zeroing before writing the
>> >> page anyway in populate().
>> >>
>> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> >> check. guest_memfd can't check centrally because for SNP, for a
>> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> >> firmware will zero and there's no leakage of uninitialized host memory?
>> > Another disadvantage: the uptodate flag is per-folio. What if the folio
>> > is only partially initialized by the userspace especially after huge page is
>> > supported?
>> >
>>
>> Good point on huge pages!
>>
>> The uptodate flag on the folio in guest_memfd means "this folio has been
>> written to". As of now (before patch at [1]), this happens when
>>
>> + folio is zeroed on first use by userspace
>> + folio is zeroed on first use of the guest
>> + folio is populated
>>
>> When huge pages are supported, the folio can't partially be initialized?
>>
>> On allocation, if any part is shared, we split the page. The parts are
>> separate folios that have their own uptodate flags.
>>
>> On splitting, if the huge page is uptodate, the split pages will also be
>> uptodate. If the huge page is not uptodate, the split pages won't be
>> uptodate, but that's ok since they will be marked uptodate on first use.
>>
>> On merging, the non-uptodate parts have to be zeroed and then marked
> If that's true, it would be good.
>
>> uptodate. Any parts that are in use would have been marked uptodate
>> already, so there's no overwriting data that is in use. I'll need to
>> think more about when it's safe to zero.
>>
>> I'm still on the fence between the two options
>>
>> 1. Using uptodate check in populate to reject src_pages that have never
>> been written to or
>> 2. Always zero before populate
> 2 does not work?
> The flow is
> 1. mmap gmem_fd, make GFN shared, and write initial content.
> 2. convert GFN to private
> 3. invoke ioctl to trigger populate.
>
This flow is correct, is what users of in-place conversion should do.
"Always" is the wrong word, I should have said "zero if not uptodate
before populate", as in, with patch at [1].
By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
gets the pfn, the page would be zeroed, either because userspace faulted
it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
userspace never faulted it in, the zeroing would happen because
populate() allocated the page.
>> but whether the uptodate flag is per-folio or not doesn't affect these
>> two options in terms of fixing the leak of uninitialized host memory,
>> right?
> yes, provided "On merging, the non-uptodate parts have to be zeroed and then
> marked uptodate".
>
Thank you so much for bringing this up, I hadn't considered this
before. I'll do that when I get to guest_memfd hugepage restructuring.
>> >
>> >> >> Another concern with this fix is that:
>> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> >> >> folio uptodate before reaching post_populate().
>> >> >>
>> >> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>> >> >>
>> >> >> > One concern is that TDX now does not much care about the up-to-date flag since
>> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> >> >> > user access.
>> >> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> >> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >> >
>> >> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> >> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> >> >> > > > 1. create guest_memfd with MMAP flag
>> >> >> > > > 2. mmap the guest_memfd.
>> >> >> > > > 3. convert the initial memory range to shared.
>> >> >> > > > 4. copy initial content to the source page.
>> >> >> > > > 5. convert the initial memory range to private
>> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> >> > > > 7. do not unmap the source backend.
>> >> >> > > >
>> >> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> >> > >
>> >> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
>> >> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>> >>
>> >> Yan, is your concern that userspace forgot to update the code and
>> >> forgets to provide a src_page, and if we keep the "Zero page while
>> > Yes. Previously, it would be rejected after GUP fails.
>> >
>>
>> I see, didn't realize previously it would be rejected because GUP
>> fails. GUP failed because it wasn't faulted into the host?
> GUP fails if 0 is not a valid user address.
> But GUP would not fail if 0 is a valid address. e.g., in below scenario:
>
> #include <sys/mman.h>
> #include <stdio.h>
> int main(void)
> {
> void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
> if (p==MAP_FAILED) {
> perror("mmap");
> return 1;
> }
> *(char*)0='Y';
> printf("addr0=%p val=%c\n",p,*(char*)0);
> return 0;
> }
>
>
>> That's kind of orthogonal, I don't think GUP fail leading to rejecting
>> populate was meant to help userspace catch these issues. GUP would also
>> fail if the user did mmap(), write to it, unmap using
>> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
> The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
> rejected depended on whether the user mmap()'d address 0. If 0 was a valid
> mapping, populate() could proceed.
>
> commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
> guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
> uaddr.
>
I see, I only looked at this after commit 2a62345b3052.
> But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
> source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
> treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
>
I'd say the original uAPI perhaps just didn't document 0 as an
unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
was perhaps accidentally changed and no customer complained, I think we
can move forward with 0 as an invalid src_address? I wouldn't think
anyone relies on 0 intentionally being a valid address.
I could document that, if it helps?
>
>> >> getting pfn" patch, ends up with the guest silently having a zero page?
>> >> I think that would be found quite early in userspace VMM testing...
>> > I actually encountered this during testing this patch.
>> > I update most code path to follow this sequence. However, still some corner ones
>> > for TDVF HOB, which are less obvious and harder to update.
>> > The TD just booted up and hang silently.
>> >
>>
>> I think this is just the life of a close-to-hardware software engineer
>> :P no errors, got stuck somewhere, root cause is some unitialized
>> thing.
>>
>> >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> >> >> > kernel to detect this mistake, similar to how it validates whether source_addr
>> >> >> > is PAGE_ALIGNED.
>> >> >
>> >> > The alignment case is different. If userspace provides an unaligned value, KVM
>> >> > *can't* do what userspace is asking because hardware and thus KVM only supports
>> >> > converting on page boundaries.
>> >> >
>> >> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
>> >> > request would then be making assumptions about what userspace wants.
>> >> >
>> >>
>> >> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> >> allocation, actually wants to rely on that to get a zero page in the guest?
>> > What if 0 uaddr is a valid address? :)
>> >
>> >> >> > Since userspace already needs to perform additional steps to enable in-place
>> >> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> >> >> > intentional seems like a reasonable burden.
>> >> >
>> >> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
>> >> > just wen up with code that does:
>> >> >
>> >> > if (in-place) {
>> >> > src = NULL;
>> >> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> >> > }
>> >>
^ permalink raw reply
* [PATCH v10 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
The first entry of error_states[],
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
is unreachable. identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():
* memory_failure() reaches identify_page_state() only after
get_hwpoison_page() returned 1. get_any_page() reaches that
return only via __get_hwpoison_page(), which only takes a
refcount when the page is HWPoisonHandlable().
HWPoisonHandlable() is an allowlist for LRU, free-buddy, and
(for soft-offline) movable_ops pages -- PG_reserved pages do
not satisfy any of these, so they fail with -EBUSY/-EIO long
before identify_page_state() runs.
* try_memory_failure_hugetlb() reaches identify_page_state() only
via the MF_HUGETLB_IN_USED branch, where the page is necessarily
a hugetlb folio. hugetlb folios don't carry PG_reserved at that
point: hugetlb_folio_init_vmemmap() calls __folio_clear_reserved()
during init, so the reserved entry would not match even if it
were still present.
me_kernel() never executes and the entry exists only to be matched
against by code that cannot see it.
Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro. Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.
No functional change.
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c4055..f4d3e6e20e13f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -980,17 +980,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
return false;
}
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
- unlock_page(p);
- return MF_IGNORED;
-}
-
/*
* Page in unknown state. Do nothing.
* This is a catch-all in case we fail to make sense of the page state.
@@ -1199,10 +1188,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
#define mlock (1UL << PG_mlocked)
#define lru (1UL << PG_lru)
#define head (1UL << PG_head)
-#define reserved (1UL << PG_reserved)
static struct page_state error_states[] = {
- { reserved, reserved, MF_MSG_KERNEL, me_kernel },
/*
* free pages are specially detected outside this table:
* PG_buddy pages only make a small fraction of all free pages.
@@ -1234,7 +1221,6 @@ static struct page_state error_states[] = {
#undef mlock
#undef lru
#undef head
-#undef reserved
static void update_per_node_mf_stats(unsigned long pfn,
enum mf_result result)
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
get_any_page() collapses every HWPoisonHandlable() rejection into a
single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
-> retry path. That is correct for the transient case (a userspace
folio briefly off LRU during migration or compaction, which a later
shake can drag back), but wrong for stable kernel-owned pages: slab,
page-table, large-kmalloc and PG_reserved pages will never become
HWPoisonHandlable(), so the retry loop is wasted work and the final
-EIO loses the "this is structurally unrecoverable" information.
memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
panic-on-unrecoverable sysctl deliberately does not act on.
Introduce is_kernel_owned_page(), a small predicate that positively
identifies pages the hwpoison handler cannot recover from:
is_kernel_owned_page(p) :=
PageReserved(p) ||
PageSlab(head) || PageTable(head) || PageLargeKmalloc(head)
where head = compound_head(p).
PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
page directly. The slab, page-table and large-kmalloc page-type bits
are only stored on the head page, so those tests resolve the compound
head first, then re-read compound_head(page) afterwards: a concurrent
split or compound free that moves head invalidates the just-read flags
and the loop retries. The lookup still takes no refcount, mirroring
the rest of get_any_page(); the recheck closes the common split race,
and a residual free->alloc->free in the same window can only mis-tag
a genuinely poisoned page, never reclassify a handlable one.
No MF_SOFT_OFFLINE / page_has_movable_ops() opt-out is needed: a
movable_ops page is always PageOffline or PageZsmalloc, whose
page_type is mutually exclusive with slab, page-table and
large-kmalloc, and it never carries PG_reserved, so it can never
match any of the checks above.
The list is intentionally not exhaustive. vmalloc and kernel-stack
pages, for example, do not carry a page_type bit and would need a
different oracle; they keep going through the existing retry path
unchanged. This is the smallest set we can identify with certainty
by page type.
Wire the helper into the top of get_any_page() to short-circuit
those pages before the retry loop runs. On a hit, drop the caller's
MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
straight away. Pages outside the helper's positive list still take
the existing retry path and return -EIO, leaving operator-visible
behaviour for those cases unchanged.
Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc to document the new return.
memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch on its own only changes the errno that soft_offline_page()
can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
through memory_failure() and reports MF_MSG_KERNEL for the
unrecoverable cases, which is what the
panic_on_unrecoverable_memory_failure sysctl observes.
Suggested-by: David Hildenbrand <david@kernel.org>
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 48 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f4d3e6e20e13f..d08fbd0d8c39f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1325,6 +1325,36 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
return PageLRU(page) || is_free_buddy_page(page);
}
+/*
+ * Positive identification of pages the hwpoison handler cannot recover:
+ * pages owned by kernel internals with no userspace mapping to unmap, no
+ * file mapping to invalidate, and no migration target.
+ */
+static inline bool is_kernel_owned_page(struct page *page)
+{
+ struct page *head;
+ bool kernel_owned;
+
+ /* PG_reserved is a per-page flag, never set on a compound page. */
+ if (PageReserved(page))
+ return true;
+
+ /*
+ * Page-type bits live only on the head page, so resolve any tail
+ * first. The check takes no refcount; recheck the head afterwards
+ * so a concurrent split or compound free cannot leave us trusting
+ * a stale view. A free->alloc->free in the same window is still
+ * possible but closing it would require taking a reference here.
+ */
+retry:
+ head = compound_head(page);
+ kernel_owned = PageSlab(head) || PageTable(head) ||
+ PageLargeKmalloc(head);
+ if (head != compound_head(page))
+ goto retry;
+ return kernel_owned;
+}
+
static int __get_hwpoison_page(struct page *page, unsigned long flags)
{
struct folio *folio = page_folio(page);
@@ -1371,6 +1401,19 @@ static int get_any_page(struct page *p, unsigned long flags)
if (flags & MF_COUNT_INCREASED)
count_increased = true;
+ /*
+ * Page types we know are kernel-owned and cannot be recovered.
+ * Short-circuit before the shake_page() / retry loop, which
+ * cannot turn any of these into something HWPoisonHandlable().
+ * Drop the caller's reference if MF_COUNT_INCREASED took one.
+ */
+ if (is_kernel_owned_page(p)) {
+ if (count_increased)
+ put_page(p);
+ ret = -ENOTRECOVERABLE;
+ goto out;
+ }
+
try_again:
if (!count_increased) {
ret = __get_hwpoison_page(p, flags);
@@ -1418,7 +1461,7 @@ static int get_any_page(struct page *p, unsigned long flags)
ret = -EIO;
}
out:
- if (ret == -EIO)
+ if (ret == -EIO || ret == -ENOTRECOVERABLE)
pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
return ret;
@@ -1475,7 +1518,10 @@ static int __get_unpoison_page(struct page *page)
* -EIO for pages on which we can not handle memory errors,
* -EBUSY when get_hwpoison_page() has raced with page lifecycle
* operations like allocation and free,
- * -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ * -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ * -ENOTRECOVERABLE for kernel-owned pages identified by
+ * is_kernel_owned_page() (PG_reserved, slab,
+ * page-table, large-kmalloc) that the handler cannot recover.
*/
static int get_hwpoison_page(struct page *p, unsigned long flags)
{
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, page tables,
large-kmalloc). memory_failure() still folds every negative return
into MF_MSG_GET_HWPOISON, so callers that want to react to the
unrecoverable cases (a panic option, smarter logging) cannot tell
them apart from transient page-allocator races.
Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON. case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.
The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, page tables and large-kmalloc
pages.
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d08fbd0d8c39f..8e2aa2fafc14e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2434,7 +2434,8 @@ int memory_failure(unsigned long pfn, int flags)
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
res = get_hwpoison_page(p, flags);
- if (!res) {
+ switch (res) {
+ case 0:
if (is_free_buddy_page(p)) {
if (take_page_off_buddy(p)) {
page_ref_inc(p);
@@ -2453,7 +2454,19 @@ int memory_failure(unsigned long pfn, int flags)
res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
}
goto unlock_mutex;
- } else if (res < 0) {
+ case 1:
+ /* Got a refcount on a handlable page. */
+ break;
+ case -ENOTRECOVERABLE:
+ /*
+ * Stable unhandlable kernel-owned page (PG_reserved,
+ * slab, page tables, large-kmalloc).
+ * No recovery possible.
+ */
+ res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+ goto unlock_mutex;
+ default:
+ /* Transient lifecycle race with the page allocator. */
res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
goto unlock_mutex;
}
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
default) that triggers a kernel panic when memory_failure()
encounters pages that cannot be recovered. This provides a clean
crash with useful debug information rather than allowing silent
data corruption or a delayed crash at an unrelated code path.
Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics. After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
large-kmalloc).
All other action types are excluded:
- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
transient refcount races with the page allocator (an in-flight buddy
allocation has refcount 0 and is no longer on the buddy free list,
briefly), and panicking on them would risk killing the box for what
is actually a recoverable userspace page.
- MF_MSG_UNKNOWN means identify_page_state() could not classify the
page; that is precisely the wrong basis for a panic decision.
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8e2aa2fafc14e..611160c98c6f6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "panic_on_unrecoverable_memory_failure",
+ .data = &sysctl_panic_on_unrecoverable_mf,
+ .maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
}
};
@@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
++mf_stats->total;
}
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+ enum mf_result result)
+{
+ if (!sysctl_panic_on_unrecoverable_mf)
+ return false;
+
+ return type == MF_MSG_KERNEL && result == MF_IGNORED;
+}
+
/*
* "Dirty/Clean" indication is not 100% accurate due to the possibility of
* setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1272,6 +1292,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
pr_err("%#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
+ if (panic_on_unrecoverable_mf(type, result))
+ panic("Memory failure: %#lx: unrecoverable page", pfn);
+
return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
}
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 5/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++++
1 file changed, 80 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index b9b0c218bfb44..22cc54cac3b21 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
- page-cluster
- page_lock_unfairness
- panic_on_oom
+- panic_on_unrecoverable_memory_failure
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
@@ -925,6 +926,85 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation. This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting kernel-owned pages that the handler cannot recover:
+``PageReserved`` (firmware reservations, kernel image, vDSO, zero
+page, and similar memblock-reserved regions), ``PageSlab``,
+``PageTable``, and ``PageLargeKmalloc``. These are owned by the
+kernel and the memory failure handler cannot reliably evict their
+contents.
+
+Other unrecoverable kernel-owned populations (vmalloc allocations,
+kernel stack pages, ...) are not currently covered because the
+handler has no page-type signal that distinguishes them from a
+userspace folio temporarily off the LRU during migration or
+compaction. Such pages still go through the standard
+MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them and a
+delayed crash on the next access remains possible. Coverage may
+grow as the handler gains stronger kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies. Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+ regularly and post-mortem analysis of an unrelated downstream crash
+ (often seconds to minutes after the original error) consumes
+ significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+ hardware error produces a vmcore that still contains the faulting
+ address, the affected page state, and the originating MCE/GHES
+ record — context that is typically lost by the time a delayed crash
+ occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+ failure for failover, and prefer an immediate panic over silent data
+ corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+ tools such as ``mce-inject`` or error-injection debugfs interfaces,
+ where panicking on the unrecoverable path makes regressions
+ immediately visible instead of surfacing as later, unrelated
+ failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately. If the ``panic`` sysctl is also non-zero then the
+ machine will be rebooted.
+= =====================================================================
+
+Example::
+
+ echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
percpu_pagelist_high_fraction
=============================
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 6/6] selftests/mm: add hwpoison-panic destructive test
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
Add a destructive selftest that verifies
vm.panic_on_unrecoverable_memory_failure actually panics when a
hwpoison error hits a kernel-owned page.
Three "kinds" of kernel-owned page can be targeted, selectable via
the script's first positional argument (default: rodata):
rodata - a PG_reserved page in the kernel rodata range, sourced
from the "Kernel rodata" sub-resource of "System RAM" in
/proc/iomem. That entry is reported on every major
architecture and guarantees the chosen PFN is backed by
struct page (an online System RAM range, not a firmware
hole), is PG_reserved, and is read-only -- so even if
the panic fails to fire for some reason, the resulting
PG_hwpoison marker on rodata does not corrupt writable
kernel state.
slab - a slab page found by walking /proc/kpageflags for the
first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
/ KPF_COMPOUND_TAIL clear). Exercises the get_any_page()
path on a non PG_reserved kernel-owned page and so
catches regressions where get_any_page() collapses
kernel-owned pages into a transient -EIO instead of
-ENOTRECOVERABLE.
pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.
PageLargeKmalloc, the fourth page type matched by
is_kernel_owned_page(), is intentionally not covered: it is a
PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
a PFN from userspace is not feasible. The slab and pgtable
variants already exercise the same get_any_page() positive-check
branch.
The script enables the sysctl and writes the selected physical
address to /sys/devices/system/memory/hard_offline_page. A
successful run crashes the kernel with
Memory failure: <pfn>: unrecoverable page
A return from the inject means no panic fired. Before reporting, the
script restores the sysctl and best-effort unpoisons the target PFN
through the hwpoison debugfs interface (hard_offline_page() injects
with MF_SW_SIMULATED, so the page stays unpoisonable), then re-reads
/proc/kpageflags: a PFN that is still the kernel-owned type it selected
is a genuine failure, while one that raced to a different type before
the inject is skipped as inconclusive. Test outcome is therefore
observed externally (serial console, kdump) rather than from the
script's own exit code.
The script is intentionally NOT wired into run_vmtests.sh: every
successful run panics the kernel, which is incompatible with the
sequential "run each category in the same VM" model that
run_vmtests.sh assumes. It is also not registered as a TEST_PROGS /
ksft_* wrapper so a default kselftest run does not opt itself into
a panic. The script is meant to be executed manually inside a
disposable VM (e.g. virtme-ng), one variant per VM boot, and
requires RUN_DESTRUCTIVE=1 in the environment as a safety net.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/testing/selftests/mm/Makefile | 4 +
tools/testing/selftests/mm/hwpoison-panic.sh | 249 +++++++++++++++++++++++++++
2 files changed, 253 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971c..ed321ae709dac 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -174,6 +174,10 @@ TEST_PROGS += ksft_userfaultfd.sh
TEST_PROGS += ksft_vma_merge.sh
TEST_PROGS += ksft_vmalloc.sh
+# Destructive: every successful run panics the kernel. Installed and
+# kept executable, but not run from a default kselftest invocation.
+TEST_PROGS_EXTENDED += hwpoison-panic.sh
+
TEST_FILES := test_vmalloc.sh
TEST_FILES += test_hmm.sh
TEST_FILES += va_high_addr_switch.sh
diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
new file mode 100755
index 0000000000000..aafc06e895d01
--- /dev/null
+++ b/tools/testing/selftests/mm/hwpoison-panic.sh
@@ -0,0 +1,249 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
+# error on a kernel-owned page and confirming the kernel panics.
+#
+# Three "kinds" of kernel-owned page can be targeted, selectable via the
+# first positional argument (default: rodata):
+#
+# rodata - a PG_reserved page in the kernel rodata range
+# (sourced from /proc/iomem "Kernel rodata"). Exercises
+# memory_failure() -> get_any_page() on a PageReserved page.
+#
+# slab - a slab page found via /proc/kpageflags (KPF_SLAB).
+# Exercises memory_failure() -> get_any_page() on a non
+# PG_reserved kernel-owned page. This path is what catches
+# regressions where get_any_page() collapses kernel-owned
+# pages into a transient -EIO instead of -ENOTRECOVERABLE.
+#
+# pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
+# Same path as slab, different page type.
+#
+# This test is DESTRUCTIVE: a successful run crashes the kernel. It is
+# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
+# serial console captured by the harness. It is skipped unless the
+# caller opts in via RUN_DESTRUCTIVE=1.
+#
+# Test passes externally: the kernel must panic with
+# "Memory failure: <pfn>: unrecoverable page"
+# A return from the inject means no panic fired: that is a failure,
+# unless the target PFN raced to a different page type before injection,
+# in which case the run is inconclusive and is skipped.
+#
+# Author: Breno Leitao <leitao@debian.org>
+
+set -u
+
+ksft_skip=4
+sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
+inject_path=/sys/devices/system/memory/hard_offline_page
+kpageflags_path=/proc/kpageflags
+unpoison_path=/sys/kernel/debug/hwpoison/unpoison-pfn
+
+# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
+KPF_SLAB=7
+KPF_COMPOUND_TAIL=16
+KPF_HWPOISON=19
+KPF_NOPAGE=20
+KPF_PGTABLE=26
+KPF_RESERVED=32
+
+pagesize=$(getconf PAGE_SIZE)
+
+kind=${1:-rodata}
+
+ksft_print() { echo "# $*"; }
+ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
+ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
+
+if [ "$(id -u)" -ne 0 ]; then
+ ksft_exit_skip "must run as root"
+fi
+
+if [ ! -w "$sysctl_path" ]; then
+ ksft_exit_skip "$sysctl_path not present (kernel without the sysctl?)"
+fi
+
+if [ ! -w "$inject_path" ]; then
+ ksft_exit_skip "$inject_path not present (no MEMORY_HOTPLUG?)"
+fi
+
+if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
+ ksft_exit_skip "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
+fi
+
+# Pick a PFN inside the kernel image rodata region of /proc/iomem.
+# This is preferred over a top-level "Reserved" entry because top-level
+# Reserved ranges are often firmware holes that have no backing struct
+# page; pfn_to_online_page() returns NULL on those and memory_failure()
+# bails out with -ENXIO before reaching the panic path.
+#
+# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
+# major architecture, which guarantees:
+# - the PFN is backed by struct page (within an online memory range);
+# - PG_reserved is set on the page (kernel image area);
+# - the memory is read-only, so setting PG_hwpoison on it does not
+# corrupt writable kernel state if the panic somehow does not fire.
+#
+# /proc/iomem entries look like (indented for sub-resources):
+# " 02500000-02ffffff : Kernel rodata"
+pick_rodata_phys_addr() {
+ awk -v pagesize="$(getconf PAGE_SIZE)" '
+ # Convert a hex string to a number without relying on the gawk-only
+ # strtonum(). mawk lacks it and would otherwise spuriously skip
+ # this test on distros that ship mawk as /usr/bin/awk.
+ function hex2num(s, n, i, c, v) {
+ n = 0
+ for (i = 1; i <= length(s); i++) {
+ c = tolower(substr(s, i, 1))
+ v = index("0123456789abcdef", c) - 1
+ if (v < 0)
+ return -1
+ n = n * 16 + v
+ }
+ return n
+ }
+ /: Kernel rodata[[:space:]]*$/ {
+ sub(/^[[:space:]]+/, "")
+ n = split($0, a, /[- ]/)
+ start = hex2num(a[1])
+ end = hex2num(a[2])
+ if (end <= start)
+ next
+ # Page-align upward and emit the first byte of that page.
+ pfn = int((start + pagesize - 1) / pagesize)
+ printf "0x%x\n", pfn * pagesize
+ exit 0
+ }
+ ' /proc/iomem
+}
+
+# Walk /proc/kpageflags and return the phys addr of the first PFN that
+# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
+# all clear (so we attack a real, non-tail, not-already-poisoned page).
+#
+# We skip the first 16 MiB of PFNs to step past low-memory special
+# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
+# exhibit the slab/pgtable type we are looking for.
+pick_kpageflags_phys_addr() {
+ local want_bit=$1
+ local pagesize skip_pfn
+
+ [ -r "$kpageflags_path" ] || return
+
+ pagesize=$(getconf PAGE_SIZE)
+ skip_pfn=$(((16 * 1024 * 1024) / pagesize))
+
+ od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
+ awk -v want_bit="$want_bit" \
+ -v hwp_bit="$KPF_HWPOISON" \
+ -v nopage_bit="$KPF_NOPAGE" \
+ -v tail_bit="$KPF_COMPOUND_TAIL" \
+ -v base_pfn="$skip_pfn" \
+ -v pagesize="$pagesize" '
+ # Test whether bit "b" is set in the 16-hex-digit value "hex".
+ # Done with substring + per-digit lookup so we never rely on awk
+ # bitwise operators (mawk lacks them), 64-bit FP precision or the
+ # gawk-only strtonum().
+ function bit_set(hex, b, di, bi, c, v) {
+ di = int(b / 4)
+ bi = b - di * 4
+ c = substr(hex, length(hex) - di, 1)
+ v = index("0123456789abcdef", tolower(c)) - 1
+ if (bi == 0) return (v % 2) == 1
+ if (bi == 1) return int(v / 2) % 2 == 1
+ if (bi == 2) return int(v / 4) % 2 == 1
+ return int(v / 8) % 2 == 1
+ }
+ {
+ gsub(/^[[:space:]]+/, "")
+ h = $1
+ if (bit_set(h, want_bit) &&
+ !bit_set(h, hwp_bit) &&
+ !bit_set(h, nopage_bit) &&
+ !bit_set(h, tail_bit)) {
+ pfn = base_pfn + NR - 1
+ printf "0x%x\n", pfn * pagesize
+ exit 0
+ }
+ }
+ '
+}
+
+# Return 0 if /proc/kpageflags bit $2 is set for PFN $1, 1 if it is
+# clear, or 2 if the word cannot be read. Used to re-confirm the target
+# page type after a non-panicking inject.
+kpageflags_bit_set() {
+ local word
+
+ word=$(od -An -tx8 -v -j "$(($1 * 8))" -N 8 "$kpageflags_path" 2>/dev/null | tr -d '[:space:]')
+ [ -n "$word" ] || return 2
+ (( (16#$word >> $2) & 1 ))
+}
+
+# Best-effort: drop the PG_hwpoison marker set by the inject so a failed
+# run does not leave a poisoned page behind. hard_offline_page() injects
+# with MF_SW_SIMULATED, so the page stays unpoisonable through the
+# hwpoison debugfs interface (needs CONFIG_HWPOISON_INJECT + debugfs).
+try_unpoison() {
+ [ -w "$unpoison_path" ] || return 0
+ echo "$1" > "$unpoison_path" 2>/dev/null || true
+}
+
+case "$kind" in
+rodata)
+ phys_addr=$(pick_rodata_phys_addr)
+ recheck_bit=$KPF_RESERVED
+ missing_msg='no "Kernel rodata" entry in /proc/iomem'
+ ;;
+slab)
+ phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
+ recheck_bit=$KPF_SLAB
+ missing_msg="no usable slab PFN found in $kpageflags_path"
+ ;;
+pgtable)
+ phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
+ recheck_bit=$KPF_PGTABLE
+ missing_msg="no usable page-table PFN found in $kpageflags_path"
+ ;;
+*)
+ ksft_exit_fail "unknown kind '$kind' (expected: rodata|slab|pgtable)"
+ ;;
+esac
+
+if [ -z "$phys_addr" ]; then
+ ksft_exit_skip "$missing_msg"
+fi
+
+ksft_print "enabling $sysctl_path"
+prior=$(cat "$sysctl_path")
+echo 1 > "$sysctl_path" || ksft_exit_fail "failed to enable sysctl"
+
+pfn=$((phys_addr / pagesize))
+ksft_print "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (pfn 0x$(printf '%x' "$pfn"), kind=$kind)"
+ksft_print "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
+
+# A successful run never returns from the inject -- it panics the kernel.
+# Reaching the code below therefore means no panic fired. Note whether
+# the write itself succeeded, then put the machine back: restore the
+# sysctl and best-effort unpoison the page we just marked.
+if echo "$phys_addr" > "$inject_path"; then
+ verdict="inject returned without panic; sysctl ineffective"
+else
+ verdict="inject failed before reaching the panic path"
+fi
+
+echo "$prior" > "$sysctl_path"
+try_unpoison "$pfn"
+
+# The page type can change between selection and injection (e.g. a slab
+# or page-table page is freed and reused). Only treat a missing panic as
+# a failure if the target PFN is still the kernel-owned type we aimed at;
+# if it raced to another type the run is inconclusive, so skip instead.
+kpageflags_bit_set "$pfn" "$recheck_bit"
+case $? in
+0) ksft_exit_fail "$verdict (page still $kind)" ;;
+1) ksft_exit_skip "target PFN no longer $kind; raced before inject, inconclusive" ;;
+*) ksft_exit_fail "$verdict (could not reconfirm page type via $kpageflags_path)" ;;
+esac
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running. The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash. In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.
This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context. The default is disabled so existing workloads see no
change.
There is a selftest that test different cases, and I tested it using
the following variants:
┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
│ Variant │ PFN │ Result │
├─────────┼──────────┼───────────────────────────────────────────────────────────┤
│ rodata │ 0x2600 │ Panic with "Memory failure: 0x2600: unrecoverable page" │
├─────────┼──────────┼───────────────────────────────────────────────────────────┤
│ slab │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
├─────────┼──────────┼───────────────────────────────────────────────────────────┤
│ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
└─────────┴──────────┴───────────────────────────────────────────────────────────┘
Each one shows the same call trace, exactly the path the series builds:
hard_offline_page_store
→ memory_failure
→ action_result
→ panic("Memory failure: %#lx: unrecoverable page")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v10:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v9: https://lore.kernel.org/r/20260609-ecc_panic-v9-0-432a74002e74@debian.org
Changes in v9:
- HWPoisonKernelOwned(): wrap the head-page checks in a
compound_head() recheck loop so a concurrent split or compound free
cannot leave us trusting a stale view (Miaohe, Lance, David).
- selftest: drop the gawk-only strtonum() in hwpoison-panic.sh; do the
hex parsing with a small index()-based helper so the test no longer
spuriously skips itself on mawk-based distros (Sashiko).
- selftest: move hwpoison-panic.sh from TEST_FILES to
TEST_PROGS_EXTENDED so the script is installed executable rather
than as a non-executable data file (Sashiko).
- Link to v8: https://patch.msgid.link/20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org
Changes in v8:
- Commit message rewording (David)
- Add HWPoisonKernelOwned() helper (Lance)
- Removed patch "mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()"
- Broaden the selftest (Lance)
- Link to v7: https://patch.msgid.link/20260513-ecc_panic-v7-0-be2e578e61da@debian.org
Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
get_any_page() and surface it via -ENOTRECOVERABLE, per David
Hildenbrand's and Lance Yang's review of v6. This drops the
is_reserved snapshot in memory_failure() and the mf_get_page_status
enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
matching tracepoint string; the enum now covers both PG_reserved
pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
and 2/4 ("classify get_any_page() failures by reason") into a
single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
(David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org
Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org
Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
unrecoverable kernel page hwpoison events (reserved pages, refcount-0
non-buddy pages, unknown state), with a recheck to avoid racing with
concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
and add a selftest verifying SIGBUS recovery on userspace pages still
works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org
Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
get_hwpoison_page() and is_free_buddy_page()) cannot cause false
positives: PG_hwpoison is set beforehand and check_new_page() in the
page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
panic conditions (shared path with transient races and non-reserved
kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org
Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
how the retry mechanism mitigates false positives from buddy allocator
races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org
Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Breno Leitao (6):
mm/memory-failure: drop dead error_states[] entry for reserved pages
mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
mm/memory-failure: add panic option for unrecoverable pages
Documentation: document panic_on_unrecoverable_memory_failure sysctl
selftests/mm: add hwpoison-panic destructive test
Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++
mm/memory-failure.c | 104 +++++++++--
tools/testing/selftests/mm/Makefile | 4 +
tools/testing/selftests/mm/hwpoison-panic.sh | 249 +++++++++++++++++++++++++++
4 files changed, 419 insertions(+), 18 deletions(-)
---
base-commit: 30ffa8de54e5cc80d93fd211ca134d1764a7011f
change-id: 20260323-ecc_panic-4e473b83087c
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Masami Hiramatsu @ 2026-06-26 15:45 UTC (permalink / raw)
To: Steven Rostedt
Cc: Martin Kaiser, Masami Hiramatsu (Google), linux-trace-kernel,
linux-kernel
In-Reply-To: <20260626064223.60ffeeaa@fedora>
On Fri, 26 Jun 2026 06:42:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 26 Jun 2026 12:20:36 +0200
> Martin Kaiser <martin@kaiser.cx> wrote:
>
> > > That is, to have +u0() say "this is going to be dereferencing user space".
> >
> > > I'll add Martin's patch and see if it makes the above work.
> >
> > I've just tried your command with my patch. It works for me, filenames are
> > logged correctly.
>
> Yep, this definitely looks like a fix. We have;
>
> addr = rec + field->offset;
>
> Where addr points to the location of the field on the ring buffer, thus
> your change to make it:
>
> val = *(unsigned long *)addr;
>
> Reads the full "long size" of the event on the ring buffer, instead of
> reading just one byte. It is "val" that gets dereferenced later by the
> probe logic (the "+0u()"), which has all the protections we need.
>
> I'll queue this up.
I've already queued this on my probes/core branch.
(which will be probes/fixes)
Thanks,
>
> Thanks!
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Andrew Morton @ 2026-06-26 16:27 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
On Fri, 26 Jun 2026 08:33:14 -0700 Breno Leitao <leitao@debian.org> wrote:
> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running. The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash. In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
>
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context. The default is disabled so existing workloads see no
> change.
Cool, thanks. I added this to mm.git's mm-new branch. Next week I'll
move it into the mm-unstable branch, where it will receive linux-next
exposure.
Sashiko identified a few possible things, some pre-existing:
https://sashiko.dev/#/patchset/20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org
^ permalink raw reply
* Re: [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Nathan Chancellor @ 2026-06-26 19:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
Peter Zijlstra, Julia Lawall, Yury Norov, linux-doc, linux-kbuild,
linuxppc-dev, dri-devel, linux-stm32, linux-arm-kernel,
linux-rdma, linux-usb, linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <20260626045119.659d1e6b@fedora>
On Fri, Jun 26, 2026 at 04:51:19AM -0400, Steven Rostedt wrote:
> On Thu, 25 Jun 2026 16:41:58 -0700
> Nathan Chancellor <nathan@kernel.org> wrote:
>
>
> > The following diff resolves it for me, should I send it as a separate
> > patch or do you want to just fold it in with a note?
> >
> > diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> > index 621566345406..2301a701ffbb 100644
> > --- a/include/linux/lockdep.h
> > +++ b/include/linux/lockdep.h
> > @@ -10,6 +10,7 @@
> > #ifndef __LINUX_LOCKDEP_H
> > #define __LINUX_LOCKDEP_H
> >
> > +#include <linux/instruction_pointer.h>
>
> Ah, so the reason for this breakage is because lockdep was relying on
> instruction_pointer.h, that just happened to be included in kernel.h
> via trace_printk.h.
Correct.
> This is a separate issue, so it should be a separate patch. I'll add it
> as patch 1 of this series.
Sounds good, thanks!
> Can you send me the config you used. This didn't trigger in my tests.
It is a plain allmodconfig, for example on arm:
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- allmodconfig lib/test_context-analysis.o
In file included from include/linux/local_lock_internal.h:8,
from include/linux/local_lock.h:5,
from lib/test_context-analysis.c:9:
include/linux/local_lock_internal.h: In function 'local_lock_acquire':
include/linux/lockdep.h:541:87: error: '_THIS_IP_' undeclared (first use in this function)
541 | #define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
| ^~~~~~~~~
include/linux/lockdep.h:509:88: note: in definition of macro 'lock_acquire_exclusive'
509 | #define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i)
| ^
include/linux/local_lock_internal.h:46:9: note: in expansion of macro 'lock_map_acquire'
46 | lock_map_acquire(&l->dep_map);
| ^~~~~~~~~~~~~~~~
include/linux/lockdep.h:541:87: note: each undeclared identifier is reported only once for each function it appears in
...
I also reproduced it on top of allnoconfig:
$ cat allno.config
CONFIG_CONTEXT_ANALYSIS_TEST=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_EXPERT=y
CONFIG_MMU=y
CONFIG_RUNTIME_TESTING_MENU=y
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- KCONFIG_ALLCONFIG=1 clean allnoconfig lib/test_context-analysis.o
<same error as above>
--
Cheers,
Nathan
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default\
From: Sean Christopherson @ 2026-06-26 19:06 UTC (permalink / raw)
To: Yan Zhao
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aj3H2sxymOYTWTnE@yzhao56-desk.sh.intel.com>
On Fri, Jun 26, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 07:36:28AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 25, 2026, Yan Zhao wrote:
> > And I'm not remotely convinced that prepending allow_ to the param will help
> > end users diagnose "unexpected" memory consumption, in quotes because anyone that
> > is deploying a stack that utilizes out-of-place conversion absolutely needs to
> > understand and plan for the additional memory consumption. I.e. if the memory
> > consumption is "unexpected" to the end user, they likely have far bigger problems.
> My first impression of gmem_in_place_conversion=true was that it enforces gmem
> in-place conversion. However, it actually only enforces per-gmem private/shared
> attribute.
> My worry was that people might think it's a kernel bug if userspace can still
> have shared memory from other sources after they configured
> gmem_in_place_conversion=true.
Ah, I see where you're coming from. FWIW, truly enforcing in-place conversion
is flat out impossible. E.g. userspace can simply replace the memslot, at which
point the memory effectively reverts to shared.
> However, I have no strong opinion if you think gmem_in_place_conversion is good,
> and with the above documentation. :)
Ya, I think this largely a documentation problem. I agree that a param name
like gmem_private_memory_attributes would be more precise, but I think it'd be
far less informative for the vast majority of users that only care whether or
not KVM can do in-place conversion, and don't care about how that is done.
^ permalink raw reply
* [PATCH rfc 0/2] Improvements to ftrace comm[] handling
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Michal Koutný
Cc: David Laight
RFC because they are untested and I want to send them before going
on holiday for 2 weeks (should still have email access).
The first patch avoids a lot of 'potentially unsized' string
functions by embedding the char[] use to hold task->comm[] in a
structure.
The second adjsuts the data structure used to cache the task
names in the thread switch code.
David Laight (2):
tracing: Embed 'char comm[16]' in a structure
tracing: Keep pid and comm[] in the same structure
kernel/trace/blktrace.c | 28 +++----
kernel/trace/trace.c | 3 +-
kernel/trace/trace.h | 9 ++-
kernel/trace/trace_events_filter.c | 2 +-
kernel/trace/trace_events_hist.c | 26 +++---
kernel/trace/trace_functions_graph.c | 10 +--
kernel/trace/trace_output.c | 24 +++---
kernel/trace/trace_sched_switch.c | 113 ++++++++++++---------------
8 files changed, 101 insertions(+), 114 deletions(-)
--
2.39.5
^ permalink raw reply
* [PATCH 1/2] tracing: Embed 'char comm[16]' in a structure
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Michal Koutný
Cc: David Laight
In-Reply-To: <20260626212356.64150-1-david.laight.linux@gmail.com>
Embedding the array in a stucture makes the size explicit and lets
structure copies be used.
Limit the size to 16 charatacters even if task_struct.comm is
made larger (there are plans to increase it).
Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
kernel/trace/blktrace.c | 28 +++++++++----------
kernel/trace/trace.c | 3 +-
kernel/trace/trace.h | 9 ++++--
kernel/trace/trace_events_filter.c | 2 +-
kernel/trace/trace_events_hist.c | 26 +++++++----------
kernel/trace/trace_functions_graph.c | 10 +++----
kernel/trace/trace_output.c | 24 ++++++++--------
kernel/trace/trace_sched_switch.c | 42 ++++++++++++++++------------
8 files changed, 75 insertions(+), 69 deletions(-)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 8cd2520b4c99..68ffc95548b7 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1590,20 +1590,20 @@ static void blk_log_dump_pdu(struct trace_seq *s,
static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
- char cmd[TASK_COMM_LEN];
+ struct trace_comm cmd;
- trace_find_cmdline(ent->pid, cmd);
+ trace_find_cmdline(ent->pid, &cmd);
if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
trace_seq_printf(s, "%u ", t_bytes(ent));
blk_log_dump_pdu(s, ent, has_cg);
- trace_seq_printf(s, "[%s]\n", cmd);
+ trace_seq_printf(s, "[%s]\n", cmd.comm);
} else {
if (t_sec(ent))
trace_seq_printf(s, "%llu + %u [%s]\n",
- t_sector(ent), t_sec(ent), cmd);
+ t_sector(ent), t_sec(ent), cmd.comm);
else
- trace_seq_printf(s, "[%s]\n", cmd);
+ trace_seq_printf(s, "[%s]\n", cmd.comm);
}
}
@@ -1637,30 +1637,30 @@ static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bo
static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
- char cmd[TASK_COMM_LEN];
+ struct trace_comm cmd;
- trace_find_cmdline(ent->pid, cmd);
+ trace_find_cmdline(ent->pid, &cmd);
- trace_seq_printf(s, "[%s]\n", cmd);
+ trace_seq_printf(s, "[%s]\n", cmd.comm);
}
static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
- char cmd[TASK_COMM_LEN];
+ struct trace_comm cmd;
- trace_find_cmdline(ent->pid, cmd);
+ trace_find_cmdline(ent->pid, &cmd);
- trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
+ trace_seq_printf(s, "[%s] %llu\n", cmd.comm, get_pdu_int(ent, has_cg));
}
static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
- char cmd[TASK_COMM_LEN];
+ struct trace_comm cmd;
- trace_find_cmdline(ent->pid, cmd);
+ trace_find_cmdline(ent->pid, &cmd);
trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
- get_pdu_int(ent, has_cg), cmd);
+ get_pdu_int(ent, has_cg), cmd.comm);
}
static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..7de658b8ee0d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -52,6 +52,7 @@
#include <linux/sort.h>
#include <linux/io.h> /* vmap_page_range() */
#include <linux/fs_context.h>
+#include <linux/trace_printk.h>
#include <asm/setup.h> /* COMMAND_LINE_SIZE */
@@ -2972,7 +2973,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
seq_puts(m, "# -----------------\n");
seq_printf(m, "# | task: %.16s-%d "
"(uid:%d nice:%ld policy:%ld rt_prio:%ld)\n",
- data->comm, data->pid,
+ data->comm.comm, data->pid,
from_kuid_munged(seq_user_ns(m), data->uid), data->nice,
data->policy, data->rt_priority);
seq_puts(m, "# -----------------\n");
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..afd59d79e1fe 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -183,6 +183,11 @@ struct fexit_trace_entry_head {
struct trace_array;
+/* task_struct->comm[] may be truncated to save memory/width */
+struct trace_comm {
+ char comm[16];
+};
+
/*
* The CPU trace array - it consists of thousands of trace entries
* plus some other descriptor data: (for example which task started
@@ -203,7 +208,7 @@ struct trace_array_cpu {
u64 preempt_timestamp;
pid_t pid;
kuid_t uid;
- char comm[TASK_COMM_LEN];
+ struct trace_comm comm;
#ifdef CONFIG_FUNCTION_TRACER
int ftrace_ignore_pid;
@@ -906,7 +911,7 @@ void trace_last_func_repeats(struct trace_array *tr,
extern u64 ftrace_now(int cpu);
-extern void trace_find_cmdline(int pid, char comm[]);
+extern void trace_find_cmdline(int pid, struct trace_comm *comm);
extern int trace_find_tgid(int pid);
extern void trace_event_follow_fork(struct trace_array *tr, bool enable);
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 609325f57942..749887aff315 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -994,7 +994,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event)
int cmp;
cmp = pred->regex->match(current->comm, pred->regex,
- TASK_COMM_LEN);
+ sizeof(current->comm));
return cmp ^ pred->not;
}
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0dbbf6cca9bc..1b51491b2a41 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -680,7 +680,7 @@ struct track_data {
};
struct hist_elt_data {
- char *comm;
+ struct trace_comm *comm;
u64 *var_ref_vals;
char **field_var_str;
int n_field_var_str;
@@ -756,7 +756,7 @@ static struct track_data *track_data_alloc(unsigned int key_len,
data->elt.private_data = elt_data;
- elt_data->comm = kzalloc(TASK_COMM_LEN, GFP_KERNEL);
+ elt_data->comm = kzalloc_obj(*elt_data->comm);
if (!elt_data->comm) {
track_data_free(data);
return ERR_PTR(-ENOMEM);
@@ -1608,19 +1608,19 @@ parse_hist_trigger_attrs(struct trace_array *tr, char *trigger_str)
return ERR_PTR(ret);
}
-static inline void save_comm(char *comm, struct task_struct *task)
+static inline void save_comm(struct trace_comm *comm, struct task_struct *task)
{
if (!task->pid) {
- strcpy(comm, "<idle>");
+ strcpy(comm->comm, "<idle>");
return;
}
if (WARN_ON_ONCE(task->pid < 0)) {
- strcpy(comm, "<XXX>");
+ strcpy(comm->comm, "<XXX>");
return;
}
- strscpy(comm, task->comm, TASK_COMM_LEN);
+ strscpy(comm->comm, task->comm);
}
static void hist_elt_data_free(struct hist_elt_data *elt_data)
@@ -1646,7 +1646,6 @@ static void hist_trigger_elt_data_free(struct tracing_map_elt *elt)
static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
{
struct hist_trigger_data *hist_data = elt->map->private_data;
- unsigned int size = TASK_COMM_LEN;
struct hist_elt_data *elt_data;
struct hist_field *hist_field;
unsigned int i, n_str;
@@ -1659,7 +1658,7 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
hist_field = hist_data->fields[i];
if (hist_field->flags & HIST_FIELD_FL_EXECNAME) {
- elt_data->comm = kzalloc(size, GFP_KERNEL);
+ elt_data->comm = kzalloc_obj(*elt_data->comm);
if (!elt_data->comm) {
kfree(elt_data);
return -ENOMEM;
@@ -1677,8 +1676,6 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
BUILD_BUG_ON(STR_VAR_LEN_MAX & (sizeof(u64) - 1));
- size = STR_VAR_LEN_MAX;
-
elt_data->field_var_str = kcalloc(n_str, sizeof(char *), GFP_KERNEL);
if (!elt_data->field_var_str) {
hist_elt_data_free(elt_data);
@@ -1687,7 +1684,7 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
elt_data->n_field_var_str = n_str;
for (i = 0; i < n_str; i++) {
- elt_data->field_var_str[i] = kzalloc(size, GFP_KERNEL);
+ elt_data->field_var_str[i] = kzalloc(STR_VAR_LEN_MAX, GFP_KERNEL);
if (!elt_data->field_var_str[i]) {
hist_elt_data_free(elt_data);
return -ENOMEM;
@@ -3449,7 +3446,7 @@ static bool cond_snapshot_update(struct trace_array *tr, void *cond_data)
elt_data = context->elt->private_data;
track_elt_data = track_data->elt.private_data;
if (elt_data->comm)
- strscpy(track_elt_data->comm, elt_data->comm, TASK_COMM_LEN);
+ track_elt_data->comm = elt_data->comm;
track_data->updated = true;
@@ -5505,16 +5502,13 @@ static void hist_trigger_print_key(struct seq_file *m,
uval, (void *)(uintptr_t)uval);
} else if (key_field->flags & HIST_FIELD_FL_EXECNAME) {
struct hist_elt_data *elt_data = elt->private_data;
- char *comm;
if (WARN_ON_ONCE(!elt_data))
return;
- comm = elt_data->comm;
-
uval = *(u64 *)(key + key_field->offset);
seq_printf(m, "%s: %-16s[%10llu]", field_name,
- comm, uval);
+ elt_data->comm->comm, uval);
} else if (key_field->flags & HIST_FIELD_FL_SYSCALL) {
const char *syscall_name;
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 0d2d3a2ea7dd..e46c5aa0d4e4 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -561,19 +561,19 @@ static void print_graph_cpu(struct trace_seq *s, int cpu)
static void print_graph_proc(struct trace_seq *s, pid_t pid)
{
- char comm[TASK_COMM_LEN];
+ struct trace_comm comm;
/* sign + log10(MAX_INT) + '\0' */
char pid_str[12];
int spaces = 0;
int len;
int i;
- trace_find_cmdline(pid, comm);
- comm[7] = '\0';
+ trace_find_cmdline(pid, &comm);
+ comm.comm[7] = '\0';
sprintf(pid_str, "%d", pid);
/* 1 stands for the "-" character */
- len = strlen(comm) + strlen(pid_str) + 1;
+ len = strlen(comm.comm) + strlen(pid_str) + 1;
if (len < TRACE_GRAPH_PROCINFO_LENGTH)
spaces = TRACE_GRAPH_PROCINFO_LENGTH - len;
@@ -582,7 +582,7 @@ static void print_graph_proc(struct trace_seq *s, pid_t pid)
for (i = 0; i < spaces / 2; i++)
trace_seq_putc(s, ' ');
- trace_seq_printf(s, "%s-%s", comm, pid_str);
+ trace_seq_printf(s, "%s-%s", comm.comm, pid_str);
/* Last spaces to align center */
for (i = 0; i < spaces - (spaces / 2); i++)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..58405291b44f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -554,12 +554,12 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
static int
lat_print_generic(struct trace_seq *s, struct trace_entry *entry, int cpu)
{
- char comm[TASK_COMM_LEN];
+ struct trace_comm comm;
- trace_find_cmdline(entry->pid, comm);
+ trace_find_cmdline(entry->pid, &comm);
trace_seq_printf(s, "%8.8s-%-7d %3d",
- comm, entry->pid, cpu);
+ comm.comm, entry->pid, cpu);
return trace_print_lat_fmt(s, entry);
}
@@ -658,11 +658,11 @@ int trace_print_context(struct trace_iterator *iter)
struct trace_array *tr = iter->tr;
struct trace_seq *s = &iter->seq;
struct trace_entry *entry = iter->ent;
- char comm[TASK_COMM_LEN];
+ struct trace_comm comm;
- trace_find_cmdline(entry->pid, comm);
+ trace_find_cmdline(entry->pid, &comm);
- trace_seq_printf(s, "%16s-%-7d ", comm, entry->pid);
+ trace_seq_printf(s, "%16s-%-7d ", comm.comm, entry->pid);
if (tr->trace_flags & TRACE_ITER(RECORD_TGID)) {
unsigned int tgid = trace_find_tgid(entry->pid);
@@ -700,13 +700,13 @@ int trace_print_lat_context(struct trace_iterator *iter)
entry = iter->ent;
if (verbose) {
- char comm[TASK_COMM_LEN];
+ struct trace_comm comm;
- trace_find_cmdline(entry->pid, comm);
+ trace_find_cmdline(entry->pid, &comm);
trace_seq_printf(
s, "%16s %7d %3d %d %08x %08lx ",
- comm, entry->pid, iter->cpu, entry->flags,
+ comm.comm, entry->pid, iter->cpu, entry->flags,
entry->preempt_count & 0xf, iter->idx);
} else {
lat_print_generic(s, entry, iter->cpu);
@@ -1276,7 +1276,7 @@ static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter,
char *delim)
{
struct ctx_switch_entry *field;
- char comm[TASK_COMM_LEN];
+ struct trace_comm comm;
int S, T;
@@ -1284,7 +1284,7 @@ static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter,
T = task_index_to_char(field->next_state);
S = task_index_to_char(field->prev_state);
- trace_find_cmdline(field->next_pid, comm);
+ trace_find_cmdline(field->next_pid, &comm);
trace_seq_printf(&iter->seq,
" %7d:%3d:%c %s [%03d] %7d:%3d:%c %s\n",
field->prev_pid,
@@ -1293,7 +1293,7 @@ static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter,
field->next_cpu,
field->next_pid,
field->next_prio,
- T, comm);
+ T, comm.comm);
return trace_handle_return(&iter->seq);
}
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e9f0ff962660..972883643097 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -172,27 +172,33 @@ struct saved_cmdlines_buffer {
unsigned *map_cmdline_to_pid;
unsigned cmdline_num;
int cmdline_idx;
- char saved_cmdlines[];
+ struct trace_comm saved_cmdlines[];
};
static struct saved_cmdlines_buffer *savedcmd;
/* Holds the size of a cmdline and pid element */
#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s) \
- (TASK_COMM_LEN + sizeof((s)->map_cmdline_to_pid[0]))
+ (sizeof(struct trace_comm) + sizeof((s)->map_cmdline_to_pid[0]))
-static inline char *get_saved_cmdlines(int idx)
+static inline struct trace_comm *get_saved_cmdlines(int idx)
{
- return &savedcmd->saved_cmdlines[idx * TASK_COMM_LEN];
+ return &savedcmd->saved_cmdlines[idx];
}
-static inline void set_cmdline(int idx, const char *cmdline)
+static inline void set_cmdline(int idx, const struct task_struct *tsk)
{
- strscpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN);
+ struct trace_comm *comm = get_saved_cmdlines(idx);
+
+ BUILD_BUG_ON(sizeof(comm->comm) > sizeof(tsk->comm));
+
+ memcpy(comm->comm, tsk->comm, sizeof comm->comm);
+ if (sizeof(comm->comm) != sizeof(tsk->comm))
+ comm->comm[ARRAY_SIZE(comm->comm) - 1] = 0;
}
static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s)
{
- int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
+ int order = get_order(sizeof(*s) + s->cmdline_num * sizeof(struct trace_comm));
kmemleak_free(s);
free_pages((unsigned long)s, order);
@@ -222,7 +228,7 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
s->cmdline_num = val;
/* Place map_cmdline_to_pid array right after saved_cmdlines */
- s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val * TASK_COMM_LEN];
+ s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val];
memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
sizeof(s->map_pid_to_cmdline));
@@ -273,25 +279,25 @@ int trace_save_cmdline(struct task_struct *tsk)
}
savedcmd->map_cmdline_to_pid[idx] = tsk->pid;
- set_cmdline(idx, tsk->comm);
+ set_cmdline(idx, tsk);
arch_spin_unlock(&trace_cmdline_lock);
return 1;
}
-static void __trace_find_cmdline(int pid, char comm[])
+static void __trace_find_cmdline(int pid, struct trace_comm *comm)
{
unsigned map;
int tpid;
if (!pid) {
- strcpy(comm, "<idle>");
+ strcpy(comm->comm, "<idle>");
return;
}
if (WARN_ON_ONCE(pid < 0)) {
- strcpy(comm, "<XXX>");
+ strcpy(comm->comm, "<XXX>");
return;
}
@@ -300,14 +306,14 @@ static void __trace_find_cmdline(int pid, char comm[])
if (map != NO_CMDLINE_MAP) {
tpid = savedcmd->map_cmdline_to_pid[map];
if (tpid == pid) {
- strscpy(comm, get_saved_cmdlines(map), TASK_COMM_LEN);
+ *comm = *get_saved_cmdlines(map);
return;
}
}
- strcpy(comm, "<...>");
+ strcpy(comm->comm, "<...>");
}
-void trace_find_cmdline(int pid, char comm[])
+void trace_find_cmdline(int pid, struct trace_comm *comm)
{
preempt_disable();
arch_spin_lock(&trace_cmdline_lock);
@@ -561,11 +567,11 @@ static void saved_cmdlines_stop(struct seq_file *m, void *v)
static int saved_cmdlines_show(struct seq_file *m, void *v)
{
- char buf[TASK_COMM_LEN];
+ struct trace_comm buf;
unsigned int *pid = v;
- __trace_find_cmdline(*pid, buf);
- seq_printf(m, "%d %s\n", *pid, buf);
+ __trace_find_cmdline(*pid, &buf);
+ seq_printf(m, "%d %s\n", *pid, buf.comm);
return 0;
}
--
2.39.5
^ permalink raw reply related
* [PATCH 2/2] tracing: Keep pid and comm[] in the same structure
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Michal Koutný
Cc: David Laight
In-Reply-To: <20260626212356.64150-1-david.laight.linux@gmail.com>
Rather than have two separate dynamic arrays on the end of struct
saved_commandlines_buffer have a single dynamic array where each
entry contains the pid and associated task->comm[].
This simplifies the initialisation and lookup.
Don't bother trying to initialise the pid field no a non-zero value,
it only matters in the tracing_saved_cmdlines_seq_ops code.
Allocate entry [0] first so that the tracing_saved_cmdlines_seq_ops
code can just index the array with the file offset.
The code now uses the correct size when determining the page 'order'
to free the structure. The smaller size will always give the same
'order'.
Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
Is there any reason why this code uses alloc_pages() rather
than vmalloc()?
map_pid_to_cmdline[] is 64k*sizeof(int) so the whole structure
expands to 512k with about 64k/20 (about 3200) pid entries even
though the default is 128.
AFAICT there is only one copy of the data - so it could be static.
Perhaps with pointers to map_pid_cmdline[] and (after this patch)
pid_comm[], both of which could be separately resized.
I also noticed that map_pid_to_cmdline[] contains indexes into
pid_comm[], restricting these to 16bits would half the data area.
kernel/trace/trace_sched_switch.c | 97 +++++++++++++------------------
1 file changed, 39 insertions(+), 58 deletions(-)
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 972883643097..5e7c8cf444b8 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -167,27 +167,21 @@ static size_t tgid_map_max;
* where interrupt is disabled.
*/
static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
+struct pid_comm {
+ pid_t pid;
+ struct trace_comm comm;
+};
struct saved_cmdlines_buffer {
unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
- unsigned *map_cmdline_to_pid;
unsigned cmdline_num;
int cmdline_idx;
- struct trace_comm saved_cmdlines[];
+ struct pid_comm pid_comm[];
};
static struct saved_cmdlines_buffer *savedcmd;
-/* Holds the size of a cmdline and pid element */
-#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s) \
- (sizeof(struct trace_comm) + sizeof((s)->map_cmdline_to_pid[0]))
-
-static inline struct trace_comm *get_saved_cmdlines(int idx)
-{
- return &savedcmd->saved_cmdlines[idx];
-}
-
-static inline void set_cmdline(int idx, const struct task_struct *tsk)
+static inline void set_cmdline(struct pid_comm *pid_comm, const struct task_struct *tsk)
{
- struct trace_comm *comm = get_saved_cmdlines(idx);
+ struct trace_comm *comm = &pid_comm->comm;
BUILD_BUG_ON(sizeof(comm->comm) > sizeof(tsk->comm));
@@ -212,7 +206,7 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
int order;
/* Figure out how much is needed to hold the given number of cmdlines */
- orig_size = sizeof(*s) + val * SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
+ orig_size = sizeof(*s) + val * sizeof(s->pid_comm[0]);
order = get_order(orig_size);
size = 1 << (order + PAGE_SHIFT);
page = alloc_pages(GFP_KERNEL, order);
@@ -224,16 +218,11 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
memset(s, 0, sizeof(*s));
/* Round up to actual allocation */
- val = (size - sizeof(*s)) / SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
+ val = (size - sizeof(*s)) / sizeof(s->pid_comm[0]);
s->cmdline_num = val;
- /* Place map_cmdline_to_pid array right after saved_cmdlines */
- s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val];
-
memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
sizeof(s->map_pid_to_cmdline));
- memset(s->map_cmdline_to_pid, NO_CMDLINE_MAP,
- val * sizeof(*s->map_cmdline_to_pid));
return s;
}
@@ -247,6 +236,7 @@ int trace_create_savedcmd(void)
int trace_save_cmdline(struct task_struct *tsk)
{
+ struct pid_comm *pid_comm;
unsigned tpid, idx;
/* treat recording of idle task as a success */
@@ -272,14 +262,16 @@ int trace_save_cmdline(struct task_struct *tsk)
idx = savedcmd->map_pid_to_cmdline[tpid];
if (idx == NO_CMDLINE_MAP) {
- idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num;
-
+ idx = savedcmd->cmdline_idx;
savedcmd->map_pid_to_cmdline[tpid] = idx;
+ if (++idx >= savedcmd->cmdline_num)
+ idx = 0;
savedcmd->cmdline_idx = idx;
}
- savedcmd->map_cmdline_to_pid[idx] = tsk->pid;
- set_cmdline(idx, tsk);
+ pid_comm = savedcmd->pid_comm + idx;
+ pid_comm->pid = tsk->pid;
+ set_cmdline(pid_comm, tsk);
arch_spin_unlock(&trace_cmdline_lock);
@@ -288,8 +280,8 @@ int trace_save_cmdline(struct task_struct *tsk)
static void __trace_find_cmdline(int pid, struct trace_comm *comm)
{
+ struct pid_comm *pid_comm;
unsigned map;
- int tpid;
if (!pid) {
strcpy(comm->comm, "<idle>");
@@ -301,12 +293,11 @@ static void __trace_find_cmdline(int pid, struct trace_comm *comm)
return;
}
- tpid = pid & (PID_MAX_DEFAULT - 1);
- map = savedcmd->map_pid_to_cmdline[tpid];
+ map = savedcmd->map_pid_to_cmdline[pid & (PID_MAX_DEFAULT - 1)];
if (map != NO_CMDLINE_MAP) {
- tpid = savedcmd->map_cmdline_to_pid[map];
- if (tpid == pid) {
- *comm = *get_saved_cmdlines(map);
+ pid_comm = savedcmd->pid_comm + map;
+ if (pid_comm->pid == pid) {
+ *comm = pid_comm->comm;;
return;
}
}
@@ -521,42 +512,34 @@ const struct file_operations tracing_saved_tgids_fops = {
.release = seq_release,
};
-static void *saved_cmdlines_next(struct seq_file *m, void *v, loff_t *pos)
+static struct pid_comm *saved_cmdlines_entry(loff_t off)
{
- unsigned int *ptr = v;
+ struct pid_comm *pid_comm;
- if (*pos || m->count)
- ptr++;
-
- (*pos)++;
+ if (off >= savedcmd->cmdline_num)
+ return NULL;
- for (; ptr < &savedcmd->map_cmdline_to_pid[savedcmd->cmdline_num];
- ptr++) {
- if (*ptr == -1 || *ptr == NO_CMDLINE_MAP)
- continue;
+ /* Entries are used in sequence and never freed */
+ pid_comm = &savedcmd->pid_comm[off];
+ return pid_comm->pid ? pid_comm : NULL;
+}
- return ptr;
- }
+static void *saved_cmdlines_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ loff_t off = *pos;
- return NULL;
+ v = saved_cmdlines_entry(off);
+ if (v)
+ *pos = off + 1;
+ return v;
}
static void *saved_cmdlines_start(struct seq_file *m, loff_t *pos)
{
- void *v;
- loff_t l = 0;
-
preempt_disable();
arch_spin_lock(&trace_cmdline_lock);
- v = &savedcmd->map_cmdline_to_pid[0];
- while (l <= *pos) {
- v = saved_cmdlines_next(m, v, &l);
- if (!v)
- return NULL;
- }
-
- return v;
+ return saved_cmdlines_entry(*pos);
}
static void saved_cmdlines_stop(struct seq_file *m, void *v)
@@ -567,11 +550,9 @@ static void saved_cmdlines_stop(struct seq_file *m, void *v)
static int saved_cmdlines_show(struct seq_file *m, void *v)
{
- struct trace_comm buf;
- unsigned int *pid = v;
+ struct pid_comm *ptr = v;
- __trace_find_cmdline(*pid, &buf);
- seq_printf(m, "%d %s\n", *pid, buf.comm);
+ seq_printf(m, "%d %s\n", ptr->pid, ptr->comm.comm);
return 0;
}
--
2.39.5
^ permalink raw reply related
* Re: [PATCH v10 6/6] selftests/mm: add hwpoison-panic destructive test
From: Mike Rapoport @ 2026-06-27 7:52 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-6-6dacb8ad024d@debian.org>
Hi Breno,
On Fri, Jun 26, 2026 at 08:33:20AM -0700, Breno Leitao wrote:
> Add a destructive selftest that verifies
> vm.panic_on_unrecoverable_memory_failure actually panics when a
> hwpoison error hits a kernel-owned page.
> +ksft_skip=4
...
> +ksft_print() { echo "# $*"; }
> +ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
> +ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
There is tools/testing/selftests/kselftest/ktap_helpers.sh that already
implements this :)
--
Sincerely yours,
Mike.
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Steven Rostedt @ 2026-06-27 8:36 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Martin Kaiser, Masami Hiramatsu (Google), linux-trace-kernel,
linux-kernel
In-Reply-To: <20260627004538.4cc8b34a4a20ee14f9a38c28@kernel.org>
On June 26, 2026 4:45:38 PM GMT+01:00, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>On Fri, 26 Jun 2026 06:42:23 -0400
>Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> .
>>
>> I'll queue this up.
>
>I've already queued this on my probes/core branch.
>(which will be probes/fixes)
Awesome. Thanks Masami!
-- Steve
^ permalink raw reply
* Re: [PATCH] usb: typec: add trace point for typec_set_mode
From: Steven Rostedt @ 2026-06-27 9:23 UTC (permalink / raw)
To: Heikki Krogerus
Cc: Ahmad Fatoum, Greg Kroah-Hartman, Masami Hiramatsu,
Mathieu Desnoyers, linux-kernel, linux-usb, linux-trace-kernel,
kernel
In-Reply-To: <ajPW9P6qZyhqtjvp@kuha>
On Thu, 18 Jun 2026 14:31:00 +0300
Heikki Krogerus <heikki.krogerus@linux.intel.com> wrote:
> > > obj-$(CONFIG_TYPEC) += typec.o
> > > typec-y := class.o mux.o bus.o pd.o retimer.o mode_selection.o
> > > typec-$(CONFIG_ACPI) += port-mapper.o
> > > +typec-$(CONFIG_TRACING) += trace.o
> >
> > Thanks for the suggestion. I will do that for v2.
> >
> > I also saw there is Sashiko AI feedback on this patch[1], but I am not
> > familiar enough with how the event headers are used outside the kernel
> > to determine if that's actionable advice or if it can be ignored.
> >
> > Do you have an opinion on that?
> >
> > [1]:
> > https://sashiko.dev/#/patchset/20260617-typec_set_mode-tracepoint-v1-1-bdfbb39cfccd%40pengutronix.de
>
> It's correct. You need to use a private trace.h in this case, so just
> move it here: drivers/usb/typec/trace.h
>
> And also make sure you include everything needed in that header like
> it's telling you.
You may need more updates to make this work in a different directory.
Please read the comments in:
samples/trace_events/trace-events-sample.h
and
samples/trace_events/Makefile
-- Steve
^ permalink raw reply
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
From: Lorenzo Stoakes @ 2026-06-27 9:28 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Vlastimil Babka, Steven Rostedt,
Masami Hiramatsu, Rafael J. Wysocki, Oscar Salvador,
Mike Rapoport, linux-doc, linux-btrfs, linux-trace-kernel,
linux-pm, linux-cxl, Linus Torvalds
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>
+cc missing maintainers.
+cc Linus for general issues raised.
(Apologies, this email is VERY long, even for me)
Hi Rik,
I appreciate this is an RFC, but obviously part of that is early
feedback. So if this is meant to be an (EXTREMELY) rough pre-RFC
proof-of-concept then fine (however your 'todo' suggests otherwise).
And of course, before I get critical :) thank you for looking into this,
it's very interesting work. I want this feature to land. BUT :)
So if this in any way resembles what you plan to send upstream un-RFC'd I
need to pour a fairly large bucket of very cold water over this.
TL;DR: The series is completely unmergeable as it stands. Not even close.
Before we get into the code, please please please make sure you cc- the
right people. You're doing very invasive, very major work here.
You failed to even cc- the page allocator maintainer (Vlastimil) for a
series that radically alters page allocation.
MAKE SURE you cc- Vlastimil and all other maintainers and reviewers + lists
on future (RFC!!) revisions of this please.
If you want to do a quiet off-list pre-RFC that's fine, but it's
unforgivable to leave Vlastimil out of this even for that.
It's 30 seconds running a script (filtering for maintainers alone to save
on noise):
$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor 20260520150018.2491267-1-riel@surriel.com.mbx
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
David Sterba <dsterba@suse.com> (maintainer:BTRFS FILE SYSTEM)
Vlastimil Babka <vbabka@kernel.org> (maintainer:MEMORY MANAGEMENT - PAGE ALLOCATOR)
Steven Rostedt <rostedt@goodmis.org> (maintainer:TRACING)
Masami Hiramatsu <mhiramat@kernel.org> (maintainer:TRACING)
"Rafael J. Wysocki" <rafael@kernel.org> (maintainer:HIBERNATION (aka Software Suspend, aka swsusp))
Oscar Salvador <osalvador@suse.de> (maintainer:MEMORY HOT(UN)PLUG)
Mike Rapoport <rppt@kernel.org> (maintainer:MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION)
Johannes Weiner <hannes@cmpxchg.org> (maintainer:MEMORY MANAGEMENT - RECLAIM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-btrfs@vger.kernel.org (open list:BTRFS FILE SYSTEM)
linux-trace-kernel@vger.kernel.org (open list:TRACING)
linux-pm@vger.kernel.org (open list:HIBERNATION (aka Software Suspend, aka swsusp))
linux-cxl@vger.kernel.org (open list:MEMORY HOT(UN)PLUG)
HIBERNATION (aka Software Suspend, aka swsusp) status: Supported
SUSPEND TO RAM status: Supported
OK, so getting into the code. You didn't include an overall diffstat. So let's
see:
$ git checkout e1914add2799
...
$ b4 shazam 20260520150018.2491267-1-riel@surriel.com
...
$ git format-patch --cover-letter HEAD~40
...
$ tail -31 0000-cover-letter.patch
Documentation/admin-guide/sysctl/vm.rst | 21 -
Documentation/mm/physical_memory.rst | 13 +-
fs/btrfs/extent_io.c | 69 +-
fs/btrfs/extent_io.h | 4 +-
fs/btrfs/inode.c | 2 +-
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/raid56.c | 6 +-
fs/btrfs/relocation.c | 2 +-
fs/btrfs/scrub.c | 3 +-
include/linux/mmzone.h | 236 +-
include/linux/page-flags.h | 9 +
include/linux/pageblock-flags.h | 10 +
include/linux/vm_event_item.h | 10 +
include/trace/events/kmem.h | 373 ++
kernel/power/snapshot.c | 35 +-
mm/compaction.c | 360 +-
mm/debug.c | 19 +-
mm/internal.h | 39 +
mm/memory_hotplug.c | 4 +
mm/mm_init.c | 400 +-
mm/page_alloc.c | 5064 ++++++++++++++++++++---
mm/page_reporting.c | 149 +-
mm/show_mem.c | 27 +-
mm/sparse.c | 3 +-
mm/vmscan.c | 115 +-
mm/vmstat.c | 71 +-
26 files changed, 6131 insertions(+), 915 deletions(-)
This is completely insane :) I do hope the omission of the diffstat was not
to hide this, but at any rate, adding ~5,000 lines of code to page_alloc.c
is a complete non-starter.
In any case I think clearly more files are required, mm/super_pageblock.c
or whatever it'll be, but at any rate this much code being added to me
clearly indicates something is terribly wrong here anyway.
And you added no tests whatsoever. I get that some things are hard to test
in the kernel and mm, but not even an attempt? Surely you have some
localised stuff that you've been using to exercise this, couldn't some of
that be made into selftests?
Certainly if tests are not present, this needs to be justified in the cover
letter.
(Again I appreciate this is an RFC, so perhaps those are coming.)
Anyway let's look at some of this code. Johannes's patches all look
reasonable, but as soon as we start seeing:
Assisted-by: Claude:claude-opus-4.7
Things start to go awry.
The series is full of extremely dense comments that contain far too much
specific information to be parseable. E.g.:
+ * - owner_cpu == this CPU (or no owner): take the local PCP
+ * lock with spin_trylock and enqueue normally. The trylock
+ * fails only on rare local self re-entry (IRQ/NMI fires
+ * while the interrupted task already holds the lock) or
+ * while a remote drain is active; either way, fall back to
+ * free_one_page (or the zone-llist for FPI_TRYLOCK). No
+ * irqsave: the trylock cannot block on self, and remote
+ * CPUs never take this pcp->lock (they go via free_llist),
+ * so an interruption cannot deadlock against another freer.
This isn't readable. This isn't acceptable. It matches exactly what I've
found LLMs to do, adding far too much detail in way that isn't
human-readable.
You must redo all of these.
The patches are all quite dense. 40 patches is already far too many IMO for
a series this massive, but you're also piling in tonnes of complexity in
every patch.
The series should really be broken out into separate series that each
slowly build up the prerequisites, so they are reviewable, parseable, can
be seen in action before we move to the superblock concept and each having
tests or means of asserting that they in fact function.
It feels that there are patches that indeed can be broken out sensibly like
this,
e.g. https://lore.kernel.org/all/20260520150018.2491267-5-riel@surriel.com/
(which also seems to be a fix? So maybe should be sent as one with
appropriate tags?), so I think that's really the only viable way to land
something this huge.
I see Usama commented on you breaking userspace, and you said you are not
going to do that after all (thanks)... but it worries me that you
essentially render the watermark boost broken? I don't think that's OK?
The code in general dumps large blocks of code into already existing huge
functions but, even more unforgivably, turns small, maintainable, easily
understood functions into completely disgusting horror shows.
For instance, __rmqueue_smallest() goes from ~20 lines to 604 lines. This
is completely insane and totally unacceptable.
The code is also full of software engineering 101 fails. For instance in
__rmqueue_sb_find_fallback():
if (search_cats & SB_SEARCH_PREFERRED) {
...
for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
...
if (movable && cat == SB_TAINTED &&
sb->nr_free <= spb_tainted_reserve(sb))
continue;
...
for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
...
if (page) {
...
}
}
}
}
}
Note the mix of insane nesting AND guard clauses to make it even more
indecipherable.
There's no world in which I, or any other mm maintainer (I would venture to
say) want to maintain stuff like this.
A particularly concerning patch is
https://lore.kernel.org/all/20260520150018.2491267-15-riel@surriel.com/ -
The commit messages suffer from the same issue as the comments - incredibly
dense material that is far too detailed to the point of absolutely
clarifying nothing. E.g.:
"The SPB list heads (zone->spb_empty and the spb_lists[cat][full] matrix)
are initialized only by setup_superpageblocks(), which is __init and runs
only at boot. Hot-add into a previously-empty zone invokes
init_one_superpageblock() with zero-initialized list_heads, and the
inlined list_add_tail() NULL-derefs walking ->next->prev."
Is just complete word salad. It may as well be written in Egyptian
hieroglyphs.
You're explaining spaghetti code using English language which makes it even
more indecipherable, and not getting to the root of what you are actually
doing.
Again this feels very LLM-generated. I've noticed that they tend to do this
'write out the code again but in English' stuff (and is why I do not
allow LLMs to generate code or comments for me!)
You need to make commit messages and comments as succinct as possible and
as clear as possible for _human beings_ :)
I also notice you have this huge blocks of word salad comments all over the
place but I haven't seen a single kdoc comment (EDIT: noticed at least 1 later!)
This is a far more helpful, structured, form of describing functions and
for anything put in a header you really do need to provide them.
Back to the patch. The diffstat is:
include/linux/mmzone.h | 10 +
mm/compaction.c | 36 +-
mm/internal.h | 10 +
mm/mm_init.c | 146 +++++--
mm/page_alloc.c | 853 ++++++++++++++++++++++++++++++++---------
mm/vmstat.c | 66 ++--
6 files changed, 883 insertions(+), 238 deletions(-)
This is (generally speaking, and certainly in this case) far, far too much
for a single patch.
But as per above this is the patch where you commit what are essentially
code war crimes against __rmqueue_smallest(), adding code so densely nested
that you have to lop off the end of code to fit:
+ if (!movable && !is_migrate_cma(migratetype)) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ if (!sb->nr_free)
+ continue;
+ for (current_order = max_t(unsigned int,
+ order, pageblock_order);
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, MIGRATE_MOVABLE);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page = claim_whole_block(zone, page,
+ current_order, order,
+ migratetype, MIGRATE_MOVABLE);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
I mean in what world are we taking code like this?
https://lore.kernel.org/all/20260520150018.2491267-22-riel@surriel.com/ has
what seem to be more human comments (good!) but then so densely nested that
they have to be cropped (bad).
using smaller, separate, functions and a structured programming approach
here is the way to go.
There's random smaller issues too like
https://lore.kernel.org/all/20260520150018.2491267-23-riel@surriel.com/
seems to assume the compiler can't figure out to remove dead code (note the
wild inconsistency too in comments)
You seem in some commits to undo or correct stuff you did in previous ones
like
https://lore.kernel.org/all/20260520150018.2491267-24-riel@surriel.com/
I mean perhaps I'm mistaken, but otherwise, can you instead rebase your
series please?
And again we see the world salad:
+ * Maximum tainted superpageblock candidates per spb_evacuate_for_order call.
+ * Collected under zone->lock, then evacuated without it. Larger than the
+ * contig-allocation candidate cap because evacuation runs from the slowpath
+ * after reclaim/compaction failed: we need a meaningful chance of freeing a
+ * non-MOV-claimable pageblock before the slowpath escalates to dropping
+ * ALLOC_NOFRAGMENT (which lets __rmqueue_claim taint clean SPBs). Sized to
+ * scan a meaningful fraction of a typical tainted-pool population.
No kdocs, some functions/complicated logic missing any comments at all,
then piles of words thrown at you like a rock.
This isn't pleasant to read, it adds no clarity, it just makes the whole
thing a mess.
The code in that one is at least vaguely ok (though still too nested).
I feel like where you've manually written code the quality substantially
improves, where the LLM has, the quality nosedives into oblivion.
For instance
https://lore.kernel.org/all/20260520150018.2491267-26-riel@surriel.com/
seems to be radically better, albeit adding far too much code. So I assume
that was more manual.
Stuff like
https://lore.kernel.org/all/20260520150018.2491267-27-riel@surriel.com/
shows more of a structural issue.
In:
+ * Mark callers that have a cheap fallback if the page allocator returns
+ * NULL, so __rmqueue can refuse to taint a clean SPB when an existing
+ * tainted SPB still has free pageblocks waiting to be evacuated.
+ *
+ * Two shapes qualify:
+ *
+ * 1. Explicit fallback declaration: __GFP_NORETRY without
+ * __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
+ * skb_page_frag_refill on full sockets, etc.
+ *
+ * 2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no __GFP_NOMEMALLOC,
+ * no __GFP_NOFAIL. These callers (GFP_ATOMIC, GFP_NOWAIT, including
+ * ALLOC_HIGHATOMIC consumers) have implicit fallbacks: drop the
+ * packet, demote the slab order, return ENOMEM up the slowpath,
+ * retry from process context with GFP_KERNEL, etc. ALLOC_HIGHATOMIC
+ * callers also get a second crack at the dedicated MIGRATE_HIGHATOMIC
+ * reserve in rmqueue_buddy after __rmqueue returns NULL.
+ * Tainting a 1 GiB SPB to satisfy any of them is a long-lived
+ * fragmentation event for short-lived data.
+ *
The comment is vastly better than most, but you seem to be tying far too
much up in assumptions about what particular workloads do.
This is indicative perhaps of a need to refactor to more reasonably
determine these.
Adding a function and state, say, that expresses these properties would
allow you to break out these concepts and have the code be self-documenting
(as well as adding suitable actual documentation in comments that would be
more succinct).
https://lore.kernel.org/all/20260520150018.2491267-29-riel@surriel.com/
also shows some real issues with how you've implemented this, e.g.:
+ * Called from each PASS_1/2/2B/2C/2D success path after a successful
+ * allocation against a tainted SPB. If the SPB is below its shrink
+ * high-water mark, queue the SPB-driven slab shrink and try to start
+ * the per-SPB defrag worker. Both have their own cooldown gates inside,
+ * so this is cheap to call on every such allocation.
Now you're putting information from one gargantuan function with labels
about 'passes' (rather reminiscent of VMA merge 'cases') into another.
This is a clear sign of a broken abstraction and the code not being
structured correctly.
You should _express_ state encoded in these 'passes' in the _actual code_,
break that code up sensibly and in a way that can be assessed for
correctness, not put a bit-rotting comment on top of a function whose
correctness is much harder to confirm.
Again this patch has similar issues with ridiculously dense indecipherable
comments, e.g.:
+ * Last-chance defrag trigger before tainting a fresh clean SPB.
+ * Walk the tainted-SPB list and try to wake the per-SPB defrag
+ * worker on each. Catches SPBs that are stuck in expired-cooldown
+ * state because no allocator activity has touched them recently
+ * (the routine event-driven trigger from spb_update_list only
+ * fires on bucket transitions, not on every alloc). Once the
+ * cooldown has expired, spb_maybe_start_defrag() will requeue
+ * work; otherwise the gate inside spb_needs_defrag() no-ops
+ * cheaply. Bounded by nr_tainted_spbs and only runs when we are
+ * already on the slow path of fragmenting the clean pool.
The spoondecker is montiplexed in the fradupple complex dedadderated in the
splunkyfied concratanator underfined by the transpontaculatoration
matrixifier... :)
I think this is symptomatic of the abstraction being fundamentally broken.
If you have to establish vast swathes of cognitive context like this mid
way through a huge function (again poor __rmqueue_smallest() who will never
forgive you), then you've basically failed to abstract it.
Please I beg you ADD FUNCTIONS :) structured programming is a thing,
struct's are a thing, abstraction is a thing :)
Pouring spaghetti into a single function is something you expect to see on
a 1990's PHP website, not in core mm code ;)
https://lore.kernel.org/all/20260520150018.2491267-30-riel@surriel.com/
seems to be another example of you just rethinking parts of what you
already submitted midway through the series.
Again I might be missing something here, but if you are doing this, please
just rebase your series to use $NEWIDEA from the start.
https://lore.kernel.org/all/20260520150018.2491267-35-riel@surriel.com/ is
another wall of text (newlines! Please :) but it seems like something you
can break out separately no?
It seems at least reviewable :)
https://lore.kernel.org/all/20260520150018.2491267-36-riel@surriel.com/
contains more of the same broken abstraction stuff, terrible nesting stuff,
word salad stuff but is at least mercifully smaller.
https://lore.kernel.org/all/20260520150018.2491267-37-riel@surriel.com/ is
almost emblematic of the terrible (LLM I hope?) comment issue in the
series. Overly dense hieroglyphs.
https://lore.kernel.org/all/20260520150018.2491267-38-riel@surriel.com/ is
another 'why didn't you rebase?' patch (again maybe I'm missing something
here!)
https://lore.kernel.org/all/20260520150018.2491267-39-riel@surriel.com/
seems more reasonable although newlines please :)
And finally
https://lore.kernel.org/all/20260520150018.2491267-40-riel@surriel.com/ -
while it's a do-not-merge patch in an RFC that changes tracing so maybe
unfair, but the comments are suffering the same symptoms as the rest of the
series.
~
So really, you need to start again, from scratch, and without the use of an
LLM for generating code, or at least with it kept on a (very very short)
leash.
And to be clear, I _want_ this concept of GB superpageblocks to land. It's
a really exciting concept.
Pulling compaction kicking and screaming into 2026 stands to significantly
benefit linux users and developers.
But the execution has to be _completely_ rethought.
I also worry about correctness - I simply cannot see how you can have sense
of it, given the state of the code. For something so invasive and so
critical to kernel functionality, code quality is simply not optional here.
Practical thoughts on how to rework the series:
- Properly abstract the concepts
- Properly separate out functions and data structures
- Add _human-readable_ comments that are succinct + clear as possible
- If a function becomes too nested, separate into smaller functions
- Add tests, if at all or in any way possible (and justify if you can't)
- Clarify commit messages, don't just rewrite code in English
- Use newlines for new paragraphs everywhere :)
- Refactor existing code in preparation for your changes
- Add a new file to contain your changes (+ add to page alloc MAINTAINERS
section)
- Add kdocs for anything not static, and clearly describe static functions
- Split into separate series as much as possible, gradually building
foundations for your changes
- Make everything less dense and more abstracted
In general, write the series with reviewers and other kernel developers in
mind - write clear explanations in comments and commit messages, have the
series slowly build to what you're implementing.
That way, we can see that correctness is maintained throughout, can review
what's there sensibly, and the series becomes upstreamable.
IOW I say we take off and nuke the entire site from orbit. It's the only
way to be sure :)
~
Another issue here is maintainer time - even this _extremely_ light-touch
review has taken me a few hours (of my weekend :). To review it in detail
would take probably DAYS of dedicated work.
In general, I'm concerned that we're going to get caught in a cycle of LLM
code sent - reviewers spending hours reviewing - LLM generates new revision
- etc.
This simply isn't scalable, maintainers cannot do this in the face of the
sheer quantity and complexity of code that LLMs can generate.
We are simply going to have to NAK. And that helps nobody.
Luckily for me, this isn't in a sub-(sub-)system I maintain, so I am not
obliged, but I do have empathy for my fellow maintainers, and am VERY
concerned about this trend.
There has recently been an absolute wave of LLM code, some acked as such
(and I think you for doing so here!) but others unacknowledged entirely,
and the workload, which was already too much, has risen significantly (Jon
has noted the rise in commit count for instance in LWN).
Treating maintainer time as without value was already an issue, but I fear
that we'll see a significant increase in maintainer stress and sadly,
burnout.
As such, I feel that we will have to implement measures at some point to
deprioritise/quickly dismiss such series, unfortunately.
But series from smart and capable engineers such as yourselfk who are well
respected in the communityk will likely not bek and thus will suffer from
this issue indefinitely.
So on that basis, I ask respectfully that you account for this when using
this tooling.
(Also, I appreciate that this is an RFC, but your recent non-RFC GUP series
https://lore.kernel.org/all/20260616190300.1509639-1-riel@surriel.com/
(revision v2 at
https://lore.kernel.org/all/20260625015053.2445008-1-riel@surriel.com/) has
all the same hallmarks as this one, so I feel this point needs to be
underlined).
~
Speaking more generally across the industry, I've been reading about
companies generating code blindly and running into terrible problems with
software that ends up becoming totally unmanageable.
I won't tolerate this happening happening in mm, and strongly object to the
concept (held by some AI proponents) that code is simply an unimportant
byproduct of AI.
The kernel's code (and especially mm's) is _critically_ important, quite
literally, as it forms the basis of the world's critical technical
infrastructure.
I am not opposed to the use of LLMs, but they _must_ serve as tools to
_assist_ experts at their job, not a means by which code bases are
degraded.
~
OK with all that said - to be absolutely clear - I respect you a great
deal, and I KNOW you're (much, much) better than this.
And, to repeat, this idea is very exciting and I _want_ to see this land.
But I feel you've rather let the LLM run amok and it's selling you (very,
very) short, given just how smart and capable you are.
Let's try again :)
Thanks, Lorenzo
^ permalink raw reply
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
From: Rik van Riel @ 2026-06-27 13:36 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Vlastimil Babka, Steven Rostedt,
Masami Hiramatsu, Rafael J. Wysocki, Oscar Salvador,
Mike Rapoport, linux-doc, linux-btrfs, linux-trace-kernel,
linux-pm, linux-cxl, Linus Torvalds
In-Reply-To: <aj9yrlB0TrlYCLlf@lucifer>
On Sat, 2026-06-27 at 10:28 +0100, Lorenzo Stoakes wrote:
>
> So really, you need to start again, from scratch, and without the use
> of an
> LLM for generating code, or at least with it kept on a (very very
> short)
> leash.
>
> And to be clear, I _want_ this concept of GB superpageblocks to land.
> It's
> a really exciting concept.
That is the one reason I sent out RFC code before it
is ready. I am looking for feedback on the concepts
in this series.
How do people feel about splitting up the free lists,
so each gigabyte (well, PUD sized) chunk of memory
has its own free lists?
How can we balance the desire for higher-order kernel
allocations, against the desire to preserve gigabyte
sized chunks of memory that can be used for user space?
>
> Pulling compaction kicking and screaming into 2026 stands to
> significantly
> benefit linux users and developers.
That's another big question. How do we balance the
desire to keep compaction overhead low with the desire
to do higher order allocations almost everywhere?
>
> But the execution has to be _completely_ rethought.
There's no argument there.
I am just hoping to figure out what I should be
doing on a conceptual level, before figuring out
how to do it cleanly.
The mess in the RFC is the result of trying something
that seemed right, watching it fail in some subtle
way, and trying to fix it up.
Once I know what I need to do, coming up with a
cleaner implementation is very doable.
>
> IOW I say we take off and nuke the entire site from orbit. It's the
> only
> way to be sure :)
>
BOOM?
> Another issue here is maintainer time - even this _extremely_ light-
> touch
> review has taken me a few hours (of my weekend :). To review it in
> detail
> would take probably DAYS of dedicated work.
I suspect there is a mismatch in expectations here.
I already knew this code has to be totally redone.
I was looking for feedback on the basic concepts
and design in the patch series, but failed to
clearly communicate that.
You provided some detailed feedback on the code,
but as of yet nobody has really provided any
opinions on things like whether it is desirable
at all to have the free lists per gigablock,
or whether we need to come up with some totally
different approach.
How do we better communicate that kind of thing
in the future?
Is that something to spell out more clearly in
the cover letter?
Is that kind of feedback something developers
could even reasonably ask for? (if not, how do
we figure out what maintainers want?)
--
All Rights Reversed.
^ permalink raw reply
page: | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox