Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

get_any_page() collapses three different failure modes into a single
-EIO return:

  * the put_page race in the !count_increased path;
  * the HWPoisonHandlable() rejection that bounces out of
    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
  * the HWPoisonHandlable() rejection that goes through the
    count_increased / put_page / shake_page retry loop.

The first is transient (the page is racing with the allocator).  The
second can be either transient (a userspace folio briefly off LRU
during migration/compaction) or stable (slab/vmalloc/page-table/
kernel-stack pages).  The third describes a stable kernel-owned page
that the count_increased=true caller already held a reference on.

Distinguish them on the return path: keep -EIO for both the put_page
race and the -EBUSY-after-retries branch (shake_page() cannot drag a
folio back from active migration, so we cannot prove the page is
permanently kernel-owned from there), keep -EBUSY for the allocation
race (unchanged), and return -ENOTRECOVERABLE only from the
count_increased-true HWPoisonHandlable() rejection that exhausts its
retries -- the caller's reference is structural evidence that the
page is owned by the kernel.

Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc.

memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch is a no-op for users of memory_failure() and only changes
the errno that soft_offline_page() can propagate to its callers.  A
follow-up wires the new return code through memory_failure() and
reports MF_MSG_KERNEL for the unrecoverable cases.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 49bcfbd04d213..bae883df3ccb2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
 				shake_page(p);
 				goto try_again;
 			}
+			/*
+			 * Return -EIO rather than -ENOTRECOVERABLE: this
+			 * branch is also reached for pages that are merely
+			 * off-LRU transiently (e.g. a folio in the middle
+			 * of migration or compaction), which shake_page()
+			 * cannot drag back.  The caller cannot prove the
+			 * page is permanently kernel-owned from here, so
+			 * keep it on the recoverable errno.
+			 */
 			ret = -EIO;
 			goto out;
 		}
@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
 			goto try_again;
 		}
 		put_page(p);
-		ret = -EIO;
+		ret = -ENOTRECOVERABLE;
 	}
 out:
-	if (ret == -EIO)
+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));

 	return ret;
@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
  *         operations like allocation and free,
- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
+ *         kernel stacks, and similar non-LRU/non-buddy pages).
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {

-- 
2.53.0-Meta

^ permalink raw reply related

* [PATCH v7 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang

A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running.  The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash.  In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.

This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context.  The default is disabled so existing workloads see no
change.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
  get_any_page() and surface it via -ENOTRECOVERABLE, per David
  Hildenbrand's and Lance Yang's review of v6.  This drops the
  is_reserved snapshot in memory_failure() and the mf_get_page_status
  enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
  over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
  matching tracepoint string; the enum now covers both PG_reserved
  pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
  and 2/4 ("classify get_any_page() failures by reason") into a
  single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
  (David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org

Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org

Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
  unrecoverable kernel page hwpoison events (reserved pages, refcount-0
  non-buddy pages, unknown state), with a recheck to avoid racing with
  concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
  document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
  and add a selftest verifying SIGBUS recovery on userspace pages still
  works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
  https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org

Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (6):
      mm/memory-failure: drop dead error_states[] entry for reserved pages
      mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
      mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
      mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl

 Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++
 mm/memory-failure.c                     | 85 +++++++++++++++++++++++++--------
 2 files changed, 146 insertions(+), 19 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply

* [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

The first entry of error_states[],

	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },

is unreachable.  identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():

  * memory_failure() reaches identify_page_state() only after
    get_hwpoison_page() returned 1.  get_any_page() reaches that
    return only via __get_hwpoison_page(), which gates the refcount
    on HWPoisonHandlable().  HWPoisonHandlable() rejects PG_reserved
    pages, so they fail with -EBUSY/-EIO long before
    identify_page_state() runs.

  * try_memory_failure_hugetlb() reaches identify_page_state() on
    the MF_HUGETLB_IN_USED branch, but the page is necessarily a
    hugetlb folio there.  The first table entry that matches a
    hugetlb folio is { head, head, MF_MSG_HUGE, me_huge_page }, so
    they dispatch to me_huge_page() before the (now-removed)
    reserved entry would have matched, regardless of whether
    PG_reserved happens to be set on the head page.

me_kernel() never executes and the entry exists only to be matched
against by code that cannot see it.

Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.

No functional change.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 866c4428ac7ef..49bcfbd04d213 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -992,17 +992,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
 	return false;
 }
 
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
-	unlock_page(p);
-	return MF_IGNORED;
-}
-
 /*
  * Page in unknown state. Do nothing.
  * This is a catch-all in case we fail to make sense of the page state.
@@ -1211,10 +1200,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 #define mlock		(1UL << PG_mlocked)
 #define lru		(1UL << PG_lru)
 #define head		(1UL << PG_head)
-#define reserved	(1UL << PG_reserved)
 
 static struct page_state error_states[] = {
-	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
 	 * PG_buddy pages only make a small fraction of all free pages.
@@ -1246,7 +1233,6 @@ static struct page_state error_states[] = {
 #undef mlock
 #undef lru
 #undef head
-#undef reserved
 
 static void update_per_node_mf_stats(unsigned long pfn,
 				     enum mf_result result)

-- 
2.53.0-Meta


^ permalink raw reply related

* Re: [PATCH v6 4/7] locking: Factor out queued_spin_release()
From: Steven Rostedt @ 2026-05-13 15:37 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
	linux-arch, linux-mm, linux-trace-kernel, kernel-team,
	Paul E. McKenney
In-Reply-To: <64c202b8a76a7d98515cf10cc1f99ecb0a9a7ccf.1777999826.git.d@ilvokhin.com>

On Tue,  5 May 2026 17:09:33 +0000
Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> Introduce queued_spin_release() as an arch-overridable unlock primitive,
> and make queued_spin_unlock() a generic wrapper around it.
> 
> This is a preparatory refactoring for the next commit, which adds
> contended_release tracepoint instrumentation to queued_spin_unlock().

In change logs, do not use "next commit". Instead say something like:

  In preparation for adding contended_release tracepoint instrumentation to
  queued_spin_unlock(), refactor the code to allow out of line calls when
  the tracepoint is enabled.

Or something like that.

-- Steve



> 
> Rename the existing arch-specific queued_spin_unlock() overrides on
> x86 (paravirt) and MIPS to queued_spin_release().
> 
> No functional change.
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

^ permalink raw reply

* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
From: Jesper Dangaard Brouer @ 2026-05-13 15:32 UTC (permalink / raw)
  To: Dmitry Ilvokhin, Vlastimil Babka (SUSE)
  Cc: Andrew Morton, Matthew Wilcox, linux-mm, Steven Rostedt,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, David Hildenbrand,
	Lorenzo Stoakes, Shuah Khan, linux-kernel, linux-trace-kernel,
	kernel-team
In-Reply-To: <af4mbDB3-ohO5DCp@shell.ilvokhin.com>

On 08/05/2026 20.07, Dmitry Ilvokhin wrote:
> On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
>>> On 5/8/26 7:29 PM, Andrew Morton wrote:
>>>> e .configOn Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>>>>
>>>>> Add tracepoints to the page allocator fast paths that acquire
>>>>> zone->lock, allowing diagnosis of lock contention in production.
>>>>
>>>> Thanks, I'm surprised we haven't done this yet.
>>>
>>> There was a recent attempt [1]. Not being a generic solution wasn't welcome.
>>>
>>> [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
>>
>> And this is the generic solution I think?
>>
>> https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
> 
> Thanks for cc'ing me, Vlastimil.
> 
> Yes, this is an attempt at a generic solution for tracing contended
> locks, including spinlocks, so it should also cover the use case
> proposed in this patchset.
>

I'm aware of the generic solution and often use `perf lock contention`.
And the tool libbpf-tools/klockstat. My experience is unfortunately that
enabling these tracepoint is prohibitive expensive on production server,
and production suffers when I run these tools.

I'm very happy to see a patchset adding a contended case. But I worry
that tracing all contented locks in the system is also too much to have
enabled continuously for production.

This patch is carefully constructed to minimize overhead, such that I
can enable this continuously on production to catch issues.  If I
identify issue I will use the generic tracpoints for further debugging.

> In fact, zone->lock contention was one of the primary motivations for
> this work.

In the generic solution I'm loosing the "zone" and pages "count".  I
need this information to get the answers I'm looking for.  Specifically
I'm looking at reducing CONFIG_PCP_BATCH_SCALE_MAX, but I want to this
to be a data-driven decision (my first principle is: if you cannot
measure it you cannot improve it).

I'm likely going to apply this patch to our production system, such that
I can get my data-driven decision.  I need to deploy it widely enough to
get enough server experiencing direct-reclaim.  I'll report back if
people are interested in these learning?

--Jesper

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Steven Rostedt @ 2026-05-13 15:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQL_sWznA+JJLdzP_ZdUgQeO7p-AGnOtx9=fXjH+PnRJBA@mail.gmail.com>

On Wed, 13 May 2026 08:16:07 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> It's impossible to track all modifications.
> See what sched-ext is doing.
> What does it modify? Everything.

What about just having a list of what BPF programs are loaded, what they
may be attached to, and what kfuncs they are calling?

-- Steve

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Alexei Starovoitov @ 2026-05-13 15:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <20260513111331.7bede512@gandalf.local.home>

On Wed, May 13, 2026 at 8:13 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 3 May 2026 21:51:49 +0200
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > Nack.
> >
> > Please stop this spam.
> > We're not doing it. These helpers have been around for a long time.
> > There was no need to taint then. There is no need to taint now.
>
> Hi Alexei,
>
> I'm wondering if there's a way to see what modifications BPF programs are
> doing to the kernel? I try to make it easy to see what modifications ftrace
> has done (like the enabled_functions file), because I like to know how my
> kernel is modified since boot up.

It's impossible to track all modifications.
See what sched-ext is doing.
What does it modify? Everything.

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Steven Rostedt @ 2026-05-13 15:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQJ5fatNF4auH+a8E39zWMfja3rm4BM_xGcTnLX8uuCQ9Q@mail.gmail.com>

On Sun, 3 May 2026 21:51:49 +0200
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Nack.
> 
> Please stop this spam.
> We're not doing it. These helpers have been around for a long time.
> There was no need to taint then. There is no need to taint now.

Hi Alexei,

I'm wondering if there's a way to see what modifications BPF programs are
doing to the kernel? I try to make it easy to see what modifications ftrace
has done (like the enabled_functions file), because I like to know how my
kernel is modified since boot up.

Thus, it would be nice to know if BPF is modifying anything in user space
or just what BPF programs are loaded.

Note, I'm agnostic to this change, it just brought up a previous concern of
mine when I read it.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: Breno Leitao @ 2026-05-13 15:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <3e5e8fb8-6957-46b7-9777-0ed1bff1d0fb@kernel.org>

On Wed, May 13, 2026 at 01:48:11PM +0200, David Hildenbrand (Arm) wrote:
> > @@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> >  			goto try_again;
> >  		}
> >  		put_page(p);
> > -		ret = -EIO;
> > +		ret = -ENOTRECOVERABLE;
> >  	}
> >  out:
> > -	if (ret == -EIO)
> > +	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> >  		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> >  
> >  	return ret;
> > @@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
> >  			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> >  		}
> >  		goto unlock_mutex;
> > +	} else if (res == -ENOTRECOVERABLE) {
> > +		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> > +		goto unlock_mutex;
> >  	} else if (res < 0) {
> >  		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> >  		goto unlock_mutex;
> 
> That might probably read nicer as
> 
> switch (res) {
> case 0: ...
> case 1: ...
> case -ENOTRECOVERABLE:  ...
> case ...
> default:
> }
> 
> > 
> > 
> > If that is what you are suggestion, maybe we can create another
> > MF_MSG_RESERVED? and another return value for get_any_page() to track
> > the reserve pages ?
> 
> I guess "reserved" is really just like most other kernel pages. So I wouldn't
> special-case them here.
> 
> Or would there be a good reason?

Not really, treating them as MF_MSG_KERNEL is sufficient for my use
case.

Thank you for the review. I'm digesting all the feedback and will send
out a new revision shortly, where we can continue the discussion.

--breno

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: Steven Rostedt @ 2026-05-13 15:02 UTC (permalink / raw)
  To: CaoRuichuang, Shuah Khan
  Cc: mhiramat, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407203727.442b583c@gandalf.local.home>


Shuah,

Can you pull this into your urgent branch?

Thanks,

-- Steve


On Tue, 7 Apr 2026 20:37:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Shuah,
> 
> Care to take this through your tree. Probably could even add:
> 
> Cc: stable@vger.kernel.org
> Fixes: 8b55572e51805 ("tracing/selftests: Add tracefs mount options test")
> 
> As well as:
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> -- Steve
> 
> 
> On Tue,  7 Apr 2026 18:26:13 +0800
> CaoRuichuang <create0818@163.com> wrote:
> 
> > From: Cao Ruichuang <create0818@163.com>
> > 
> > test_ownership.tc is sourced by ftracetest under /bin/sh.
> > 
> > The script currently declares mount_point with local at file scope,
> > which makes /bin/sh abort with "local: not in a function" before the
> > test can reach the eventfs ownership checks.
> > 
> > Replace the top-level local declaration with a normal shell variable so
> > kernels that support the gid= tracefs mount option can run the test at
> > all.
> > 
> > Signed-off-by: Cao Ruichuang <create0818@163.com>
> > ---
> >  tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > index e71cc3ad0..6d00d3c0f 100644
> > --- a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > +++ b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > @@ -6,7 +6,7 @@
> >  original_group=`stat -c "%g" .`
> >  original_owner=`stat -c "%u" .`
> >  
> > -local mount_point=$(get_mount_point)
> > +mount_point=$(get_mount_point)
> >  
> >  mount_options=$(get_mnt_options "$mount_point")
> >    
> 


^ permalink raw reply

* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
From: Jesper Dangaard Brouer @ 2026-05-13 15:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team
In-Reply-To: <d440b00b-6d7b-48cb-b37d-43f6f885ed01@kernel.org>



On 08/05/2026 22.15, David Hildenbrand (Arm) wrote:
> On 5/8/26 18:22, hawk@kernel.org wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> Add a selftest to verify the kmem:mm_zone_lock_contended,
>> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
>>
>> The test has two components:
>>
>> zone_lock_contention.c - a workload that spawns threads doing rapid
>> page allocation and freeing to generate zone->lock contention. It
>> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
>> free_pcppages_bulk() and rmqueue_bulk() calls.
>>
>> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
>> exist, have the expected fields, fire under load, and that wait_ns
>> is populated when contention occurs.
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>>   tools/testing/selftests/mm/Makefile           |   2 +
>>   .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
>>   .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++
> 
> This really looks excessive and ... not really how we usually treat tracepoints?
> 
> I don't know about others, but I don't think this is really what we want as a MM
> selftest.
> 

I wanted to have a program that tested the code I changed, so I simply
made AI write a verification test and asked it to create as a selftest,
that I've run to verify my code change was correctly implemented.  As I
needed to trigger lock contention the test is more advanced, but luckily
AI solved it in only the 2nd attempt.

It makes sense to drop this patch. We shouldn't keep this code in the
kernel tree, it simply verified that my code works. There is little
chance that this test will catch meaningful regressions for these
tracepoints.

--Jesper

^ permalink raw reply

* Re: [PATCH] tracing: samples: avoid warning about __aeabi_unwind_cpp_pr1
From: Steven Rostedt @ 2026-05-13 14:59 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: Arnd Bergmann, Masami Hiramatsu, Nathan Chancellor, Marc Zyngier,
	Arnd Bergmann, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260323105646.590718-1-arnd@kernel.org>


Vincent,

Is this patch needed? That is, did it fall through the cracks?

-- Steve

On Mon, 23 Mar 2026 11:56:41 +0100
Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
> 
> The now more verbose check found another symbol missing from the whitelist:
> 
> Unexpected symbols in kernel/trace/simple_ring_buffer.o:
>          U __aeabi_unwind_cpp_pr1
> 
> Add this to the Makefile.
> 
> Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  kernel/trace/Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index d662c1a64cd5..aba6a25db17b 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -169,8 +169,8 @@ targets += undefsyms_base.o
>  # because it is not linked into vmlinux.
>  KASAN_SANITIZE_undefsyms_base.o := y
>  
> -UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
> -		      __msan simple_ring_buffer \
> +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
> +		      __x86_indirect_thunk __aeabi_unwind_cpp simple_ring_buffer \
>  		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
>  
>  quiet_cmd_check_undefined = NM      $<


^ permalink raw reply

* Re: [PATCH v4 2/3] fgraph: Enhance funcgraph-retval with BTF-based type-aware output
From: Steven Rostedt @ 2026-05-13 14:47 UTC (permalink / raw)
  To: Donglin Peng
  Cc: mhiramat, linux-trace-kernel, bpf, linux-kernel, pengdonglin,
	Xiaoqin Zhang
In-Reply-To: <20251215034153.2367756-3-dolinux.peng@gmail.com>

On Mon, 15 Dec 2025 11:41:52 +0800
Donglin Peng <dolinux.peng@gmail.com> wrote:

> From: pengdonglin <pengdonglin@xiaomi.com>
> 
> The current funcgraph-retval implementation suffers from two accuracy
> issues:
> 
> 1. Void-returning functions still print a return value, creating
>    misleading noise in the trace output.
> 
> 2. For functions returning narrower types (e.g., char, short), the
>    displayed value can be incorrect because high bits of the register
>    may contain undefined data.
> 
> This patch addresses both problems by leveraging BTF to obtain the exact
> return type of each traced kernel function. The key changes are:
> 
> 1. Void function filtering: Functions with void return type no longer
>    display any return value in the trace output, eliminating unnecessary
>    clutter.
> 
> 2. Type-aware value formatting: The return value is now properly truncated
>    to match the actual width of the return type before being displayed.
>    Additionally, the value is formatted according to its type for better
>    human readability.
> 
> Here is an output comparison:
> 
> Before:
>  # perf ftrace -G vfs_read --graph-opts retval
>  ...
>  1)               |   touch_atime() {
>  1)               |     atime_needs_update() {
>  1)   0.069 us    |       make_vfsuid(); /* ret=0x0 */
>  1)   0.067 us    |       make_vfsgid(); /* ret=0x0 */
>  1)               |       current_time() {
>  1)   0.197 us    |         ktime_get_coarse_real_ts64_mg(); /* ret=0x187f886aec3ed6f5 */
>  1)   0.352 us    |       } /* current_time ret=0x69380753 */
>  1)   0.792 us    |     } /* atime_needs_update ret=0x0 */
>  1)   0.937 us    |   } /* touch_atime ret=0x0 */
> 
> After:
>  # perf ftrace -G vfs_read --graph-opts retval
>  ...
>  2)               |   touch_atime() {
>  2)               |     atime_needs_update() {
>  2)   0.070 us    |       make_vfsuid(); /* ret=0x0 */
>  2)   0.070 us    |       make_vfsgid(); /* ret=0x0 */
>  2)               |       current_time() {
>  2)   0.162 us    |         ktime_get_coarse_real_ts64_mg();
>  2)   0.312 us    |       } /* current_time ret=0x69380649(trunc) */
>  2)   0.753 us    |     } /* atime_needs_update ret=false */
>  2)   0.899 us    |   } /* touch_atime */
> 
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
> ---
>  kernel/trace/trace_functions_graph.c | 124 ++++++++++++++++++++++++---
>  1 file changed, 111 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> index 17c75cf2348e..46b66b1cfc16 100644
> --- a/kernel/trace/trace_functions_graph.c
> +++ b/kernel/trace/trace_functions_graph.c
> @@ -15,6 +15,7 @@
>  
>  #include "trace.h"
>  #include "trace_output.h"
> +#include "trace_btf.h"
>  
>  /* When set, irq functions might be ignored */
>  static int ftrace_graph_skip_irqs;
> @@ -120,6 +121,13 @@ enum {
>  	FLAGS_FILL_END   = 3 << TRACE_GRAPH_PRINT_FILL_SHIFT,
>  };
>  
> +enum {
> +	RETVAL_FMT_HEX   = BIT(0),
> +	RETVAL_FMT_DEC   = BIT(1),
> +	RETVAL_FMT_BOOL  = BIT(2),
> +	RETVAL_FMT_TRUNC = BIT(3),
> +};
> +
>  static void
>  print_graph_duration(struct trace_array *tr, unsigned long long duration,
>  		     struct trace_seq *s, u32 flags);
> @@ -865,6 +873,73 @@ static void print_graph_retaddr(struct trace_seq *s, struct fgraph_retaddr_ent_e
>  
>  #if defined(CONFIG_FUNCTION_GRAPH_RETVAL) || defined(CONFIG_FUNCTION_GRAPH_RETADDR)
>  
> +static void trim_retval(unsigned long func, unsigned long *retval, bool *print_retval,
> +			int *fmt)

This function should really be in trace_btf.c and a stub when btf is not
enabled.

-- Steve

> +{
> +	const struct btf_type *t;
> +	char name[KSYM_NAME_LEN];
> +	struct btf *btf;
> +	u32 v, msb;
> +	int kind;
> +
> +	if (!IS_ENABLED(CONFIG_DEBUG_INFO_BTF))
> +		return;
> +
> +	if (lookup_symbol_name(func, name))
> +		return;
> +
> +	t = btf_find_func_proto(name, &btf);
> +	if (IS_ERR_OR_NULL(t))
> +		return;
> +
> +	t = btf_type_skip_modifiers(btf, t->type, NULL);
> +	kind = t ? BTF_INFO_KIND(t->info) : BTF_KIND_UNKN;
> +	switch (kind) {
> +	case BTF_KIND_UNKN:
> +		*print_retval = false;
> +		break;
> +	case BTF_KIND_STRUCT:
> +	case BTF_KIND_UNION:
> +	case BTF_KIND_ENUM:
> +	case BTF_KIND_ENUM64:
> +		if (kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION)
> +			*fmt = RETVAL_FMT_HEX;
> +		else
> +			*fmt = RETVAL_FMT_DEC;
> +
> +		if (t->size > sizeof(unsigned long)) {
> +			*fmt |= RETVAL_FMT_TRUNC;
> +		} else {
> +			msb = BITS_PER_BYTE * t->size - 1;
> +			*retval &= GENMASK(msb, 0);
> +		}
> +		break;
> +	case BTF_KIND_INT:
> +		v = *(u32 *)(t + 1);
> +		if (BTF_INT_ENCODING(v) == BTF_INT_BOOL) {
> +			*fmt = RETVAL_FMT_BOOL;
> +			msb = 0;
> +		} else {
> +			if (BTF_INT_ENCODING(v) == BTF_INT_SIGNED)
> +				*fmt = RETVAL_FMT_DEC;
> +			else
> +				*fmt = RETVAL_FMT_HEX;
> +
> +			if (t->size > sizeof(unsigned long)) {
> +				*fmt |= RETVAL_FMT_TRUNC;
> +				msb = BITS_PER_LONG - 1;
> +			} else {
> +				msb = BTF_INT_BITS(v) - 1;
> +			}
> +		}
> +		*retval &= GENMASK(msb, 0);
> +		break;
> +	default:
> +		*fmt = RETVAL_FMT_HEX;
> +		break;
> +	}
> +}
> +

^ permalink raw reply

* Re: [PATCH v4 1/3] ftrace: Build trace_btf.c when CONFIG_DEBUG_INFO_BTF is enabled
From: Steven Rostedt @ 2026-05-13 14:42 UTC (permalink / raw)
  To: Donglin Peng
  Cc: mhiramat, linux-trace-kernel, bpf, linux-kernel, pengdonglin,
	Xiaoqin Zhang
In-Reply-To: <20251215034153.2367756-2-dolinux.peng@gmail.com>


Sorry for the late reply, I've been a bit busy on other things recently.

On Mon, 15 Dec 2025 11:41:51 +0800
Donglin Peng <dolinux.peng@gmail.com> wrote:

> From: pengdonglin <pengdonglin@xiaomi.com>
> 
> The trace_btf.c file provides BTF helper functions used by the ftrace
> subsystem. This change makes its compilation solely dependent on

Nit, change logs should never say "This change". Instead it should be
worded as:

  "Make the compilation of trace_btf.c soley depend on..."

-- Steve

> CONFIG_DEBUG_INFO_BTF, allowing features like funcgraph-retval to also
> utilize these helpers.
> 
> Additionally, the redundant dependency on CONFIG_PROBE_EVENTS_BTF_ARGS
> is removed, as CONFIG_DEBUG_INFO_BTF already depends on
> CONFIG_BPF_SYSCALL.
> 
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
> ---
>  kernel/trace/Kconfig  | 2 +-
>  kernel/trace/Makefile | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e1214b9dc990..653c1fcefa4c 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -755,7 +755,7 @@ config FPROBE_EVENTS
>  config PROBE_EVENTS_BTF_ARGS
>  	depends on HAVE_FUNCTION_ARG_ACCESS_API
>  	depends on FPROBE_EVENTS || KPROBE_EVENTS
> -	depends on DEBUG_INFO_BTF && BPF_SYSCALL
> +	depends on DEBUG_INFO_BTF
>  	bool "Support BTF function arguments for probe events"
>  	default y
>  	help
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index fc5dcc888e13..6c4bf5a6c4f3 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -116,7 +116,7 @@ obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
>  endif
>  obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
>  obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
> -obj-$(CONFIG_PROBE_EVENTS_BTF_ARGS) += trace_btf.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += trace_btf.o
>  obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
>  obj-$(CONFIG_BOOTTIME_TRACING) += trace_boot.o
>  obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o


^ permalink raw reply

* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 14:01 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>


Sorry for the spam, I accidentally mapped ctrl+Enter to send without
confirmation and can't teach my fingers not to press it instead of ctrl+delete..

> That's definitely a good addition, kmalloc_nolock was not that good already so
> I tried some way have a preallocation, though I realise it isn't really
> flexible.
> 
> Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
> 
> Isn't this similar to what you'd do with a kmem_cache. That was my original
> idea although that uses spinlocks too.
> 
> I quickly tried an implementation like yours using
> mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and
> it still explodes with my monitors, but perhaps now that tracepoints no longer
> disable preemption it could play well with some monitors.
> 
> The selftests with tlob seem to work just the same with this kmem_cache (up to
> the unrelated RCU stall). To be fair since you only allocate from the uprobe
> handler, you'd probably be just fine with kmalloc_nolock, but let's continue
> with the preallocation logic.
> 
> 
> The API is starting to get complex (well, not that it wasn't already).
> We have essentially 3 ways to allocate:
>  * fully automatic with kmalloc_nolock
>  * semi-automatic with pool preallocation
>  * manual with direct storage preallocation
> 
> We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
> DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
> DA_MON_POOL_SIZE to be define (force that with an #error).

Anyway, this way you probably wouldn't need to define a different init function
and let everything handled more transparently.

Also you don't need to call da_create_or_get() explicitly,
da_handle_start_event() should do it for you.

The only manual step required is da_create_storage() when you explicitly cannot
lock when calling da_handle_start_event() (that would be DA_ALLOC_MANUAL, you
don't need that).

Hope I didn't create too much confusion.

Thanks,
Gabriele


^ permalink raw reply

* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 13:50 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_create_empty_storage() uses kmalloc_nolock(), which requires
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE; on UML and some PREEMPT_RT
> configurations it always returns NULL.  Calling kmalloc from scheduler
> tracepoint handlers also adds unwanted latency and can fail under
> memory pressure.
> 
> Add da_monitor_init_prealloc(N) as an opt-in alternative to
> da_monitor_init().  It allocates N da_monitor_storage slots with
> GFP_KERNEL up-front and manages them on a LIFO free-stack protected
> by a spinlock, so da_create_or_get() never calls kmalloc on the hot
> path.
> 
> Monitors that do not call da_monitor_init_prealloc() are unaffected.

That's definitely a good addition, kmalloc_nolock was not that good already so I
tried some way have a preallocation, though I realise it isn't really flexible.

Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

Isn't this similar to what you'd do with a kmem_cache. That was my original idea
although that uses spinlocks too.

I quickly tried an implementation like yours using
mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and it
still explodes with my monitors, but perhaps now that tracepoints no longer
disable preemption it could play well with some monitors.

The selftests with tlob seem to work just the same with this kmem_cache (up to
the unrelated RCU stall). To be fair since you only allocate from the uprobe
handler, you'd probably be just fine with kmalloc_nolock, but let's continue
with the preallocation logic.


The API is starting to get complex (well, not that it wasn't already).
We have essentially 3 ways to allocate:
 * fully automatic with kmalloc_nolock
 * semi-automatic with pool preallocation
 * manual with direct storage preallocation

We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_MON_AUTO, DA_MON_POOL,
DA_MON_MANUAL} where DA_MON_POOL also requires DA_MON_POOL_SIZE to be define
(force that with an #error).

> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/da_monitor.h | 208 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 186 insertions(+), 22 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index d04bb3229c75..7d6f62766251 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -433,18 +433,6 @@ static inline da_id_type da_get_id(struct da_monitor
> *da_mon)
>   return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
>  }
>  
> -/*
> - * da_create_or_get - create the per-object storage if not already there
> - *
> - * This needs a lookup so should be guarded by RCU, the condition is checked
> - * directly in da_create_storage()
> - */
> -static inline void da_create_or_get(da_id_type id, monitor_target target)
> -{
> - guard(rcu)();
> - da_create_storage(id, target, da_get_monitor(id, target));
> -}
> -
>  /*
>   * da_fill_empty_storage - store the target in a pre-allocated storage
>   *
> @@ -475,15 +463,121 @@ static inline monitor_target
> da_get_target_by_id(da_id_type id)
>   return mon_storage->target;
>  }
>  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> + * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> + * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> + *
> + * Because every field is wrapped in this struct and the struct itself is a
> + * per-TU static, each monitor that includes this header gets a completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> + * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> + * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> + */
> +struct da_per_obj_pool {
> + struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> + struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> + unsigned int                free_top;
> + spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> + .lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};
> +
> +static void da_pool_return_cb(struct rcu_head *head)
> +{
> + struct da_monitor_storage *ms =
> + container_of(head, struct da_monitor_storage, rcu);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + da_pool.free[da_pool.free_top++] = ms;
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +}
> +
> +/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> +static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + if (!da_pool.free_top) {
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> + return -ENOSPC;
> + }
> + mon_storage = da_pool.free[--da_pool.free_top];
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +
> + mon_storage->id = id;
> + mon_storage->target = target;
> + guard(rcu)();
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + return 0;
> +}
> +
> +/*
> + * Tries da_create_storage() first (lock-free via kmalloc_nolock); falls back
> + * to kmalloc(GFP_KERNEL).  Must be called from task context.
> + */
> +static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target
> target)
> +{
> + struct da_monitor_storage *mon_storage;
> +
> + scoped_guard(rcu) {
> + if (da_create_storage(id, target, da_get_monitor(id, target)))
> + return 0;
> + }
> +
> + /*
> + * da_create_storage() failed because kmalloc_nolock() returned NULL.
> + * Allocate with GFP_KERNEL outside the RCU read section: GFP_KERNEL
> + * may sleep for memory reclaim, which is illegal while the RCU read
> + * lock is held (preemption disabled on !PREEMPT_RT).
> + */
> + mon_storage = kmalloc_obj(*mon_storage, GFP_KERNEL | __GFP_ZERO);
> + if (!mon_storage)
> + return -ENOMEM;
> + mon_storage->id = id;
> + mon_storage->target = target;
> +
> + /*
> + * Re-check for a concurrent insertion before linking: another
> + * caller may have succeeded while we slept in kmalloc().
> + * Discard our allocation and let the winner's entry stand.
> + */
> + scoped_guard(rcu) {
> + if (da_get_monitor(id, target)) {
> + kfree(mon_storage);
> + return 0;
> + }
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + }
> + return 0;
> +}
> +
> +/* Create the per-object storage if not already there. */
> +static inline int da_create_or_get(da_id_type id, monitor_target target)
> +{
> + if (da_pool.storage)
> + return da_create_or_get_pool(id, target);
> + return da_create_or_get_kmalloc(id, target);
> +}
> +
>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with locks or
> - * implicitly. For instance, if da_destroy_storage is called at sched_exit
> and
> - * da_create_storage can never occur after that, it's safe to call this
> without
> - * locks.
> - * This function includes an RCU read-side critical section to synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -491,15 +585,17 @@ static inline void da_destroy_storage(da_id_type id)
>  
>   guard(rcu)();
>   mon_storage = __da_get_mon_storage(id);
> -
>   if (!mon_storage)
>   return;
>   da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   hash_del_rcu(&mon_storage->node);
> - kfree_rcu(mon_storage, rcu);
> + if (da_pool.storage)
> + call_rcu(&mon_storage->rcu, da_pool_return_cb);
> + else
> + kfree_rcu(mon_storage, rcu);
>  }
>  
> -static void da_monitor_reset_all(void)
> +static __maybe_unused void da_monitor_reset_all(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   int bkt;
> @@ -510,13 +606,65 @@ static void da_monitor_reset_all(void)
>   rcu_read_unlock();
>  }
>  
> +/*
> + * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> + *
> + * Allocates @prealloc_count storage slots up-front so that
> da_create_or_get()
> + * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> + * of da_monitor_init() for monitors that require pool mode.
> + */
> +static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +{
> + hash_init(da_monitor_ht);
> +
> + da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> +   GFP_KERNEL);
> + if (!da_pool.storage)
> + return -ENOMEM;
> +
> + da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> +      GFP_KERNEL);
> + if (!da_pool.free) {
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + return -ENOMEM;
> + }
> +
> + da_pool.free_top = 0;
> + for (unsigned int i = 0; i < prealloc_count; i++)
> + da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> + return 0;
> +}
> +
> +/*
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> + */
>  static inline int da_monitor_init(void)
>  {
>   hash_init(da_monitor_ht);
>   return 0;
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> + WARN_ON_ONCE(!hash_empty(da_monitor_ht));
> + /*
> + * Wait for all in-flight da_pool_return_cb() callbacks to
> + * complete before freeing da_pool.free.  synchronize_rcu() is
> + * not sufficient: it only waits for callbacks registered before
> + * it was called, but call_rcu() from concurrent da_destroy_storage()
> + * calls may have been enqueued later.  rcu_barrier() drains every
> + * pending callback.
> + */
> + rcu_barrier();
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + kfree(da_pool.free);
> + da_pool.free = NULL;
> + da_pool.free_top = 0;
> +}
> +
> +static inline void da_monitor_destroy_kmalloc(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   struct hlist_node *tmp;
> @@ -534,6 +682,22 @@ static inline void da_monitor_destroy(void)
>   }
>  }
>  
> +/*
> + * da_monitor_destroy - tear down the per-object monitor
> + *
> + * Pool mode: the hash must already be empty (caller must have drained all
> + * tasks first); calls rcu_barrier() to drain all pending da_pool_return_cb()
> + * callbacks before freeing pool arrays.
> + * Kmalloc mode: drains any remaining entries after synchronize_rcu().
> + */
> +static inline void da_monitor_destroy(void)
> +{
> + if (da_pool.storage)
> + da_monitor_destroy_pool();
> + else
> + da_monitor_destroy_kmalloc();
> +}
> +
>  /*
>   * Allow the per-object monitors to run allocation manually, necessary if the
>   * start condition is in a context problematic for allocation (e.g.
> scheduling).


^ permalink raw reply

* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 13:47 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_create_empty_storage() uses kmalloc_nolock(), which requires
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE; on UML and some PREEMPT_RT
> configurations it always returns NULL.  Calling kmalloc from scheduler
> tracepoint handlers also adds unwanted latency and can fail under
> memory pressure.
> 
> Add da_monitor_init_prealloc(N) as an opt-in alternative to
> da_monitor_init().  It allocates N da_monitor_storage slots with
> GFP_KERNEL up-front and manages them on a LIFO free-stack protected
> by a spinlock, so da_create_or_get() never calls kmalloc on the hot
> path.
> 
> Monitors that do not call da_monitor_init_prealloc() are unaffected.

That's definitely a good addition, kmalloc_nolock was not that good already so I
tried some way have a preallocation, though I realise it isn't really flexible.

Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

Isn't this similar to what you'd do with a kmem_cache. That was my original idea
although that uses spinlocks too.

I quickly tried an implementation like yours using
mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and it
still explodes with my monitors, but perhaps now that tracepoints no longer
disable preemption it could play well with some monitors.

The selftests with tlob seem to work just the same with this kmem_cache (up to
the unrelated RCU stall). To be fair since you only allocate from the uprobe
handler, you'd probably be just fine with kmalloc_nolock, but let's ignore this
for now.


That said, the API is starting to get complex (well, not that it wasn't
already).
We have essentially 3 ways to allocate:
 * fully automatic with kmalloc_nolock
 * semi-automatic with pool preallocation
 * manual with direct storage preallocation

> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/da_monitor.h | 208 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 186 insertions(+), 22 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index d04bb3229c75..7d6f62766251 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -433,18 +433,6 @@ static inline da_id_type da_get_id(struct da_monitor
> *da_mon)
>   return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
>  }
>  
> -/*
> - * da_create_or_get - create the per-object storage if not already there
> - *
> - * This needs a lookup so should be guarded by RCU, the condition is checked
> - * directly in da_create_storage()
> - */
> -static inline void da_create_or_get(da_id_type id, monitor_target target)
> -{
> - guard(rcu)();
> - da_create_storage(id, target, da_get_monitor(id, target));
> -}
> -
>  /*
>   * da_fill_empty_storage - store the target in a pre-allocated storage
>   *
> @@ -475,15 +463,121 @@ static inline monitor_target
> da_get_target_by_id(da_id_type id)
>   return mon_storage->target;
>  }
>  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> + * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> + * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> + *
> + * Because every field is wrapped in this struct and the struct itself is a
> + * per-TU static, each monitor that includes this header gets a completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> + * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> + * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> + */
> +struct da_per_obj_pool {
> + struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> + struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> + unsigned int                free_top;
> + spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> + .lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};
> +
> +static void da_pool_return_cb(struct rcu_head *head)
> +{
> + struct da_monitor_storage *ms =
> + container_of(head, struct da_monitor_storage, rcu);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + da_pool.free[da_pool.free_top++] = ms;
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +}
> +
> +/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> +static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + if (!da_pool.free_top) {
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> + return -ENOSPC;
> + }
> + mon_storage = da_pool.free[--da_pool.free_top];
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +
> + mon_storage->id = id;
> + mon_storage->target = target;
> + guard(rcu)();
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + return 0;
> +}
> +
> +/*
> + * Tries da_create_storage() first (lock-free via kmalloc_nolock); falls back
> + * to kmalloc(GFP_KERNEL).  Must be called from task context.
> + */
> +static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target
> target)
> +{
> + struct da_monitor_storage *mon_storage;
> +
> + scoped_guard(rcu) {
> + if (da_create_storage(id, target, da_get_monitor(id, target)))
> + return 0;
> + }
> +
> + /*
> + * da_create_storage() failed because kmalloc_nolock() returned NULL.
> + * Allocate with GFP_KERNEL outside the RCU read section: GFP_KERNEL
> + * may sleep for memory reclaim, which is illegal while the RCU read
> + * lock is held (preemption disabled on !PREEMPT_RT).
> + */
> + mon_storage = kmalloc_obj(*mon_storage, GFP_KERNEL | __GFP_ZERO);
> + if (!mon_storage)
> + return -ENOMEM;
> + mon_storage->id = id;
> + mon_storage->target = target;
> +
> + /*
> + * Re-check for a concurrent insertion before linking: another
> + * caller may have succeeded while we slept in kmalloc().
> + * Discard our allocation and let the winner's entry stand.
> + */
> + scoped_guard(rcu) {
> + if (da_get_monitor(id, target)) {
> + kfree(mon_storage);
> + return 0;
> + }
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + }
> + return 0;
> +}
> +
> +/* Create the per-object storage if not already there. */
> +static inline int da_create_or_get(da_id_type id, monitor_target target)
> +{
> + if (da_pool.storage)
> + return da_create_or_get_pool(id, target);
> + return da_create_or_get_kmalloc(id, target);
> +}
> +
>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with locks or
> - * implicitly. For instance, if da_destroy_storage is called at sched_exit
> and
> - * da_create_storage can never occur after that, it's safe to call this
> without
> - * locks.
> - * This function includes an RCU read-side critical section to synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -491,15 +585,17 @@ static inline void da_destroy_storage(da_id_type id)
>  
>   guard(rcu)();
>   mon_storage = __da_get_mon_storage(id);
> -
>   if (!mon_storage)
>   return;
>   da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   hash_del_rcu(&mon_storage->node);
> - kfree_rcu(mon_storage, rcu);
> + if (da_pool.storage)
> + call_rcu(&mon_storage->rcu, da_pool_return_cb);
> + else
> + kfree_rcu(mon_storage, rcu);
>  }
>  
> -static void da_monitor_reset_all(void)
> +static __maybe_unused void da_monitor_reset_all(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   int bkt;
> @@ -510,13 +606,65 @@ static void da_monitor_reset_all(void)
>   rcu_read_unlock();
>  }
>  
> +/*
> + * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> + *
> + * Allocates @prealloc_count storage slots up-front so that
> da_create_or_get()
> + * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> + * of da_monitor_init() for monitors that require pool mode.
> + */
> +static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +{
> + hash_init(da_monitor_ht);
> +
> + da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> +   GFP_KERNEL);
> + if (!da_pool.storage)
> + return -ENOMEM;
> +
> + da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> +      GFP_KERNEL);
> + if (!da_pool.free) {
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + return -ENOMEM;
> + }
> +
> + da_pool.free_top = 0;
> + for (unsigned int i = 0; i < prealloc_count; i++)
> + da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> + return 0;
> +}
> +
> +/*
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> + */
>  static inline int da_monitor_init(void)
>  {
>   hash_init(da_monitor_ht);
>   return 0;
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> + WARN_ON_ONCE(!hash_empty(da_monitor_ht));
> + /*
> + * Wait for all in-flight da_pool_return_cb() callbacks to
> + * complete before freeing da_pool.free.  synchronize_rcu() is
> + * not sufficient: it only waits for callbacks registered before
> + * it was called, but call_rcu() from concurrent da_destroy_storage()
> + * calls may have been enqueued later.  rcu_barrier() drains every
> + * pending callback.
> + */
> + rcu_barrier();
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + kfree(da_pool.free);
> + da_pool.free = NULL;
> + da_pool.free_top = 0;
> +}
> +
> +static inline void da_monitor_destroy_kmalloc(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   struct hlist_node *tmp;
> @@ -534,6 +682,22 @@ static inline void da_monitor_destroy(void)
>   }
>  }
>  
> +/*
> + * da_monitor_destroy - tear down the per-object monitor
> + *
> + * Pool mode: the hash must already be empty (caller must have drained all
> + * tasks first); calls rcu_barrier() to drain all pending da_pool_return_cb()
> + * callbacks before freeing pool arrays.
> + * Kmalloc mode: drains any remaining entries after synchronize_rcu().
> + */
> +static inline void da_monitor_destroy(void)
> +{
> + if (da_pool.storage)
> + da_monitor_destroy_pool();
> + else
> + da_monitor_destroy_kmalloc();
> +}
> +
>  /*
>   * Allow the per-object monitors to run allocation manually, necessary if the
>   * start condition is in a context problematic for allocation (e.g.
> scheduling).


^ permalink raw reply

* Re: [RFC PATCH v2 01/10] rv/da: fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-05-13 12:39 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8af5ba4bd93d2acb8a546e8e47ced974a87c1eb8.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
> may racing with the sched_switch handler:
> 
>   da_monitor_start()               sched_switch handler
>   -------------------------        ---------------------------------
>   da_mon->monitoring = 1;
>                                    if (da_monitoring(da_mon))  /* true  */
>                                        ha_start_timer_ns(...);
>                                        /* hrtimer->base == NULL, crash */
>   da_monitor_init_hook(da_mon);
>   /* hrtimer_setup() sets base */
> 
> Fix the ordering and pair with release/acquire semantics:
> 
>   da_monitor_init_hook(da_mon);
>   smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>   return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
> 
> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
> the load can observe monitoring=1 while hrtimer->base is still NULL.
> The plain accesses are also data races under KCSAN.
> 
> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
> cover the reset path.
> 
> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor
> definition via C macros")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Thanks for the fix!

There are probably more than a few of those bugs since most monitors are
implicitly serialised because their events are serialised..

Looks good to me.

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>

See minor comments below:

> ---
>  include/rv/da_monitor.h | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 39765ff6f098..00ded3d5ab3f 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
>  static inline void da_monitor_reset(struct da_monitor *da_mon)
>  {
>   da_monitor_reset_hook(da_mon);
> - da_mon->monitoring = 0;
> + WRITE_ONCE(da_mon->monitoring, 0);
>   da_mon->curr_state = model_get_initial_state();
>  }
>  
> @@ -95,8 +95,9 @@ static inline void da_monitor_reset(struct da_monitor
> *da_mon)
>  static inline void da_monitor_start(struct da_monitor *da_mon)
>  {
>   da_mon->curr_state = model_get_initial_state();
> - da_mon->monitoring = 1;
>   da_monitor_init_hook(da_mon);
> + /* Pairs with smp_load_acquire in da_monitoring(). */

I wonder if these comment are really adding value, pairing smp_load_acquire /
smp_store_release is the by-the-book usage and everything is here.

But feel free to leave it if you think it's clearer.

Thanks,
Gabriele

> + smp_store_release(&da_mon->monitoring, 1);
>  }
>  
>  /*
> @@ -104,7 +105,8 @@ static inline void da_monitor_start(struct da_monitor
> *da_mon)
>   */
>  static inline bool da_monitoring(struct da_monitor *da_mon)
>  {
> - return da_mon->monitoring;
> + /* Pairs with smp_store_release in da_monitor_start(). */
> + return smp_load_acquire(&da_mon->monitoring);
>  }
>  
>  /*


^ permalink raw reply

* Re: [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Steven Rostedt @ 2026-05-13 12:34 UTC (permalink / raw)
  To: Yash Suthar
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me
In-Reply-To: <CAPfzD4kiEMFdbK36X1_f1+Vn=v3nfvUODZJuJ40sNJ_fRr9zKA@mail.gmail.com>

On Tue, 12 May 2026 20:11:08 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:

> Gentle ping.

Hi,

What's the rush? This is just a clean up change. There's no feature here
that you need is there?

If you see it in patchwork[1], it's not lost. I just have other things
ahead of it. I usually process cleanup code last.

Thanks,

-- Steve

[1] https://patchwork.kernel.org/project/linux-trace-kernel/patch/20260502174741.39636-1-yashsuthar983@gmail.com/

^ permalink raw reply

* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: David Hildenbrand (Arm) @ 2026-05-13 11:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <agMpqhpgmezqnaA_@gmail.com>

On 5/12/26 15:33, Breno Leitao wrote:
> On Tue, May 12, 2026 at 10:21:50AM +0200, David Hildenbrand (Arm) wrote:
>>
>>>  		}
>>>  		goto unlock_mutex;
>>>  	} else if (res < 0) {
>>> -		if (is_reserved)
>>> +		/*
>>> +		 * Promote a stable unhandlable kernel page diagnosed by
>>> +		 * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
>>> +		 * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
>>> +		 */
>>> +		if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
>>>  			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>
>>
>> It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
>> is unhandable if it is PG_reserved?
> 
> Are you saying that we should identify if the page is PG_reserved in
> get_hwpoison_page() instead of in memory_failure(), as done in the
> previous patch ("mm/memory-failure: report MF_MSG_KERNEL for reserved
> pages") ?
> 
>> Why can't we just return a special error code from  get_hwpoison_page()? We ahve
>> plenty of errno values to chose from.
> 
> Something like:
> 
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..0a6d83575833e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -878,7 +878,7 @@ static const char *action_name[] = {
>  };
>  
>  static const char * const action_page_types[] = {
> -	[MF_MSG_KERNEL]			= "reserved kernel page",
> +	[MF_MSG_KERNEL]			= "unrecoverable kernel page",
>  	[MF_MSG_KERNEL_HIGH_ORDER]	= "high-order kernel page",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> @@ -1394,6 +1394,21 @@ static int get_any_page(struct page *p, unsigned long flags)
>  	int ret = 0, pass = 0;
>  	bool count_increased = false;
> 
> +	if (PageReserved(p)) {
> +		ret = -ENOTRECOVERABLE;
> +		goto out;
> +	}
> +
>  	if (flags & MF_COUNT_INCREASED)
>  		count_increased = true;
>  
> @@ -1422,7 +1437,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>  				shake_page(p);
>  				goto try_again;
>  			}
> -			ret = -EIO;
> +			ret = -ENOTRECOVERABLE;
>  			goto out;
>  		}
>  	}
> @@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
>  			goto try_again;
>  		}
>  		put_page(p);
> -		ret = -EIO;
> +		ret = -ENOTRECOVERABLE;
>  	}
>  out:
> -	if (ret == -EIO)
> +	if (ret == -EIO || ret == -ENOTRECOVERABLE)
>  		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
>  
>  	return ret;
> @@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
>  			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
>  		}
>  		goto unlock_mutex;
> +	} else if (res == -ENOTRECOVERABLE) {
> +		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> +		goto unlock_mutex;
>  	} else if (res < 0) {
>  		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>  		goto unlock_mutex;

That might probably read nicer as

switch (res) {
case 0: ...
case 1: ...
case -ENOTRECOVERABLE:  ...
case ...
default:
}

> 
> 
> If that is what you are suggestion, maybe we can create another
> MF_MSG_RESERVED? and another return value for get_any_page() to track
> the reserve pages ?

I guess "reserved" is really just like most other kernel pages. So I wouldn't
special-case them here.

Or would there be a good reason?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-13  9:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Jiri Olsa, Masami Hiramatsu, Andrii Nakryiko,
	bpf, linux-trace-kernel, Oleg Nesterov, Peter Zijlstra,
	Ingo Molnar
In-Reply-To: <CAEf4Bza4sqLw8GHoq+MFBgsnhiJ_s91UUrMcA-paYeBr7=bz0A@mail.gmail.com>

On Tue, May 12, 2026 at 12:38:34PM -0700, Andrii Nakryiko wrote:
> On Tue, May 12, 2026 at 12:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > +       /*
> > > +        * We have nop10 (with first byte overwritten to int3),
> > > +        * change it to:
> > > +        *   lea 0x80(%rsp), %rsp
> > > +        *   call tramp
> > > +        *
> > > +        * The first lea instruction skips the stack redzone so the call
> > > +        * instruction can safely push return address on stack.
> > > +        */
> >
> > typo: lea -128(%rsp), %rsp

ugh, thanks

> >
> > you can also do:
> >
> > add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.
> 
> When I asked AI about this it explained that add instruction modifies
> flags, so it's not a good fit here. lea doesn't touch flags.
> 
> >
> > Initially I didn't like this approach, since we just introduced
> > usdt nop5 and now need to recompile everything again,
> > but looking at the fix it's definitely simpler than alternatives
> > and doesn't have annoying limitations.
> 
> 
> yeah, limitations are annoying, especially with those global "DO NOT
> OPTIMIZE" flags... Jiri, let's polish your version and land it?

ok, will send it out

jirka

^ permalink raw reply

* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-13  9:31 UTC (permalink / raw)
  To: Wen Yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <cb929b8b-5bfb-4afe-ba50-45620c38ea96@linux.dev>

On Wed, 2026-05-13 at 13:32 +0800, Wen Yang wrote:
> Thanks for both messages.  Two patches are ready; let me address
> your follow-up concerns before sending.
> 
>    1. "all monitors reusing slots would suffer from it"
> 
>       Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
>       pool.  RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
>       storage (a single static variable and a per-cpu variable) and
>       never share slots across monitor types.  The race is exclusive
>       to PER_TASK, so fixing that variant's da_monitor_destroy() is
>       the correct scope.
> 
>    2. "LTL monitors don't even have monitoring"
> 
>       tracepoint_synchronize_unregister() does not rely on the
>       monitoring flag at all.  It is a system-wide barrier — it
>       calls synchronize_rcu_tasks_trace() followed by
>       synchronize_srcu(&tracepoint_srcu) — draining every in-flight
>       tracepoint handler on every CPU regardless of which monitor
>       dispatched it.  LTL handlers are covered without any special
>       treatment.
> 
> The slot-ordering issue (patch 1) affects all per-task DA monitors,
> not only HA ones — "independent on HA" — because
> RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
> past the end of rv[]), so da_monitor_reset_all() overwrites whatever
> follows rv[] in task_struct whenever any per-task monitor is
> disabled.

Exactly, and since whatever follows .rv is randomised on a task_struct, this can
get quite nasty.

I included my version of the fix in the series in [1], but feel free to send
yours, you got there first ;)

> 
> Also corrected "wwnr probe handler" to "stall probe handler" in
> patch 2 per your annotation.
> 

While tracepoint_synchronize_unregister() does fix the race, I still see a timed
bomb in the way we do ha_monitor_reset_env().

Since we reused the same slots for per-task monitors (not for the others, you're
right I was brainfarting) we essentially don't know what happened before we do
da_monitor_init(), the same slot could have been used by an LTL monitor which
cannot even reliably clear the byte used by the monitoring flag.

Now, we either mandate all monitors to memset the entire slot (union
rv_task_monitor) or we don't assume anything about the slot's state during
initialisation. Any middle ground could reveal pesky bugs as soon as we refactor
the structs.

The latter idea is what I did in [1]. I believe that would make the
synchronisation superfluous.

What do you think?

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/20260512140250.262190-8-gmonaco@redhat.com

> Please let me know if the above reasoning addresses your concerns.
> 
> 
> --
> Best wishes,
> Wen
> 
> > > 
> > > >   include/rv/da_monitor.h | 18 ++++++++++++++++--
> > > >   1 file changed, 16 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > > > index 00ded3d5ab3f..d04bb3229c75 100644
> > > > --- a/include/rv/da_monitor.h
> > > > +++ b/include/rv/da_monitor.h
> > > > @@ -304,6 +304,20 @@ static int da_monitor_init(void)
> > > >   
> > > >   /*
> > > >    * da_monitor_destroy - return the allocated slot
> > > > + *
> > > > + * Call tracepoint_synchronize_unregister() before reset_all() to close
> > > > + * the race where an in-flight non-HA probe handler sets monitoring=1
> > > > + * (without calling timer_setup()) after da_monitor_reset_all() has
> > > > + * already cleared the slot but before the caller's own sync completes.
> > > > + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> > > > + * the same slot would call timer_delete() on a never-initialised
> > > > + * timer_list, triggering ODEBUG warnings.
> > > > + *
> > > > + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> > > > + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> > > > + * The caller's own __rv_disable_monitor() issues a second sync after
> > > > + * returning from disable(); that redundant call is harmless on the
> > > > + * infrequent admin (enable/disable) path.
> > > >    */
> > > >   static inline void da_monitor_destroy(void)
> > > >   {
> > > > @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> > > >   		WARN_ONCE(1, "Disabling a disabled monitor: "
> > > > __stringify(MONITOR_NAME));
> > > >   		return;
> > > >   	}
> > > > +	tracepoint_synchronize_unregister();
> > > > +	da_monitor_reset_all();
> > > >   	rv_put_task_monitor_slot(task_mon_slot);
> > > >   	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > > > -
> > > > -	da_monitor_reset_all();
> > > >   }
> > > >   
> > > >   #elif RV_MON_TYPE == RV_MON_PER_OBJ
> > 


^ permalink raw reply

* Re: [RFC PATCH v2 03/10] selftests/verification: fix verificationtest-ktap for out-of-tree execution
From: Gabriele Monaco @ 2026-05-13  8:32 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <7368ee25b1b45c92beb14c05be366b71da585ca4.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> verificationtest-ktap used a CWD-relative path (../ftrace/ftracetest)
> and a relative argument (../verification) for --rv.  This works when
> the shell changes into the verification directory first, but breaks
> when the script is invoked directly - e.g. by the kselftest runner or
> vng - because the working directory is the kernel source root, not the
> script's own directory.
> 
> Fix this by computing the script's directory from $0 with cd/dirname/pwd
> and using absolute paths for both the ftracetest invocation and the --rv
> argument.  Also export the directory to PATH so that check_requires in
> the ftracetest framework can locate helper binaries.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Just out of curiosity, how do you run the selftests?
Are you calling the script directly just to run /some/ of them?

The officially supported way is through make [1]:

  make -C tools/testing/selftests TARGETS=verification run_tests

(though I find it faster to omit TARGETS and just do make -C
tools/testing/selftests/verification).

Calling with make should set up all paths as needed.

> ---
>  tools/testing/selftests/verification/verificationtest-ktap | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/verification/verificationtest-ktap
> b/tools/testing/selftests/verification/verificationtest-ktap
> index 18f7fe324e2f..456b8578a307 100755
> --- a/tools/testing/selftests/verification/verificationtest-ktap
> +++ b/tools/testing/selftests/verification/verificationtest-ktap
> @@ -5,4 +5,6 @@
>  #
>  # Copyright (C) Arm Ltd., 2023
>  
> -../ftrace/ftracetest -K -v --rv ../verification
> +dir=$(cd "$(dirname "$0")" && pwd)
> +export PATH="$dir:$PATH"

Then if you really really need to call it directly, do you need to override
PATH?

And isn't it clearer to do:

  dir=$(realpath "$(dirname "$0")")

Thanks,
Gabriele

[1] - https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html

> +"$dir/../ftrace/ftracetest" -K -v --rv "$dir"


^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:54 UTC (permalink / raw)
  To: Lance Yang
  Cc: leitao, linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs,
	vbabka, rppt, surenb, mhocko, shuah, rostedt, mhiramat,
	mathieu.desnoyers, liam, linux-mm, linux-kernel, linux-doc,
	linux-kselftest, linux-trace-kernel, kernel-team
In-Reply-To: <20260512124837.38883-1-lance.yang@linux.dev>

On 5/12/26 14:48, Lance Yang wrote:
> 
> On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>  mm/memory-failure.c | 19 ++++++++++++++++++-
>>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	unsigned long page_flags;
>>>  	bool retry = true;
>>>  	int hugetlb = 0;
>>> +	bool is_reserved;
>>>  
>>>  	if (!sysctl_memory_failure_recovery)
>>>  		panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	 * In fact it's dangerous to directly bump up page count from 0,
>>>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>  	 */
>>> +	/*
>>> +	 * Pages with PG_reserved set are not currently managed by the
>>> +	 * page allocator (memblock-reserved memory, driver reservations,
>>> +	 * etc.), so classify them as kernel-owned for reporting.
>>> +	 *
>>> +	 * Sample the flag before get_hwpoison_page(): in the
>>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +	 * reference before returning -EIO, after which page->flags may
>>> +	 * have been reset by the allocator.
>>> +	 */
>>> +	is_reserved = PageReserved(p);
>>> +
>>>  	res = get_hwpoison_page(p, flags);
>>>  	if (!res) {
>>>  		if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>  		}
>>>  		goto unlock_mutex;
>>>  	} else if (res < 0) {
>>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +		if (is_reserved)
>>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +		else
>>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +					    MF_IGNORED);
>>>  		goto unlock_mutex;
>>>  	}
>>>  
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
> 
> Looks like we never get that far ...

Right, likely that should be removed+cleaned up then.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:53 UTC (permalink / raw)
  To: jane.chu, Breno Leitao, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Jonathan Corbet, Shuah Khan, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <816e3d8e-22d2-49a4-92ae-981568f38792@oracle.com>

On 5/12/26 19:58, jane.chu@oracle.com wrote:
> 
> 
> On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>   mm/memory-failure.c | 19 ++++++++++++++++++-
>>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>       unsigned long page_flags;
>>>       bool retry = true;
>>>       int hugetlb = 0;
>>> +    bool is_reserved;
>>>         if (!sysctl_memory_failure_recovery)
>>>           panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>        * In fact it's dangerous to directly bump up page count from 0,
>>>        * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>        */
>>> +    /*
>>> +     * Pages with PG_reserved set are not currently managed by the
>>> +     * page allocator (memblock-reserved memory, driver reservations,
>>> +     * etc.), so classify them as kernel-owned for reporting.
>>> +     *
>>> +     * Sample the flag before get_hwpoison_page(): in the
>>> +     * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +     * reference before returning -EIO, after which page->flags may
>>> +     * have been reset by the allocator.
>>> +     */
>>> +    is_reserved = PageReserved(p);
>>> +
>>>       res = get_hwpoison_page(p, flags);
>>>       if (!res) {
>>>           if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>           }
>>>           goto unlock_mutex;
>>>       } else if (res < 0) {
>>> -        res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +        if (is_reserved)
>>> +            res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +        else
>>> +            res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +                        MF_IGNORED);
>>>           goto unlock_mutex;
>>>       }
>>>  
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
>>
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
>>
> 
> Ah, good point!
> It seems to me that all unhandable pages should head out to identify_page_state:
> 
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
>          * In fact it's dangerous to directly bump up page count from 0,
>          * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>          */
> +
> +       if (!HWPoisonHandlable(page, flags)
> +               goto identify_page_state;
> +
>         res = get_hwpoison_page(p, flags);
>         if (!res) {
>                 if (is_free_buddy_page(p)) {

That's one option, or we just let get_hwpoison_page() return clearer error
codes, let it take care of checking PageReserved, and process the error codes
return by get_hwpoison_page() in a better way.

-- 
Cheers,

David

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox