Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH 7/8] riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
From: Wang Han @ 2026-05-27 12:35 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, linux-riscv, linux-kernel, linux-trace-kernel,
	live-patching, linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

Now that the metadata frame records, the kunwind state machine and
arch_stack_walk_reliable() are all in place, advertise the capability
to the rest of the kernel:

  * select HAVE_RELIABLE_STACKTRACE under FRAME_POINTER && 64BIT, so
    only the configurations that actually have the metadata records
    and the FP-based reliable walker enable it.
  * select HAVE_LIVEPATCH under the same condition and source
    kernel/livepatch/Kconfig so the livepatch menu is reachable from
    the RISC-V configuration.

This is split out from the unwinder change so the policy decision and
the implementation can be reviewed and reverted independently.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 674044754378..2921680d2132 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -185,6 +185,7 @@ config RISCV
 	select HAVE_KRETPROBES
 	# https://github.com/ClangBuiltLinux/linux/issues/1881
 	select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
+	select HAVE_LIVEPATCH if FRAME_POINTER && 64BIT
 	select HAVE_MOVE_PMD
 	select HAVE_MOVE_PUD
 	select HAVE_PAGE_SIZE_4KB
@@ -195,6 +196,7 @@ config RISCV
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_PREEMPT_DYNAMIC_KEY
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RELIABLE_STACKTRACE if FRAME_POINTER && 64BIT
 	select HAVE_RETHOOK
 	select HAVE_RSEQ
 	select HAVE_RUST if RUSTC_SUPPORTS_RISCV && CC_IS_CLANG
@@ -1394,3 +1396,5 @@ endmenu # "CPU Power Management"
 source "arch/riscv/kvm/Kconfig"
 
 source "drivers/acpi/Kconfig"
+
+source "kernel/livepatch/Kconfig"
-- 
2.43.0


^ permalink raw reply related

* [PATCH 8/8] selftests/livepatch: Add RISC-V syscall wrapper prefix
From: Wang Han @ 2026-05-27 12:35 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, linux-riscv, linux-kernel, linux-trace-kernel,
	live-patching, linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

The syscall livepatch selftest resolves and patches a syscall wrapper
symbol. To use that test for RISC-V livepatch validation, add the
RISC-V FN_PREFIX definition for ARCH_HAS_SYSCALL_WRAPPER.

Without this macro, the syscall livepatch selftest cannot resolve the
RISC-V target symbol, and the syscall-related livepatch test fails on
RISC-V.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 .../testing/selftests/livepatch/test_modules/test_klp_syscall.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
index dd802783ea84..275e4b10cf59 100644
--- a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
+++ b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
@@ -18,6 +18,8 @@
 #define FN_PREFIX __s390x_
 #elif defined(__aarch64__)
 #define FN_PREFIX __arm64_
+#elif defined(__riscv)
+#define FN_PREFIX __riscv_
 #else
 /* powerpc does not select ARCH_HAS_SYSCALL_WRAPPER */
 #define FN_PREFIX
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCHv6 bpf-next 01/29] ftrace: Add ftrace_hash_count function
From: bot+bpf-ci @ 2026-05-27 12:37 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260527113951.46265-2-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1087 bytes --]

> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 28b30c6f1031..282da661f131 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -551,6 +551,8 @@ int update_ftrace_direct_mod(struct ftrace_ops *ops, struct ftrace_hash *hash, b
>
>  void ftrace_stub_direct_tramp(void);
>
> +unsigned long ftrace_hash_count(struct ftrace_hash *hash);
> +
>  #else
>  struct ftrace_ops;
>  static inline unsigned long ftrace_find_rec_direct(unsigned long ip)

Is a fallback stub needed here? The function is declared under
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS but other direct call functions in
this header provide static inline fallbacks in the #else block. If an external
caller uses this function on a kernel configured without direct calls, would
this cause a build failure due to an implicit declaration?

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26509800686

^ permalink raw reply

* Re: [PATCHv6 bpf-next 22/29] selftests/bpf: Add tracing multi skel/pattern/ids module attach tests
From: bot+bpf-ci @ 2026-05-27 12:59 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260527113951.46265-23-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2238 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> index e290513358d9..746a8398211a 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c

[ ... ]

> @@ -241,6 +250,96 @@ static void test_link_api_ids(void)

[ ... ]

> +static void test_module_link_api_ids(void)
> +{
> +	size_t cnt = ARRAY_SIZE(bpf_testmod_fentry_test);
> +	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> +	struct tracing_multi_module *skel = NULL;
> +	__u32 *ids;
> +
> +	skel = tracing_multi_module__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "tracing_multi_module__open_and_load"))
> +		return;
> +
> +	skel->bss->pid = getpid();
> +
> +	ids = get_ids(bpf_testmod_fentry_test, cnt, "bpf_testmod");
> +	if (!ASSERT_OK_PTR(ids, "get_ids"))
> +		goto cleanup;

This new test calls get_ids() but there's an unaddressed review comment
from sashiko-bot about a memory leak in that function. Looking at get_ids()
in the same file:

    for (i = 0; i < funcs_cnt; i++) {
        if (!tsearch(&funcs[i], &root, compare)) {
            ASSERT_FAIL("tsearch failed");
            err = -1;
            goto out;
        }
    }

    ...

    if (err) {
        free(ids);
        ids = NULL;
    }

out:
    tdestroy(root, tdestroy_free_nop);
    btf__free(vmlinux_btf);
    btf__free(btf);
    return ids;

If tsearch() fails, the code sets err = -1 and jumps to the out label.
However, the out label is positioned after the cleanup block that frees ids
(the "if (err)" block), not before it. This means the cleanup code is
bypassed when we goto out, causing two issues:

1. The ids array is leaked

2. The function returns a non-NULL pointer to allocated but potentially
   invalid memory, which will incorrectly pass the ASSERT_OK_PTR check above

Should the out label be moved before the cleanup block so that the error
path properly frees the ids array and sets it to NULL?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26509800686

^ permalink raw reply

* Re: [PATCHv6 bpf-next 04/29] bpf: Use mutex lock pool for bpf trampolines
From: bot+bpf-ci @ 2026-05-27 12:59 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260527113951.46265-5-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3093 bytes --]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1c6863ce89e0..db89b3a591fa 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h

[ ... ]

> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> index a4298a25d4ba..df7634de0ca2 100644
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -30,6 +30,35 @@ static struct hlist_head trampoline_ip_table[TRAMPOLINE_TABLE_SIZE];
>  /* serializes access to trampoline tables */
>  static DEFINE_MUTEX(trampoline_mutex);
>
> +/*
> + * Keep 32 trampoline locks (5 bits) in the pool so trampoline_lock_all()
> + * stays below MAX_LOCK_DEPTH.  Each pool slot has a distinct lockdep
> + * class because trampoline_lock_all() takes all pool mutexes at once;
> + * otherwise lockdep would report recursive locking on same-class mutexes.
> + */
> +#define TRAMPOLINE_LOCKS_BITS 5
> +#define TRAMPOLINE_LOCKS_TABLE_SIZE (1 << TRAMPOLINE_LOCKS_BITS)
> +
> +static struct {
> +	struct mutex mutex;
> +	struct lock_class_key key;
> +} trampoline_locks[TRAMPOLINE_LOCKS_TABLE_SIZE];
> +
> +static struct mutex *select_trampoline_lock(struct bpf_trampoline *tr)
> +{
> +	return &trampoline_locks[hash_64((u64)(uintptr_t) tr, TRAMPOLINE_LOCKS_BITS)].mutex;
> +}

[ ... ]

> @@ -99,12 +128,12 @@ static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops, unsigned long ip,
>  	 *   cleanup_direct_functions_after_ipmodify
>  	 *
>  	 * In both cases, direct_mutex is already locked. Use
> -	 * mutex_trylock(&tr->mutex) to avoid deadlock in race condition
> -	 * (something else is making changes to this same trampoline).
> +	 * mutex_trylock(select_trampoline_lock(tr)) to avoid deadlock in race condition
> +	 * (something else holds the same pool lock).
>  	 */
> -	if (!mutex_trylock(&tr->mutex)) {
> -		/* sleep 1 ms to make sure whatever holding tr->mutex makes
> -		 * some progress.
> +	if (!mutex_trylock(select_trampoline_lock(tr))) {
> +		/* sleep 1 ms to make sure whatever holding select_trampoline_lock(tr)
> +		 * makes some progress.
>  		 */
>  		msleep(1);
>  		return -EAGAIN;

With 32 trampolines sharing pool locks, can mutex_trylock fail spuriously
when an unrelated trampoline holds the same bucket?

If the trylock fails, the function returns -EAGAIN. Could this mean
concurrent ftrace or kprobe registrations might randomly fail due to false
sharing in the pool?

Also, when the upcoming tracing_multi feature locks all 32 mutexes at once,
wouldn't that deterministically abort any concurrent ftrace IPMODIFY
registrations system-wide?

Finally, if this trylock failure happens during unregistration, could
ftrace cleanup routines ignore the -EAGAIN, potentially leaving the
BPF_TRAMP_F_SHARE_IPMODIFY flag permanently set and corrupting the
trampoline state?

Reference: https://lore.kernel.org/bpf/20260418035227.168AEC19424@smtp.kernel.org/


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26509800686

^ permalink raw reply

* Re: [PATCHv6 bpf-next 10/29] bpf: Add multi tracing attach types
From: bot+bpf-ci @ 2026-05-27 12:59 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260527113951.46265-11-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 4405 bytes --]

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 435ae6efe961..27ad0c265c95 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -41,6 +41,7 @@
>  #include <linux/overflow.h>
>  #include <linux/cookie.h>
>  #include <linux/verification.h>
> +#include <linux/btf_ids.h>
>
>  #include <net/netfilter/nf_bpf_link.h>
>  #include <net/netkit.h>

[ ... ]

> @@ -2904,6 +2914,11 @@ static int bpf_prog_mark_insn_arrays_ready(struct bpf_prog *prog)
>  	return 0;
>  }
>
> +extern int bpf_multi_func(void);
> +int __init __used bpf_multi_func(void) { return 0; }

The signature change from v5 to v6 creates a potential issue. In v5,
bpf_multi_func had 6 arguments (matching the initial verifier setup):

  int __init __used bpf_multi_func(u64 r1, u64 r2, u64 r3,
                                   u64 r4, u64 r5, u64 r6)

Now it has no arguments. The Sashiko AI review in v5 identified that
context array bounds checking could fail when bpf_multi_func's argument
count doesn't match the attached function's argument count, and the
author acknowledged the need to disable direct context access.

Since the initial verification happens against bpf_multi_func (which now
has 0 arguments), but the verifier previously allowed direct ctx[0-5]
accesses based on the 6-argument version, is there still a path where
direct context accesses could be allowed? If so, would accessing ctx[0]
on a function with 0 arguments lead to out-of-bounds kernel stack reads?

The earlier review suggested either:
  (a) explicitly disabling direct memory accesses to the ctx array for
      FENTRY_MULTI and FEXIT_MULTI, or
  (b) enforcing the use of bounds-checked helpers like bpf_get_func_arg()

Was this implemented somewhere in the patch series?

> +
> +BTF_ID_LIST_GLOBAL_SINGLE(bpf_multi_func_btf_id, func, bpf_multi_func)

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index c8d980fdd709..58450cb88beb 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[ ... ]

> @@ -18837,6 +18839,11 @@ static int check_attach_modify_return(unsigned long addr, const char *func_name)
>
>  #endif /* CONFIG_FUNCTION_ERROR_INJECTION */
>
> +static bool is_tracing_multi_id(const struct bpf_prog *prog, u32 btf_id)
> +{
> +	return is_tracing_multi(prog->expected_attach_type) && bpf_multi_func_btf_id[0] == btf_id;
> +}
> +
>  int bpf_check_attach_target(struct bpf_verifier_log *log,
>  			    const struct bpf_prog *prog,
>  			    const struct bpf_prog *tgt_prog,

[ ... ]

> @@ -19094,7 +19105,18 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
>  		if (ret < 0)
>  			return ret;
>
> -		if (tgt_prog) {
> +		/*
> +		 * *.multi programs don't need an address during program
> +		 * verification, we just take the module ref if needed.
> +		 */
> +		if (is_tracing_multi_id(prog, btf_id)) {
> +			if (btf_is_module(btf)) {
> +				mod = btf_try_get_module(btf);
> +				if (!mod)
> +					return -ENOENT;
> +			}
> +			addr = 0;
> +		} else if (tgt_prog) {
>  			if (subprog == 0)
>  				addr = (long) tgt_prog->bpf_func;
>  			else

[ ... ]

> @@ -19122,6 +19144,12 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
>  			ret = -EINVAL;
>  			switch (prog->type) {
>  			case BPF_PROG_TYPE_TRACING:
> +				/* *.multi sleepable programs will pass initial sleepable check,
> +				 * the actual attached btf ids are checked later during the link
> +				 * attachment.
> +				 */
> +				if (is_tracing_multi_id(prog, btf_id))
> +					ret = 0;
>  				if (!check_attach_sleepable(btf_id, addr, tname))
>  					ret = 0;

The multi types are not included in btf_validate_prog_ctx_type() in
kernel/bpf/btf.c. That function validates which attach types allow u64*
as their ctx parameter:

kernel/bpf/btf.c:btf_validate_prog_ctx_type() {
    case BPF_TRACE_FENTRY:
    case BPF_TRACE_FEXIT:
    case BPF_MODIFY_RETURN:
        ...
}

Since BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI also use u64*
context, do they need to be added to that validation list? Or if direct
context access should be disabled for multi types (per the earlier
review concern), should they be handled with special rejection logic?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26509800686

^ permalink raw reply

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-05-27 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <agXBb0ga_6HJrrnm@shell.ilvokhin.com>

Hi Peter,

Gentle ping on this. I wanted to check if the assembly analysis in my
previous reply changed the picture at all.

You were right that the commit message was misleading about the total
size increase: it's 9 bytes per call site, not just the NOP.

That said, when I looked at the executed path with the tracepoint
disabled, the only addition is the 2-byte NOP (xchg %ax,%ax).

Both the baseline and instrumented _raw_spin_unlock() fit within a
single 64-byte cache line, and I wasn't able to measure any difference
with locktorture: lock() cost completely dominates, unlock() accounts
for less than 1% of the total, so any overhead is indistinguishable from
noise.

If the cost is still a concern, I see two possible paths forward:

1. Guard the spinlock/qrwlock instrumentation behind a Kconfig option
   (disabled by default), so only kernels that explicitly opt in pay
   the cost.

2. Drop the spinlock/qrwlock instrumentation entirely and keep
   contended_release for sleepable locks only.

Happy to go whichever direction you prefer.

^ permalink raw reply

* Re: [PATCH v21 8/9] ring-buffer: Show persistent buffer dropped events in trace file
From: Steven Rostedt @ 2026-05-27 13:35 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Ian Rogers
In-Reply-To: <20260527124721.d05102c2f45e6c5bb5fbe476@kernel.org>

On Wed, 27 May 2026 12:47:21 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Yeah, for the persistent ring buffer, it does not happen.
> But there seems RB_MISSED_EVENTS bit can be cleared in
> "else" path (after applying 1-8 patches)?

Note, *only* the persistent ring buffer adds RB_MISSED_EVENTS to the pages
in the write buffer. In the normal buffer, these bits are only set by this
function. That is, they would not be set from the swap of pages.

> 
> ----------
> 	if (read || (len < (commit - read)) ||
> 	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
> 	    force_memcpy) { // <-- persistent ring buffer sets force_memcpy = true.
> [...]
> 	} else {
> 		/* update the entry counter */
> [...]
> 		if (!missed_events && rb_data_page_commit(dpage) & RB_MISSED_EVENTS)
> 			missed_events = -1;	
> 			//^-- we check RB_MISSED_EVENTS bit on @dpage->commit and set missed_events = -1.
> 
> 		/*
> 		 * Use the real_end for the data size,
> 		 * This gives us a chance to store the lost events
> 		 * on the page.
> 		 */
> 		if (reader->real_end)
> 			local_set(&dpage->commit, reader->real_end);
>                         // ^- only if @reader->real_end, RB_MISSED_EVENTS bit is dropped.

Because this isn't a persistent ring buffer (if it was, as you noted,
force_memcpy would be true and we wouldn't enter the else path), the
RB_MISSED_EVENTS bit in the commit would never be set here. It is *only* set
by the verifier of the persistent ring buffer logic.

> 	}
> 
> 	cpu_buffer->lost_events = 0;
> 
> 	commit = rb_data_page_commit(dpage);
> 	/*
> 	 * Set a flag in the commit field if we lost events
> 	 */
> 	if (missed_events) {
> 		/*
> 		 * If there is room at the end of the page to save the
> 		 * missed events, then record it there.
> 		 */
> 		if (missed_events > 0 &&
> 		    buffer->subbuf_size - commit >= sizeof(missed_events)) {
> 			memcpy(&dpage->data[commit], &missed_events,
> 			       sizeof(missed_events));
> 			local_add(RB_MISSED_STORED, &dpage->commit);
> 			commit += sizeof(missed_events);
> 		}
> 		local_add(RB_MISSED_EVENTS, &dpage->commit); // <-- @dpage->commit is updated.
> 	}

And this is the first place it would get set.

But yeah, it is very confusing and needs better comments.

Thanks,

-- Steve

^ permalink raw reply

* [PATCH v8 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

The first entry of error_states[],

	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },

is unreachable.  identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():

  * memory_failure() reaches identify_page_state() only after
    get_hwpoison_page() returned 1.  get_any_page() reaches that
    return only via __get_hwpoison_page(), which only takes a
    refcount when the page is HWPoisonHandlable().
    HWPoisonHandlable() is an allowlist for LRU, free-buddy, and
    (for soft-offline) movable_ops pages -- PG_reserved pages do
    not satisfy any of these, so they fail with -EBUSY/-EIO long
    before identify_page_state() runs.

  * try_memory_failure_hugetlb() reaches identify_page_state() only
    via the MF_HUGETLB_IN_USED branch, where the page is necessarily
    a hugetlb folio.  hugetlb folios don't carry PG_reserved at that
    point: hugetlb_folio_init_vmemmap() calls __folio_clear_reserved()
    during init, so the reserved entry would not match even if it
    were still present.

me_kernel() never executes and the entry exists only to be matched
against by code that cannot see it.

Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.

No functional change.

Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c405..f4d3e6e20e13 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -980,17 +980,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
 	return false;
 }
 
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
-	unlock_page(p);
-	return MF_IGNORED;
-}
-
 /*
  * Page in unknown state. Do nothing.
  * This is a catch-all in case we fail to make sense of the page state.
@@ -1199,10 +1188,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 #define mlock		(1UL << PG_mlocked)
 #define lru		(1UL << PG_lru)
 #define head		(1UL << PG_head)
-#define reserved	(1UL << PG_reserved)
 
 static struct page_state error_states[] = {
-	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
 	 * PG_buddy pages only make a small fraction of all free pages.
@@ -1234,7 +1221,6 @@ static struct page_state error_states[] = {
 #undef mlock
 #undef lru
 #undef head
-#undef reserved
 
 static void update_per_node_mf_stats(unsigned long pfn,
 				     enum mf_result result)

-- 
2.54.0


^ permalink raw reply related

* [PATCH v8 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang

A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running.  The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash.  In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.

This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context.  The default is disabled so existing workloads see no
change.

There is a selftest that test different cases, and I tested it using
the following variants:

  ┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
  │ Variant │   PFN    │                          Result                           │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ rodata  │ 0x2600   │ Panic with "Memory failure: 0x2600: unrecoverable page"   │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ slab    │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
  └─────────┴──────────┴───────────────────────────────────────────────────────────┘

Each one shows the same call trace, exactly the path the series builds:

  hard_offline_page_store
    → memory_failure
      → action_result
        → panic("Memory failure: %#lx: unrecoverable page")

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v8:
- Commit message rewording (David)
- Add HWPoisonKernelOwned() helper (Lance)
- Removed patch "mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()"
- Broaden the selftest (Lance)
- Link to v7: https://patch.msgid.link/20260513-ecc_panic-v7-0-be2e578e61da@debian.org

Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
  get_any_page() and surface it via -ENOTRECOVERABLE, per David
  Hildenbrand's and Lance Yang's review of v6.  This drops the
  is_reserved snapshot in memory_failure() and the mf_get_page_status
  enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
  over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
  matching tracepoint string; the enum now covers both PG_reserved
  pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
  and 2/4 ("classify get_any_page() failures by reason") into a
  single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
  (David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org

Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org

Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
  unrecoverable kernel page hwpoison events (reserved pages, refcount-0
  non-buddy pages, unknown state), with a recheck to avoid racing with
  concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
  document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
  and add a selftest verifying SIGBUS recovery on userspace pages still
  works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
  https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org

Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (6):
      mm/memory-failure: drop dead error_states[] entry for reserved pages
      mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
      mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl
      selftests/mm: add hwpoison-panic destructive test

 Documentation/admin-guide/sysctl/vm.rst      |  85 ++++++++++++
 mm/memory-failure.c                          |  96 ++++++++++---
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 193 +++++++++++++++++++++++++++
 4 files changed, 357 insertions(+), 18 deletions(-)
---
base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply

* [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

get_any_page() collapses every HWPoisonHandlable() rejection into a
single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
-> retry path.  That is correct for the transient case (a userspace
folio briefly off LRU during migration or compaction, which a later
shake can drag back), but wrong for stable kernel-owned pages: slab,
page-table, large-kmalloc and PG_reserved pages will never become
HWPoisonHandlable(), so the retry loop is wasted work and the final
-EIO loses the "this is structurally unrecoverable" information.
memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
panic-on-unrecoverable sysctl deliberately does not act on.

Introduce HWPoisonKernelOwned(), a small predicate that positively
identifies pages the hwpoison handler cannot recover from:

  HWPoisonKernelOwned(p, flags) :=
      !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
      (PageReserved(p) || PageSlab(p) ||
       PageTable(p)    || PageLargeKmalloc(p))

The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
same exception in HWPoisonHandlable(): soft-offline is allowed to
migrate movable_ops pages even though they are not on the LRU, and
we must not pre-empt that with an unrecoverable verdict.

The list is intentionally not exhaustive.  vmalloc and kernel-stack
pages, for example, do not carry a page_type bit and would need a
different oracle; they keep going through the existing retry path
unchanged.  This is the smallest set we can identify with certainty
by page type.

Wire the helper into the top of get_any_page() to short-circuit
those pages before the retry loop runs.  On a hit, drop the caller's
MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
straight away.  Pages outside the helper's positive list still take
the existing retry path and return -EIO, leaving operator-visible
behaviour for those cases unchanged.

Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc to document the new return.

memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch on its own only changes the errno that soft_offline_page()
can propagate to its callers.  A follow-up wires -ENOTRECOVERABLE
through memory_failure() and reports MF_MSG_KERNEL for the
unrecoverable cases, which is what the
panic_on_unrecoverable_memory_failure sysctl observes.

Suggested-by: David Hildenbrand <david@kernel.org>
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f4d3e6e20e13..8f63bdfeff8f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1325,6 +1325,28 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 	return PageLRU(page) || is_free_buddy_page(page);
 }
 
+/*
+ * Positive identification of pages the hwpoison handler cannot recover.
+ * These page types are owned by kernel internals (no userspace mapping
+ * to unmap, no file mapping to invalidate, no migration target), so the
+ * shake_page() / retry loop in get_any_page() can never turn them into
+ * something HWPoisonHandlable() will accept.  Short-circuit them to
+ * -ENOTRECOVERABLE so callers can panic on operator request instead of
+ * spinning through retries that exit as a transient-looking -EIO.
+ *
+ * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
+ * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
+ * pages even though they are not on the LRU.
+ */
+static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
+{
+	if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
+		return false;
+
+	return PageReserved(page) || PageSlab(page) ||
+	       PageTable(page) || PageLargeKmalloc(page);
+}
+
 static int __get_hwpoison_page(struct page *page, unsigned long flags)
 {
 	struct folio *folio = page_folio(page);
@@ -1371,6 +1393,19 @@ static int get_any_page(struct page *p, unsigned long flags)
 	if (flags & MF_COUNT_INCREASED)
 		count_increased = true;
 
+	/*
+	 * Page types we know are kernel-owned and cannot be recovered.
+	 * Short-circuit before the shake_page() / retry loop, which
+	 * cannot turn any of these into something HWPoisonHandlable().
+	 * Drop the caller's reference if MF_COUNT_INCREASED took one.
+	 */
+	if (HWPoisonKernelOwned(p, flags)) {
+		if (count_increased)
+			put_page(p);
+		ret = -ENOTRECOVERABLE;
+		goto out;
+	}
+
 try_again:
 	if (!count_increased) {
 		ret = __get_hwpoison_page(p, flags);
@@ -1418,7 +1453,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 		ret = -EIO;
 	}
 out:
-	if (ret == -EIO)
+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
 
 	return ret;
@@ -1475,7 +1510,10 @@ static int __get_unpoison_page(struct page *page)
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
  *         operations like allocation and free,
- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ *         -ENOTRECOVERABLE for kernel-owned pages identified by
+ *         HWPoisonKernelOwned() (PG_reserved, slab,
+ *         page-table, large-kmalloc) that the handler cannot recover.
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {

-- 
2.54.0


^ permalink raw reply related

* [PATCH v8 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, page tables,
large-kmalloc).  memory_failure() still folds every negative return
into MF_MSG_GET_HWPOISON, so callers that want to react to the
unrecoverable cases (a panic option, smarter logging) cannot tell
them apart from transient page-allocator races.

Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.

The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, page tables and large-kmalloc
pages.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8f63bdfeff8f..14c0a958638c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2426,7 +2426,8 @@ int memory_failure(unsigned long pfn, int flags)
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
 	res = get_hwpoison_page(p, flags);
-	if (!res) {
+	switch (res) {
+	case 0:
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -2445,7 +2446,19 @@ int memory_failure(unsigned long pfn, int flags)
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;
-	} else if (res < 0) {
+	case 1:
+		/* Got a refcount on a handlable page. */
+		break;
+	case -ENOTRECOVERABLE:
+		/*
+		 * Stable unhandlable kernel-owned page (PG_reserved,
+		 * slab, page tables, large-kmalloc).
+		 * No recovery possible.
+		 */
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	default:
+		/* Transient lifecycle race with the page allocator. */
 		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 		goto unlock_mutex;
 	}

-- 
2.54.0

^ permalink raw reply related

* [PATCH v8 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
default) that triggers a kernel panic when memory_failure()
encounters pages that cannot be recovered.  This provides a clean
crash with useful debug information rather than allowing silent
data corruption or a delayed crash at an unrelated code path.

Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
large-kmalloc).

All other action types are excluded:

- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
  transient refcount races with the page allocator (an in-flight buddy
  allocation has refcount 0 and is no longer on the buddy free list,
  briefly), and panicking on them would risk killing the box for what
  is actually a recoverable userspace page.

- MF_MSG_UNKNOWN means identify_page_state() could not classify the
  page; that is precisely the wrong basis for a panic decision.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 14c0a958638c..dcd53dbc6aec 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
+		return false;
+
+	return type == MF_MSG_KERNEL;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1272,6 +1292,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH v8 5/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 85 +++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c..f71d87039904 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- panic_on_unrecoverable_memory_failure
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -925,6 +926,90 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation.  This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting kernel-owned pages that the handler cannot recover:
+``PageReserved`` (firmware reservations, kernel image, vDSO, zero
+page, and similar memblock-reserved regions), ``PageSlab``,
+``PageTable``, and ``PageLargeKmalloc``.  These are owned by the
+kernel and the memory failure handler cannot reliably evict their
+contents.
+
+For soft offline (``madvise(MADV_SOFT_OFFLINE)``,
+``/sys/devices/system/memory/soft_offline_page``), pages owned by
+``movable_ops`` are exempted, since soft offline is allowed to
+migrate them even though they are not on the LRU.
+
+Other unrecoverable kernel-owned populations (vmalloc allocations,
+kernel stack pages, ...) are not currently covered because the
+handler has no page-type signal that distinguishes them from a
+userspace folio temporarily off the LRU during migration or
+compaction.  Such pages still go through the standard
+MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them and a
+delayed crash on the next access remains possible.  Coverage may
+grow as the handler gains stronger kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies.  Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+  regularly and post-mortem analysis of an unrelated downstream crash
+  (often seconds to minutes after the original error) consumes
+  significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+  hardware error produces a vmcore that still contains the faulting
+  address, the affected page state, and the originating MCE/GHES
+  record — context that is typically lost by the time a delayed crash
+  occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+  failure for failover, and prefer an immediate panic over silent data
+  corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+  tools such as ``mce-inject`` or error-injection debugfs interfaces,
+  where panicking on the unrecoverable path makes regressions
+  immediately visible instead of surfacing as later, unrelated
+  failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately.  If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+Example::
+
+     echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
 percpu_pagelist_high_fraction
 =============================
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH v8 6/6] selftests/mm: add hwpoison-panic destructive test
From: Breno Leitao @ 2026-05-27 14:06 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

Add a destructive selftest that verifies
vm.panic_on_unrecoverable_memory_failure actually panics when a
hwpoison error hits a kernel-owned page.

Three "kinds" of kernel-owned page can be targeted, selectable via
the script's first positional argument (default: rodata):

  rodata  - a PG_reserved page in the kernel rodata range, sourced
            from the "Kernel rodata" sub-resource of "System RAM" in
            /proc/iomem.  That entry is reported on every major
            architecture and guarantees the chosen PFN is backed by
            struct page (an online System RAM range, not a firmware
            hole), is PG_reserved, and is read-only -- so even if
            the panic fails to fire for some reason, the resulting
            PG_hwpoison marker on rodata does not corrupt writable
            kernel state.

  slab    - a slab page found by walking /proc/kpageflags for the
            first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
            / KPF_COMPOUND_TAIL clear).  Exercises the get_any_page()
            path on a non PG_reserved kernel-owned page and so
            catches regressions where get_any_page() collapses
            kernel-owned pages into a transient -EIO instead of
            -ENOTRECOVERABLE.

  pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.

PageLargeKmalloc, the fourth page type matched by
HWPoisonKernelOwned(), is intentionally not covered: it is a
PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
a PFN from userspace is not feasible.  The slab and pgtable
variants already exercise the same get_any_page() positive-check
branch.

The script enables the sysctl and writes the selected physical
address to /sys/devices/system/memory/hard_offline_page.  A
successful run crashes the kernel with

  Memory failure: <pfn>: unrecoverable page

A return from the inject means the panic did not fire and the test
fails.  Test outcome is therefore observed externally (serial
console, kdump) rather than from the script's own exit code.

The script is intentionally NOT wired into run_vmtests.sh: every
successful run panics the kernel, which is incompatible with the
sequential "run each category in the same VM" model that
run_vmtests.sh assumes.  It is also not registered as a TEST_PROGS /
ksft_* wrapper so a default kselftest run does not opt itself into
a panic.  The script is meant to be executed manually inside a
disposable VM (e.g. virtme-ng), one variant per VM boot, and
requires RUN_DESTRUCTIVE=1 in the environment as a safety net.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 193 +++++++++++++++++++++++++++
 2 files changed, 194 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971..170e376c97b4 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -181,6 +181,7 @@ TEST_FILES += charge_reserved_hugetlb.sh
 TEST_FILES += hugetlb_reparenting_test.sh
 TEST_FILES += test_page_frag.sh
 TEST_FILES += run_vmtests.sh
+TEST_FILES += hwpoison-panic.sh
 
 # required by charge_reserved_hugetlb.sh
 TEST_FILES += write_hugetlb_memory.sh
diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
new file mode 100755
index 000000000000..43fc379f8761
--- /dev/null
+++ b/tools/testing/selftests/mm/hwpoison-panic.sh
@@ -0,0 +1,193 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
+# error on a kernel-owned page and confirming the kernel panics.
+#
+# Three "kinds" of kernel-owned page can be targeted, selectable via the
+# first positional argument (default: rodata):
+#
+#   rodata  - a PG_reserved page in the kernel rodata range
+#             (sourced from /proc/iomem "Kernel rodata").  Exercises
+#             memory_failure() -> get_any_page() on a PageReserved page.
+#
+#   slab    - a slab page found via /proc/kpageflags (KPF_SLAB).
+#             Exercises memory_failure() -> get_any_page() on a non
+#             PG_reserved kernel-owned page.  This path is what catches
+#             regressions where get_any_page() collapses kernel-owned
+#             pages into a transient -EIO instead of -ENOTRECOVERABLE.
+#
+#   pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
+#             Same path as slab, different page type.
+#
+# This test is DESTRUCTIVE: a successful run crashes the kernel.  It is
+# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
+# serial console captured by the harness.  It is skipped unless the
+# caller opts in via RUN_DESTRUCTIVE=1.
+#
+# Test passes externally: the kernel must panic with
+#   "Memory failure: <pfn>: unrecoverable page"
+# A return from the inject means the panic did not fire and the test
+# fails.
+#
+# Author: Breno Leitao <leitao@debian.org>
+
+set -u
+
+ksft_skip=4
+sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
+inject_path=/sys/devices/system/memory/hard_offline_page
+kpageflags_path=/proc/kpageflags
+
+# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
+KPF_SLAB=7
+KPF_COMPOUND_TAIL=16
+KPF_HWPOISON=19
+KPF_NOPAGE=20
+KPF_PGTABLE=26
+
+kind=${1:-rodata}
+
+ksft_print() { echo "# $*"; }
+ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
+ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
+
+if [ "$(id -u)" -ne 0 ]; then
+	ksft_exit_skip "must run as root"
+fi
+
+if [ ! -w "$sysctl_path" ]; then
+	ksft_exit_skip "$sysctl_path not present (kernel without the sysctl?)"
+fi
+
+if [ ! -w "$inject_path" ]; then
+	ksft_exit_skip "$inject_path not present (no MEMORY_HOTPLUG?)"
+fi
+
+if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
+	ksft_exit_skip "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
+fi
+
+# Pick a PFN inside the kernel image rodata region of /proc/iomem.
+# This is preferred over a top-level "Reserved" entry because top-level
+# Reserved ranges are often firmware holes that have no backing struct
+# page; pfn_to_online_page() returns NULL on those and memory_failure()
+# bails out with -ENXIO before reaching the panic path.
+#
+# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
+# major architecture, which guarantees:
+#   - the PFN is backed by struct page (within an online memory range);
+#   - PG_reserved is set on the page (kernel image area);
+#   - the memory is read-only, so setting PG_hwpoison on it does not
+#     corrupt writable kernel state if the panic somehow does not fire.
+#
+# /proc/iomem entries look like (indented for sub-resources):
+#     "  02500000-02ffffff : Kernel rodata"
+pick_rodata_phys_addr() {
+	awk -v pagesize="$(getconf PAGE_SIZE)" '
+	/: Kernel rodata[[:space:]]*$/ {
+		sub(/^[[:space:]]+/, "")
+		n = split($0, a, /[- ]/)
+		start = strtonum("0x" a[1])
+		end   = strtonum("0x" a[2])
+		if (end <= start)
+			next
+		# Page-align upward and emit the first byte of that page.
+		pfn = int((start + pagesize - 1) / pagesize)
+		printf "0x%x\n", pfn * pagesize
+		exit 0
+	}
+	' /proc/iomem
+}
+
+# Walk /proc/kpageflags and return the phys addr of the first PFN that
+# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
+# all clear (so we attack a real, non-tail, not-already-poisoned page).
+#
+# We skip the first 16 MiB of PFNs to step past low-memory special
+# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
+# exhibit the slab/pgtable type we are looking for.
+pick_kpageflags_phys_addr() {
+	local want_bit=$1
+	local pagesize skip_pfn
+
+	[ -r "$kpageflags_path" ] || return
+
+	pagesize=$(getconf PAGE_SIZE)
+	skip_pfn=$(((16 * 1024 * 1024) / pagesize))
+
+	od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
+	awk -v want_bit="$want_bit" \
+	    -v hwp_bit="$KPF_HWPOISON" \
+	    -v nopage_bit="$KPF_NOPAGE" \
+	    -v tail_bit="$KPF_COMPOUND_TAIL" \
+	    -v base_pfn="$skip_pfn" \
+	    -v pagesize="$pagesize" '
+	# Test whether bit "b" is set in the 16-hex-digit value "hex".
+	# Done with substring + per-digit lookup so we never rely on awk
+	# bitwise operators (mawk lacks them) or 64-bit FP precision.
+	function bit_set(hex, b,    di, bi, c, v) {
+		di = int(b / 4)
+		bi = b - di * 4
+		c = substr(hex, length(hex) - di, 1)
+		v = strtonum("0x" c)
+		if (bi == 0) return (v % 2) == 1
+		if (bi == 1) return int(v / 2) % 2 == 1
+		if (bi == 2) return int(v / 4) % 2 == 1
+		return int(v / 8) % 2 == 1
+	}
+	{
+		gsub(/^[[:space:]]+/, "")
+		h = $1
+		if (bit_set(h, want_bit) &&
+		    !bit_set(h, hwp_bit) &&
+		    !bit_set(h, nopage_bit) &&
+		    !bit_set(h, tail_bit)) {
+			pfn = base_pfn + NR - 1
+			printf "0x%x\n", pfn * pagesize
+			exit 0
+		}
+	}
+	'
+}
+
+case "$kind" in
+rodata)
+	phys_addr=$(pick_rodata_phys_addr)
+	missing_msg='no "Kernel rodata" entry in /proc/iomem'
+	;;
+slab)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
+	missing_msg="no usable slab PFN found in $kpageflags_path"
+	;;
+pgtable)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
+	missing_msg="no usable page-table PFN found in $kpageflags_path"
+	;;
+*)
+	ksft_exit_fail "unknown kind '$kind' (expected: rodata|slab|pgtable)"
+	;;
+esac
+
+if [ -z "$phys_addr" ]; then
+	ksft_exit_skip "$missing_msg"
+fi
+
+ksft_print "enabling $sysctl_path"
+prior=$(cat "$sysctl_path")
+echo 1 > "$sysctl_path" || ksft_exit_fail "failed to enable sysctl"
+
+ksft_print "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (kind=$kind)"
+ksft_print "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
+
+# If this returns, the kernel did not panic → test failed.  Restore the
+# sysctl before reporting so the system is left as we found it.
+if echo "$phys_addr" > "$inject_path"; then
+	echo "$prior" > "$sysctl_path"
+	ksft_exit_fail "inject returned without panic; sysctl ineffective"
+fi
+
+# Write failed (e.g. -EINVAL on offlining a non-online region): also a
+# failure for this test, since we expected the panic path.
+echo "$prior" > "$sysctl_path"
+ksft_exit_fail "inject failed before reaching the panic path"

-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-27 14:08 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
	Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
	Ian Rogers
In-Reply-To: <ahayVg7TvNrf1ama@krava>

On Wed, 27 May 2026 10:59:02 +0200
Jiri Olsa <olsajiri@gmail.com> wrote:

> 
> hi,
> this seems to be supported only for argument (pointer) stored in the trace record,
> not the actual arguments to the tracepoint, is that right?
> 
> so I can deref worker from sched.sched_kthread_work_queue_work, like:
> 
>   echo 'e:myprobe sched.sched_kthread_work_queue_work (kthread_worker)worker->flags (kthread_work)work->canceling' > dynamic_events

Correct, that is because eprobes "e:" works on the output of a trace event.


> 
> but I can't deref sched.sched_process_exec p->pid, like:
> 
>   # echo 'e:myprobe sched.sched_process_exec (task_struct)p->pid' > dynamic_events
>   bash: echo: write error: Invalid argument

For function prototypes of a tracepoint, you would use a tprobe "t:"

 # echo 't:exec sched_process_exec pid=p->pid' > dynamic_events
 # echo 1 > events/tracepoints/exec/enable
 # cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 7/7   #P:8
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
    rtkit-daemon-1935    [005] .....   105.350235: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1935
    rtkit-daemon-1935    [005] .....   105.376609: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1935
 pkla-check-auth-1939    [000] .....   105.404491: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1939
 at-spi-bus-laun-1953    [000] .....   105.914139: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1953
     dbus-daemon-1959    [002] .....   105.919123: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1959
        Xwayland-1961    [006] .....   106.175491: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1961
           <...>-1962    [005] .....   107.406472: exec: (__probestub_sched_process_exec+0x4/0x10) pid=1962

No need for typecasting either ;-)


> > +	ctx->offset += tmp - arg;
> > +	ret = parse_btf_arg(tmp, pcode, end, ctx);
> > +	ctx->flags &= ~TPARG_FL_TYPECAST;
> > +	ctx->last_struct = NULL;
> > +out_put:
> > +	btf_put(ctx->struct_btf);  
> 
> 
> should we zero ctx->struct_btf in case there's more type casts,
> so query_btf_struct would re-init it?

Yeah, I already mentioned that mistake:

  https://lore.kernel.org/all/20260522072322.18aa72dd@gandalf.local.home/

>> Oops, I forgot to do:
>> 
>> 	ctx->struct_buf = NULL;
>> 
>> here.
>> 
>> Will fix.

Thanks for the review!

-- Steve

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Steven Rostedt @ 2026-05-27 14:09 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260527095438.1794905-1-dtcccc@linux.alibaba.com>

On Wed, 27 May 2026 17:54:37 +0800
Tianchen Ding <dtcccc@linux.alibaba.com> wrote:

> On ARM64 kernels with 64K pages, the trace_marker_raw test fails because
> bash's printf builtin uses stdio buffering which splits output into
> multiple small write() calls to the tracefs file. Since each individual
> write is within TRACE_MARKER_MAX_SIZE (4096), they all succeed, causing
> the "too big" write test to incorrectly pass.
> 
> Fix by piping make_str output through dd with iflag=fullblock to
> guarantee a single atomic write() syscall to trace_marker_raw.
> 
> Fixes: 37f46601383a ("selftests/tracing: Add basic test for trace_marker_raw file")
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> ---
>  .../selftests/ftrace/test.d/00basic/trace_marker_raw.tc     | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> index 8e905d4fe6dd..efd8263e6087 100644
> --- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> +++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> @@ -43,8 +43,10 @@ write_buffer() {
>  	id=$1
>  	size=$2
>  
> -	# write the string into the raw marker
> -	make_str $id $size > trace_marker_raw
> +	# Pipe through dd to ensure a single atomic write() syscall.
> +	# Shell's printf builtin uses stdio buffering which may split the
> +	# output into multiple writes.

Could you comment that this is for architectures with 64K pages too.

Thanks for fixing this,

-- Steve

> +	make_str $id $size | dd of=trace_marker_raw bs=`expr $size + 4` iflag=fullblock
>  }
>  
>  


^ permalink raw reply

* Re: [PATCH v2] perf/ftrace: Fix WARNING in __unregister_ftrace_function
From: Steven Rostedt @ 2026-05-27 14:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513161916.04151502@fangorn>

On Wed, 13 May 2026 16:19:16 -0400
Rik van Riel <riel@surriel.com> wrote:

> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index a6bb7577e8c5..58e1b427b576 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -497,7 +497,11 @@ static int perf_ftrace_function_register(struct perf_event *event)
>  static int perf_ftrace_function_unregister(struct perf_event *event)
>  {
>  	struct ftrace_ops *ops = &event->ftrace_ops;
> -	int ret = unregister_ftrace_function(ops);
> +	int ret = 0;
> +

Because this is different than unregister_ftrace_function() where it will
not fail if the ops is not registered, it deserves a comment. Something
like:

	/*
	 * Perf will call this unconditionally even if the ops is not
	 * enabled. The unregister_ftrace_function() will warn if called
	 * when not enabled. Just bypass the unregistering if ops isn't
	 * enabled here.
	 */

Thanks,

-- Steve


> +	if (ops->flags & FTRACE_OPS_FL_ENABLED)
> +		ret = unregister_ftrace_function(ops);
> +
>  	ftrace_free_filter(ops);
>  	return ret;
>  }

^ permalink raw reply

* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-27 14:16 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
	Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
	Ian Rogers
In-Reply-To: <20260527100815.53e55c57@gandalf.local.home>

On Wed, 27 May 2026 10:08:15 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> > this seems to be supported only for argument (pointer) stored in the trace record,
> > not the actual arguments to the tracepoint, is that right?
> > 
> > so I can deref worker from sched.sched_kthread_work_queue_work, like:
> > 
> >   echo 'e:myprobe sched.sched_kthread_work_queue_work (kthread_worker)worker->flags (kthread_work)work->canceling' > dynamic_events  
> 
> Correct, that is because eprobes "e:" works on the output of a trace event.
> 
> 
> > 
> > but I can't deref sched.sched_process_exec p->pid, like:
> > 
> >   # echo 'e:myprobe sched.sched_process_exec (task_struct)p->pid' > dynamic_events
> >   bash: echo: write error: Invalid argument  
> 
> For function prototypes of a tracepoint, you would use a tprobe "t:"
> 
>  # echo 't:exec sched_process_exec pid=p->pid' > dynamic_events


Hmm, this is why I need to write a book ;-)

Thanks for helping with what content I need to add!

-- Steve

^ permalink raw reply

* [PATCH 1/3] rtla/timerlat: Add -A/--aligned CLI option
From: Tomas Glozar @ 2026-05-27 14:49 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

Add a new option, -A/--aligned, that enables timerlat thread alignment
implemented on the kernel-side in commit 4245bf4dc58f ("tracing/osnoise:
Add option to align tlat threads"). The option takes an argument,
representing alignment between timerlat threads in microseconds.

The feature is modeled after the option of the same name in the
cyclictest tool.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---

This patchset depends on "rtla: Migrate to libsubcmd for command line option parsing"
- https://lore.kernel.org/linux-trace-kernel/20260521141833.2353025-1-tglozar@redhat.com/T/
as it uses the new CLI implementation. That in turn depends on the test patches:
- https://lore.kernel.org/linux-trace-kernel/20260423130558.882022-1-tglozar@redhat.com/T/
- https://lore.kernel.org/linux-trace-kernel/20260424140244.958495-1-tglozar@redhat.com/
as it touches tests, as well as a fix touching the CLI:
- https://lore.kernel.org/linux-trace-kernel/20260414185223.65353-1-costa.shul@redhat.com/

 tools/tracing/rtla/src/cli.c      |   2 +
 tools/tracing/rtla/src/cli_p.h    |  17 ++++
 tools/tracing/rtla/src/common.h   |   8 ++
 tools/tracing/rtla/src/osnoise.c  | 149 ++++++++++++++++++++++++++++++
 tools/tracing/rtla/src/osnoise.h  |   6 ++
 tools/tracing/rtla/src/timerlat.c |  18 ++++
 tools/tracing/rtla/src/timerlat.h |   2 +
 7 files changed, 202 insertions(+)

diff --git a/tools/tracing/rtla/src/cli.c b/tools/tracing/rtla/src/cli.c
index 709219341a56..c5279c987531 100644
--- a/tools/tracing/rtla/src/cli.c
+++ b/tools/tracing/rtla/src/cli.c
@@ -248,6 +248,7 @@ struct common_params *timerlat_top_parse_args(int argc, char **argv)
 		RTLA_OPT_USER_THREADS,
 		RTLA_OPT_KERNEL_THREADS,
 		RTLA_OPT_USER_LOAD,
+		TIMERLAT_OPT_ALIGNED,
 
 	OPT_GROUP("Output:"),
 		TIMERLAT_OPT_NANO,
@@ -362,6 +363,7 @@ struct common_params *timerlat_hist_parse_args(int argc, char **argv)
 		RTLA_OPT_USER_THREADS,
 		RTLA_OPT_KERNEL_THREADS,
 		RTLA_OPT_USER_LOAD,
+		TIMERLAT_OPT_ALIGNED,
 
 	OPT_GROUP("Histogram Options:"),
 		HIST_OPT_BUCKET_SIZE,
diff --git a/tools/tracing/rtla/src/cli_p.h b/tools/tracing/rtla/src/cli_p.h
index 3cea4f6e976e..3c939de9abf0 100644
--- a/tools/tracing/rtla/src/cli_p.h
+++ b/tools/tracing/rtla/src/cli_p.h
@@ -447,6 +447,10 @@ static int opt_osnoise_on_end_cb(const struct option *opt, const char *arg, int
 	"set the stack format (truncate, skip, full)", \
 	opt_stack_format_cb)
 
+#define TIMERLAT_OPT_ALIGNED OPT_CALLBACK('A', "aligned", params, "us", \
+	"align thread wakeups to a specific offset", \
+	opt_timerlat_align_cb)
+
 /*
  * Callback functions for command line options for timerlat tools
  */
@@ -608,6 +612,19 @@ static int opt_stack_format_cb(const struct option *opt, const char *arg, int un
 	return 0;
 }
 
+static int opt_timerlat_align_cb(const struct option *opt, const char *arg, int unset)
+{
+	struct timerlat_params *params = opt->value;
+
+	if (unset || !arg)
+		return -1;
+
+	params->timerlat_align = true;
+	params->timerlat_align_us = get_llong_from_str((char *)arg);
+
+	return 0;
+}
+
 /*
  * Macros for command line options specific to histogram-based tools
  */
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 0dfca83bd726..04b287a03f6d 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -51,6 +51,14 @@ struct osnoise_context {
 	/* -1 as init value because 0 is off */
 	int			orig_opt_workload;
 	int			opt_workload;
+
+	/* -1 as init value because 0 is off */
+	int			orig_opt_timerlat_align;
+	int			opt_timerlat_align;
+
+	/* 0 as init value */
+	unsigned long long	orig_timerlat_align_us;
+	unsigned long long	timerlat_align_us;
 };
 
 extern volatile int stop_tracing;
diff --git a/tools/tracing/rtla/src/osnoise.c b/tools/tracing/rtla/src/osnoise.c
index e1e32898af2d..4ff5dad013b1 100644
--- a/tools/tracing/rtla/src/osnoise.c
+++ b/tools/tracing/rtla/src/osnoise.c
@@ -423,6 +423,86 @@ void osnoise_put_timerlat_period_us(struct osnoise_context *context)
 	context->orig_timerlat_period_us = OSNOISE_TIME_INIT_VAL;
 }
 
+/*
+ * osnoise_get_timerlat_align_us - read and save the original "timerlat_align_us"
+ */
+static long long
+osnoise_get_timerlat_align_us(struct osnoise_context *context)
+{
+	long long timerlat_align_us;
+
+	if (context->timerlat_align_us != OSNOISE_OPTION_INIT_VAL)
+		return context->timerlat_align_us;
+
+	if (context->orig_timerlat_align_us != OSNOISE_OPTION_INIT_VAL)
+		return context->orig_timerlat_align_us;
+
+	timerlat_align_us = osnoise_read_ll_config("osnoise/timerlat_align_us");
+	if (timerlat_align_us < 0)
+		goto out_err;
+
+	context->orig_timerlat_align_us = timerlat_align_us;
+	return timerlat_align_us;
+
+out_err:
+	return OSNOISE_OPTION_INIT_VAL;
+}
+
+/*
+ * osnoise_set_timerlat_align_us - set "timerlat_align_us"
+ */
+int osnoise_set_timerlat_align_us(struct osnoise_context *context, long long timerlat_align_us)
+{
+	long long curr_timerlat_align_us = osnoise_get_timerlat_align_us(context);
+	int retval;
+
+	if (curr_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
+		return -1;
+
+	retval = osnoise_write_ll_config("osnoise/timerlat_align_us", timerlat_align_us);
+	if (retval < 0)
+		return -1;
+
+	context->timerlat_align_us = timerlat_align_us;
+
+	return 0;
+}
+
+/*
+ * osnoise_restore_timerlat_align_us - restore "timerlat_align_us"
+ */
+void osnoise_restore_timerlat_align_us(struct osnoise_context *context)
+{
+	int retval;
+
+	if (context->orig_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
+		return;
+
+	if (context->orig_timerlat_align_us == context->timerlat_align_us)
+		goto out_done;
+
+	retval = osnoise_write_ll_config("osnoise/timerlat_align_us",
+				   context->orig_timerlat_align_us);
+	if (retval < 0)
+		err_msg("Could not restore original osnoise timerlat_align_us\n");
+
+out_done:
+	context->timerlat_align_us = OSNOISE_OPTION_INIT_VAL;
+}
+
+/*
+ * osnoise_put_timerlat_align_us - restore original values and cleanup data
+ */
+void osnoise_put_timerlat_align_us(struct osnoise_context *context)
+{
+	osnoise_restore_timerlat_align_us(context);
+
+	if (context->orig_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
+		return;
+
+	context->orig_timerlat_align_us = OSNOISE_OPTION_INIT_VAL;
+}
+
 /*
  * osnoise_get_stop_us - read and save the original "stop_tracing_us"
  */
@@ -908,6 +988,67 @@ static void osnoise_put_workload(struct osnoise_context *context)
 	context->orig_opt_workload = OSNOISE_OPTION_INIT_VAL;
 }
 
+static int osnoise_get_timerlat_align(struct osnoise_context *context)
+{
+	if (context->opt_timerlat_align != OSNOISE_OPTION_INIT_VAL)
+		return context->opt_timerlat_align;
+
+	if (context->orig_opt_timerlat_align != OSNOISE_OPTION_INIT_VAL)
+		return context->orig_opt_timerlat_align;
+
+	context->orig_opt_timerlat_align = osnoise_options_get_option("TIMERLAT_ALIGN");
+
+	return context->orig_opt_timerlat_align;
+}
+
+int osnoise_set_timerlat_align(struct osnoise_context *context, bool onoff)
+{
+	int opt_timerlat_align = osnoise_get_timerlat_align(context);
+	int retval;
+
+	if (opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
+		return -1;
+
+	if (opt_timerlat_align == onoff)
+		return 0;
+
+	retval = osnoise_options_set_option("TIMERLAT_ALIGN", onoff);
+	if (retval < 0)
+		return -2;
+
+	context->opt_timerlat_align = onoff;
+
+	return 0;
+}
+
+static void osnoise_restore_timerlat_align(struct osnoise_context *context)
+{
+	int retval;
+
+	if (context->orig_opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
+		return;
+
+	if (context->orig_opt_timerlat_align == context->opt_timerlat_align)
+		goto out_done;
+
+	retval = osnoise_options_set_option("TIMERLAT_ALIGN", context->orig_opt_timerlat_align);
+	if (retval < 0)
+		err_msg("Could not restore original TIMERLAT_ALIGN option\n");
+
+out_done:
+	context->orig_opt_timerlat_align = OSNOISE_OPTION_INIT_VAL;
+}
+
+static void osnoise_put_timerlat_align(struct osnoise_context *context)
+{
+	osnoise_restore_timerlat_align(context);
+
+	if (context->orig_opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
+		return;
+
+	context->orig_opt_timerlat_align = OSNOISE_OPTION_INIT_VAL;
+}
+
 enum {
 	FLAG_CONTEXT_NEWLY_CREATED	= (1 << 0),
 	FLAG_CONTEXT_DELETED		= (1 << 1),
@@ -960,6 +1101,12 @@ struct osnoise_context *osnoise_context_alloc(void)
 	context->orig_opt_workload	= OSNOISE_OPTION_INIT_VAL;
 	context->opt_workload		= OSNOISE_OPTION_INIT_VAL;
 
+	context->orig_opt_timerlat_align	= OSNOISE_OPTION_INIT_VAL;
+	context->opt_timerlat_align		= OSNOISE_OPTION_INIT_VAL;
+
+	context->orig_timerlat_align_us	= OSNOISE_OPTION_INIT_VAL;
+	context->timerlat_align_us	= OSNOISE_OPTION_INIT_VAL;
+
 	osnoise_get_context(context);
 
 	return context;
@@ -988,6 +1135,8 @@ void osnoise_put_context(struct osnoise_context *context)
 	osnoise_put_tracing_thresh(context);
 	osnoise_put_irq_disable(context);
 	osnoise_put_workload(context);
+	osnoise_put_timerlat_align(context);
+	osnoise_put_timerlat_align_us(context);
 
 	free(context);
 }
diff --git a/tools/tracing/rtla/src/osnoise.h b/tools/tracing/rtla/src/osnoise.h
index 168669aa7e0d..340ff5a64e6e 100644
--- a/tools/tracing/rtla/src/osnoise.h
+++ b/tools/tracing/rtla/src/osnoise.h
@@ -49,6 +49,12 @@ void osnoise_restore_print_stack(struct osnoise_context *context);
 int osnoise_set_print_stack(struct osnoise_context *context,
 			    long long print_stack);
 
+int osnoise_set_timerlat_align_us(struct osnoise_context *context,
+				  long long timerlat_align_us);
+void osnoise_restore_timerlat_align_us(struct osnoise_context *context);
+
+int osnoise_set_timerlat_align(struct osnoise_context *context, bool onoff);
+
 int osnoise_set_irq_disable(struct osnoise_context *context, bool onoff);
 void osnoise_report_missed_events(struct osnoise_tool *tool);
 int osnoise_apply_config(struct osnoise_tool *tool, struct osnoise_params *params);
diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
index f990c8365776..169aa9a6569d 100644
--- a/tools/tracing/rtla/src/timerlat.c
+++ b/tools/tracing/rtla/src/timerlat.c
@@ -77,6 +77,24 @@ timerlat_apply_config(struct osnoise_tool *tool, struct timerlat_params *params)
 		goto out_err;
 	}
 
+	retval = osnoise_set_timerlat_align(tool->context, params->timerlat_align);
+	if (retval && params->timerlat_align) {
+		/*
+		 * We might be running on a kernel that does not support timerlat align.
+		 * Unless user requested it explicitly, ignore the error.
+		 */
+		err_msg("Failed to enable timerlat align\n");
+		goto out_err;
+	}
+
+	if (params->timerlat_align) {
+		retval = osnoise_set_timerlat_align_us(tool->context, params->timerlat_align_us);
+		if (retval) {
+			err_msg("Failed to set timerlat align us\n");
+			goto out_err;
+		}
+	}
+
 	/*
 	 * If the user did not specify a type of thread, try user-threads first.
 	 * Fall back to kernel threads otherwise.
diff --git a/tools/tracing/rtla/src/timerlat.h b/tools/tracing/rtla/src/timerlat.h
index 38ab6b41a15e..84ec6d778183 100644
--- a/tools/tracing/rtla/src/timerlat.h
+++ b/tools/tracing/rtla/src/timerlat.h
@@ -31,6 +31,8 @@ struct timerlat_params {
 	enum timerlat_tracing_mode mode;
 	const char		*bpf_action_program;
 	enum stack_format	stack_format;
+	bool			timerlat_align;
+	unsigned long long	timerlat_align_us;
 };
 
 #define to_timerlat_params(ptr) container_of(ptr, struct timerlat_params, common)
-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/3] rtla/tests: Add unit tests for -A/--aligned option
From: Tomas Glozar @ 2026-05-27 14:49 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260527144928.2944472-1-tglozar@redhat.com>

Add both parse_args() and opt_* tests for the newly added -A/--aligned
option.

Assisted-by: Claude:claude-4.5-opus-high-thinking
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 .../rtla/tests/unit/cli_opt_callback.c        | 12 +++++++++++
 .../rtla/tests/unit/timerlat_hist_cli.c       | 20 +++++++++++++++++++
 .../rtla/tests/unit/timerlat_top_cli.c        | 20 +++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/tools/tracing/rtla/tests/unit/cli_opt_callback.c b/tools/tracing/rtla/tests/unit/cli_opt_callback.c
index 01647f4227d1..4a406af42821 100644
--- a/tools/tracing/rtla/tests/unit/cli_opt_callback.c
+++ b/tools/tracing/rtla/tests/unit/cli_opt_callback.c
@@ -545,6 +545,17 @@ START_TEST(test_opt_nano_cb)
 }
 END_TEST
 
+START_TEST(test_opt_timerlat_align_cb)
+{
+	struct timerlat_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_timerlat_align_cb);
+
+	ck_assert_int_eq(opt_timerlat_align_cb(&opt, "500", 0), 0);
+	ck_assert(params.timerlat_align);
+	ck_assert_int_eq(params.timerlat_align_us, 500);
+}
+END_TEST
+
 START_TEST(test_opt_stack_format_cb)
 {
 	int stack_format = 0;
@@ -689,6 +700,7 @@ Suite *cli_opt_callback_suite(void)
 	tcase_add_test(tc, test_opt_nano_cb);
 	tcase_add_test(tc, test_opt_stack_format_cb);
 	tcase_add_exit_test(tc, test_opt_stack_format_cb_invalid, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_timerlat_align_cb);
 	suite_add_tcase(s, tc);
 
 	tc = tcase_create("histogram");
diff --git a/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c b/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
index 81dc04596cd1..968bf962f53f 100644
--- a/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
+++ b/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
@@ -373,6 +373,24 @@ START_TEST(test_user_threads_long)
 }
 END_TEST
 
+START_TEST(test_aligned_short)
+{
+	PARSE_ARGS("timerlat", "hist", "-A", "500");
+
+	ck_assert(tlat_params->timerlat_align);
+	ck_assert_int_eq(tlat_params->timerlat_align_us, 500);
+}
+END_TEST
+
+START_TEST(test_aligned_long)
+{
+	PARSE_ARGS("timerlat", "hist", "--aligned", "500");
+
+	ck_assert(tlat_params->timerlat_align);
+	ck_assert_int_eq(tlat_params->timerlat_align_us, 500);
+}
+END_TEST
+
 /* Histogram Options */
 
 START_TEST(test_bucket_size_short)
@@ -654,6 +672,8 @@ Suite *timerlat_hist_cli_suite(void)
 	tcase_add_test(tc, test_user_load_long);
 	tcase_add_test(tc, test_user_threads_short);
 	tcase_add_test(tc, test_user_threads_long);
+	tcase_add_test(tc, test_aligned_short);
+	tcase_add_test(tc, test_aligned_long);
 	suite_add_tcase(s, tc);
 
 	tc = tcase_create("histogram_options");
diff --git a/tools/tracing/rtla/tests/unit/timerlat_top_cli.c b/tools/tracing/rtla/tests/unit/timerlat_top_cli.c
index 1c39008564c5..33aa6588d503 100644
--- a/tools/tracing/rtla/tests/unit/timerlat_top_cli.c
+++ b/tools/tracing/rtla/tests/unit/timerlat_top_cli.c
@@ -373,6 +373,24 @@ START_TEST(test_user_threads_long)
 }
 END_TEST
 
+START_TEST(test_aligned_short)
+{
+	PARSE_ARGS("timerlat", "top", "-A", "500");
+
+	ck_assert(tlat_params->timerlat_align);
+	ck_assert_int_eq(tlat_params->timerlat_align_us, 500);
+}
+END_TEST
+
+START_TEST(test_aligned_long)
+{
+	PARSE_ARGS("timerlat", "top", "--aligned", "500");
+
+	ck_assert(tlat_params->timerlat_align);
+	ck_assert_int_eq(tlat_params->timerlat_align_us, 500);
+}
+END_TEST
+
 /* Output */
 
 START_TEST(test_nano_short)
@@ -596,6 +614,8 @@ Suite *timerlat_top_cli_suite(void)
 	tcase_add_test(tc, test_user_load_long);
 	tcase_add_test(tc, test_user_threads_short);
 	tcase_add_test(tc, test_user_threads_long);
+	tcase_add_test(tc, test_aligned_short);
+	tcase_add_test(tc, test_aligned_long);
 	suite_add_tcase(s, tc);
 
 	tc = tcase_create("output");
-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/3] Documentation/rtla: Add -A/--aligned option
From: Tomas Glozar @ 2026-05-27 14:49 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260527144928.2944472-1-tglozar@redhat.com>

Cover the newly added -A/--aligned option that aligns timerlat threads
using the corresponding feature of the timerlat tracer.

A note is added to clarify what alignment means, similar to the note in
the tracer implementation in commit 4245bf4dc58f ("tracing/osnoise: Add
option to align tlat threads").

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 Documentation/tools/rtla/common_timerlat_options.txt | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/Documentation/tools/rtla/common_timerlat_options.txt b/Documentation/tools/rtla/common_timerlat_options.txt
index ab159b2cbfe7..0a01fbc93ee1 100644
--- a/Documentation/tools/rtla/common_timerlat_options.txt
+++ b/Documentation/tools/rtla/common_timerlat_options.txt
@@ -95,3 +95,15 @@
         * **full**        Print the entire stack trace, including unknown addresses.
 
         For unknown addresses, the raw pointer is printed.
+
+**-A**, **--aligned** *us*
+
+        Align wake-up of timerlat threads to a set offset in microseconds.
+
+        The alignment will be applied when the threads wake up at the start of tracing while
+        the timer for the first cycle is armed. Each thread sets its timer to the wake-up time
+        of the previous thread plus the alignment.
+
+        This option may be used with any non-negative argument, including zero, which will
+        align threads so that they wake up all at the same time.
+
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3] perf/ftrace: Fix WARNING in __unregister_ftrace_function
From: Rik van Riel @ 2026-05-27 15:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, kernel-team,
	linux-trace-kernel

perf_ftrace_function_unregister() unconditionally calls
unregister_ftrace_function() without checking whether the ftrace_ops
was ever successfully registered. This triggers a WARN_ON in
__unregister_ftrace_function() when the ops doesn't have
FTRACE_OPS_FL_ENABLED set.

This can happen during perf_event_alloc() error cleanup when
perf_trace_destroy() is called via __free_event() on an event whose
ftrace_ops registration failed or was already torn down by
perf_try_init_event()'s err_destroy path.

The call path is:
  perf_event_alloc() error cleanup
    -> __free_event()
      -> event->destroy() [tp_perf_event_destroy]
        -> perf_trace_destroy()
          -> perf_trace_event_close()
            -> TRACE_REG_PERF_CLOSE
              -> perf_ftrace_function_unregister()
                -> unregister_ftrace_function()
                  -> __unregister_ftrace_function()
                    -> WARN_ON(!(ops->flags & FTRACE_OPS_FL_ENABLED))

Fix this by checking FTRACE_OPS_FL_ENABLED before attempting to
unregister. If the ops is not enabled, just free the filter and
return success.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/trace/trace_event_perf.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..5b272856e5ab 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -497,7 +497,17 @@ static int perf_ftrace_function_register(struct perf_event *event)
 static int perf_ftrace_function_unregister(struct perf_event *event)
 {
 	struct ftrace_ops *ops = &event->ftrace_ops;
-	int ret = unregister_ftrace_function(ops);
+	int ret = 0;
+
+	/*
+	 * Perf will call this unconditionally even if the ops is not
+	 * enabled. The unregister_ftrace_function() will warn if called
+	 * when not enabled. Just bypass the unregistering if ops isn't
+	 * enabled here.
+	 */
+	if (ops->flags & FTRACE_OPS_FL_ENABLED)
+		ret = unregister_ftrace_function(ops);
+
 	ftrace_free_filter(ops);
 	return ret;
 }
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Steven Rostedt @ 2026-05-27 15:30 UTC (permalink / raw)
  To: Wang Han
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Masami Hiramatsu, Mark Rutland, Catalin Marinas, Chen Pei,
	Andy Chiu, Björn Töpel, Deepak Gupta, Puranjay Mohan,
	Conor Dooley, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
	Petr Mladek, Joe Lawrence, Shuah Khan, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260527123530.2593918-2-wanghan@linux.alibaba.com>

On Wed, 27 May 2026 20:35:23 +0800
Wang Han <wanghan@linux.alibaba.com> wrote:

> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>  scripts/sorttable.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/sorttable.c b/scripts/sorttable.c
> index e8ed11c680c6..b4061c2c03e1 100644
> --- a/scripts/sorttable.c
> +++ b/scripts/sorttable.c
> @@ -901,11 +901,17 @@ static int do_file(char const *const fname, void *addr)
>  		/* fallthrough */
>  	case EM_386:
>  	case EM_LOONGARCH:
> -	case EM_RISCV:
>  	case EM_S390:
>  	case EM_X86_64:
>  		custom_sort = sort_relative_table_with_data;
>  		break;
> +	case EM_RISCV:
> +#ifdef MCOUNT_SORT_ENABLED
> +		/* RISC-V uses patchable function entries before function entry. */
> +		before_func = 8;
> +#endif
> +		custom_sort = sort_relative_table_with_data;
> +		break;
>  	case EM_PARISC:
>  	case EM_PPC:
>  	case EM_PPC64:

So basically RISCV has the same problem as ARM64 with patchable
entries. As this may happen for other archs in the future, I would like
to group them together like this:

diff --git a/scripts/sorttable.c b/scripts/sorttable.c
index e8ed11c680c6..b3d9073d9fbc 100644
--- a/scripts/sorttable.c
+++ b/scripts/sorttable.c
@@ -891,17 +891,23 @@ static int do_file(char const *const fname, void *addr)
 	table_sort_t custom_sort = NULL;
 
 	switch (elf_map_machine(ehdr)) {
-	case EM_AARCH64:
 #ifdef MCOUNT_SORT_ENABLED
+	case EM_AARCH64:
 		sort_reloc = true;
 		rela_type = 0x403;
-		/* arm64 uses patchable function entry placing before function */
+		/*
+		 * arm64 and RISCV use patchable function entry placing
+		 * before function
+		 */
+	case RISCV:
 		before_func = 8;
+#else
+	case EM_AARCH64:
+	case RISCV:
 #endif
 		/* fallthrough */
 	case EM_386:
 	case EM_LOONGARCH:
-	case EM_RISCV:
 	case EM_S390:
 	case EM_X86_64:
 		custom_sort = sort_relative_table_with_data;

does the above work for you? (Although I didn't even compile test it).

-- Steve

^ permalink raw reply related

* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-27 15:35 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgEZ9vCKkoMC11tVrueAonGWH2x6OeaYYxXGEj2gwHUaKw@mail.gmail.com>

Ackerley Tng <ackerleytng@google.com> writes:

>
> [...snip...]
>
>>
>> Hmm, I wonder if we can figure out a way to consolidate some documentation,
>> because this is _exactly_ the same pattern that x86's host_pfn_mapping_level()
>> deals with (see its big comment below).
>>
>
> This would be great, are you thinking an actual comment or something in
> Documentation/?
>
> Perhaps we could iterate on this a little with me providing the newbie
> perspective. Do you want me to take a stab at writing something up?
>

Please see https://lore.kernel.org/all/20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com/T/!

>>
>> [...snip...]
>>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox