* Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()
From: Tengda Wu @ 2026-06-09 11:12 UTC (permalink / raw)
To: Petr Mladek
Cc: Masami Hiramatsu, Peter Zijlstra, Steven Rostedt,
Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
linux-kernel, live-patching
In-Reply-To: <aifgPaLc_p2vZr5n@pathway.suse.cz>
On 2026/6/9 17:43, Petr Mladek wrote:
> Added live-patching mailing list.
>
> On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
>> The current check in rethook_find_ret_addr() prevents obtaining a return
>> address when the target task is marked as running. However, this condition
>> is both insufficient for correctness and unnecessary for its intended
>> purpose.
>>
>> The check is inherently racy: a task can begin running on another CPU
>> immediately after task_is_running() returns false, potentially leading to
>> concurrent modification of rethook data structures while the iteration is
>> in progress.
>>
>> Rather than trying to fix this unreliable check deep in the unwinding
>> path, simply remove it. The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
>> even if the iteration goes off the rails and returns invalid information,
>> it will not crash. Callers that require consistency must provide a safe
>> context themselves.
>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
>> ---
>> v3: Improve commit message: clarify safety semantics and document that RCU guarantees no crash.
>> v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
>> v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/
>>
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>> if (WARN_ON_ONCE(!cur))
>> return 0;
>>
>> - if (tsk != current && task_is_running(tsk))
>> - return 0;
>> -
>
> The description of the function should be updated as well. It still
> mentions:
>
> * The @tsk must be 'current' or a task which is not running.
>
> Instead it should explain that it safe to call the function even
> on another running tasks but the returned address is not reliable
> then.
>
Oh, I forgot that. Thanks for pointing it out.
>> do {
>> ret = __rethook_find_ret_addr(tsk, cur);
>> if (!ret)
>
> I am still a bit concerned about the motivation.
>
> Tengda mentioned at
> https://lore.kernel.org/all/679a1c8f-1e4d-4ae5-83e1-d0068e6de1a6@huaweicloud.com/
> that they tried to verify livepatching:
>
> <paste>
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
>
> We construct a function foo() that calls bar():
>
> void bar(void)
> {
> for (;;) {
> schedule();
> }
> }
>
> void foo(void)
> {
> bar();
> }
>
> A kretprobe is attached to bar():
>
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
>
> After that, the live patch is activated to attempt replacing the implementation
> of foo(). The expectation is that this should succeed.
>
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
>
> **Not sure if this is correct:**
>
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
> </paste>
>
> More background:
> ----------------
>
> The test is artificial because it keeps the RUNNING state before
> calling schedule, see
> https://lore.kernel.org/all/20260608093449.GH4149641@noisy.programming.kicks-ass.net/
>
>
>
> My questions:
>
> Does this patch allows to livepatch the above mentioned test code?
At least it will no longer be blocked by the rethook check.
> Is the livepatching safe?
> Does it help in another scenarios?
>
>
> My opinion:
>
> The livepatching might be safe only when the process is migrating
> itself. I mean that it might be safe even when it is RUNNING as long
> at it is _current_.
>
> I agree that we do not need to enforce this in rethook_find_ret_addr()
> if the function is used also in other scenarios, for example, by
> ftrace/BTF for taking snapshots of other processes.
>
> But we need to make sure that the backtrace is reliable when
> livepatching (migrating) the task.
>
> Best Regards,
> Petr
Peter actually touched on this point earlier:
<paste>
Things like live-patch use task_call_func(), which ensures the callback
function is done while holding sufficient locks for the task to not
change state.
</paste>
From my understanding, removing this check should not introduce
stack reliability issues for livepatch.
Thanks,
Tengda
^ permalink raw reply
* Re: [PATCH] samples/ftrace: reject zero ftrace-ops call count
From: Samuel Moelius @ 2026-06-09 11:26 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mark Rutland, open list:FUNCTION HOOKS (FTRACE),
open list:FUNCTION HOOKS (FTRACE)
In-Reply-To: <20260608210324.6af37e65@gandalf.local.home>
On Mon, Jun 8, 2026 at 9:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 9 Jun 2026 00:44:23 +0000
> Samuel Moelius <sam.moelius@trailofbits.com> wrote:
>
> > The ftrace-ops sample exposes nr_function_calls as a module parameter
> > and uses it as the divisor when printing the measured time per call.
> > Loading the module with nr_function_calls=0 skips the benchmark loop and
> > then divides the elapsed time by zero, crashing the kernel during sample
> > module initialization.
>
> This change is rather pointless, but whatever.
>
> >
> > Reject a zero call count before registering any ftrace ops.
> >
> > Assisted-by: Codex:gpt-5.5-cyber-preview
> > Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
> > ---
> > samples/ftrace/ftrace-ops.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/samples/ftrace/ftrace-ops.c b/samples/ftrace/ftrace-ops.c
> > index 68d6685c80bd..d5adaa61484f 100644
> > --- a/samples/ftrace/ftrace-ops.c
> > +++ b/samples/ftrace/ftrace-ops.c
> > @@ -190,6 +190,11 @@ static int __init ftrace_ops_sample_init(void)
> > tracer_irrelevant = ops_func_count;
> > }
> >
> > + if (!nr_function_calls) {
> > + pr_err("nr_function_calls must be non-zero\n");
> > + return -EINVAL;
>
> No need to print that the admin did something stupid.
>
> > + }
> > +
> > pr_info("registering:\n"
> > " relevant ops: %u\n"
> > " tracee: %ps\n"
>
> In fact, I would just change the output to be:
>
> pr_info("Attempted %u calls to %ps in %lluns (%lluns / call)\n",
> nr_function_calls, tracee_relevant,
> period, nr_function_calls ? div_u64(period, nr_function_calls) : -1LL);
>
> and have garbage in, garbage out.
Is it okay to keep the same subject line or should I change it?
^ permalink raw reply
* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-06-09 11:44 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <CAEf4BzZiFUeFa4fdghuSGpqiEzyvJVxs6TDgJd5U4hSj--imiQ@mail.gmail.com>
On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> >
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call, like:
> >
> > lea -0x80(%rsp), %rsp
> > call tramp
> >
> > Note the lea instruction is used to adjust the rsp register without
> > changing the flags.
> >
> > We use nop10 and following transformation to optimized instructions
> > above and back as suggested by Peterz [2].
> >
> > Optimize path (int3_update_optimize):
> >
> > 1) Initial state after set_swbp() installed the uprobe:
> > cc 2e 0f 1f 84 00 00 00 00 00
> >
> > From offset 0 this is INT3 followed by the tail of the original
> > 10-byte NOP.
> >
> > After a previous unoptimization bytes 5..9 may still contain the
> > old call instruction, which remains valid for threads already there.
> >
> > 2) Rewrite the LEA tail and call displacement:
> > cc [8d 64 24 80 e8 d0 d1 d2 d3]
> >
> > From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
> > executable entry points while byte 0 is trapped.
> >
> > 3) Publish the first LEA byte:
> > [48] 8d 64 24 80 e8 d0 d1 d2 d3
> >
> > From offset 0 this is:
> > lea -0x80(%rsp), %rsp
> > call <uprobe-trampoline>
> >
> > Unoptimize path (int3_update_unoptimize):
> >
> > 1) Initial optimized state:
> > 48 8d 64 24 80 e8 d0 d1 d2 d3
> > Same as 3) above.
> >
> > 2) Trap new entries before restoring the NOP bytes:
> > [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> >
> > From offset 0 this traps. A thread that had already executed the
> > LEA can still reach the intact CALL at offset 5.
> >
> > 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > and byte 5 as CALL.
> > cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> >
> > From offset 0 this still traps. Offset 5 is still the CALL for any
> > thread that was already past the first LEA byte.
> >
> > 4) Publish the first byte of the original NOP:
> > [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> >
> > From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > displacement are now only NOP operands. Offset 5 still decodes as
> > CALL for a thread that was already there.
> >
> > Tthere is only a single target uprobe-trampoline for the given nop10
> > instruction address, so the CALL instruction will not be changed across
> > unoptimization/optimization cycles.
> > Therefore, any task that is preempted at the CALL instruction is guaranteed
> > to observe that CALL and not anything else.
> >
> > Note as explained in [2] we need to use following nop10:
> > PF1 PF2 ESC NOPL MOD SIB DISP32
> > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> >
> > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > attribute in is_prefix_bad function.
> >
> > Also changing the uprobe syscall error when called out of uprobe
> > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> >
> > The optimized uprobe performance stays the same:
> >
> > uprobe-nop : 3.129 ± 0.013M/s
> > uprobe-push : 3.045 ± 0.006M/s
> > uprobe-ret : 1.095 ± 0.004M/s
> > --> uprobe-nop10 : 7.170 ± 0.020M/s
> > uretprobe-nop : 2.143 ± 0.021M/s
> > uretprobe-push : 2.090 ± 0.000M/s
> > uretprobe-ret : 0.942 ± 0.000M/s
> > --> uretprobe-nop10: 3.381 ± 0.003M/s
> > usdt-nop : 3.245 ± 0.004M/s
> > --> usdt-nop10 : 7.256 ± 0.023M/s
> >
> > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Assisted-by: Codex:GPT-5.5
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> > arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > 1 file changed, 190 insertions(+), 65 deletions(-)
> >
>
> [...]
>
> > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > smp_text_poke_sync_each_cpu();
> >
> > /*
> > - * Write first byte.
> > + * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > + * and byte 5 as CALL:
> > + * cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > + */
> > + ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > + err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > + LEA_INSN_SIZE - 1, verify_insn,
> > + true /* is_register */, false /* do_update_ref_ctr */,
>
> tbh, it's quite subtle and non-obvious why is_register should be set
> to true first two times (and especially that is_register and
> do_update_ref_ctr are implicitly connected), not sure how to make it
> cleaner, but maybe leave a short comment explaining this twice
> register, once unregister sequence?
ok, I came up with comment below
thanks,
jirka
---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index de544516ea70..92449f34c005 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
int err;
/*
+ * Note the first two uprobe_write calls use is_register=true, because they
+ * are intermediate patching states while the probe is still active.
+ *
+ * The last uprobe_write to nop10 instruction is called with is_register=false
+ * and do_update_ref_ctr=true to trigger the refctr update.
+ *
* 1) Initial optimized state:
* 48 8d 64 24 80 e8 d0 d1 d2 d3
*
^ permalink raw reply related
* Re: [PATCH v2] rtla: Stop the record trace on interrupt
From: Wander Lairson Costa @ 2026-06-09 12:05 UTC (permalink / raw)
To: Crystal Wood
Cc: Tomas Glozar, Steven Rostedt, linux-trace-kernel, John Kacur,
Costa Shulyupin
In-Reply-To: <20260512173731.2151841-1-crwood@redhat.com>
On Tue, May 12, 2026 at 12:37:31PM -0500, Crystal Wood wrote:
> Before, when rtla got a signal, it stopped the main trace but not the
> record trace. With "--on-end trace", this can lead to
> save_trace_to_file() failing to keep up, especially on a debug kernel.
> Plus, it adds post-stoppage noise to the trace file.
>
> Signed-off-by: Crystal Wood <crwood@redhat.com>
> ---
> v2: clarify that this matters for --on-end trace
>
> tools/tracing/rtla/src/common.c | 19 +++++++++++--------
> tools/tracing/rtla/src/common.h | 1 -
> tools/tracing/rtla/src/timerlat.c | 2 +-
> 3 files changed, 12 insertions(+), 10 deletions(-)
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
^ permalink raw reply
* Re: [PATCH v2] rtla: Document tests in README
From: Wander Lairson Costa @ 2026-06-09 12:07 UTC (permalink / raw)
To: Tomas Glozar
Cc: Steven Rostedt, John Kacur, Luis Goncalves, Crystal Wood,
Costa Shulyupin, LKML, linux-trace-kernel
In-Reply-To: <20260514073038.204428-1-tglozar@redhat.com>
On Thu, May 14, 2026 at 09:30:38AM +0200, Tomas Glozar wrote:
> RTLA tests are not documented anywhere. Mention both runtime and unit
> tests in the README, with instructions on how to run them and a list of
> dependencies and required system configuration.
>
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---
> v2: Add package hints for common distros for Test::Harness (suggested by
> Crystal Wood).
>
> v1: https://lore.kernel.org/linux-trace-kernel/20260423130759.882247-1-tglozar@redhat.com
>
> tools/tracing/rtla/README.txt | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
^ permalink raw reply
* Re: [PATCH v2 0/6] rtla: Migrate to libsubcmd for command line option parsing
From: Wander Lairson Costa @ 2026-06-09 12:10 UTC (permalink / raw)
To: Tomas Glozar
Cc: Steven Rostedt, John Kacur, Luis Goncalves, Crystal Wood,
Costa Shulyupin, Ivan Pravdin, Namhyung Kim, Ian Rogers,
Arnaldo Carvalho de Melo, LKML, linux-trace-kernel,
linux-perf-users
In-Reply-To: <20260521141833.2353025-1-tglozar@redhat.com>
On Thu, May 21, 2026 at 04:18:27PM +0200, Tomas Glozar wrote:
> [ CC to linux-perf-users for the libsubcmd code changes ]
>
> rtla currently uses its own implementation that uses getopt_long() to
> parse command-line arguments.
>
> Migrate rtla to use libsubcmd for command line argument parsing,
> similarly to what is already done by other tools like perf, bpftool,
> and objtool. Among other benefits, this allows help messages to be
> generated automatically rather than having to be typed out manually
> for each tool.
>
> libsubcmd is extended with a flag to parse optarg from separate
> argument if a new flag is turned on. Without the flag, the old behavior
> is preserved. That keeps the parsing working for tools that use
> positional arguments, and allows RTLA to keep its flexible syntax for -C
> and -t options and their long variants, --cgroup and --trace-output.
> Another flag is added to disable automatic definition of --no-xy for
> every option --xy and vice versa, which overlaps for RTLA's --irq and
> --thread options.
>
> The new implementation is moved into a separate file, cli.c, together
> with a tiny header counterpart, cli.h. This helps separate the parsing
> logic, which has little in common with the rest of RTLA, in a separate
> module. Another new file, cli_p.h, is used as a private header to contain
> macros and static function declarations that are also used by unit tests
> next to cli.c, but should not be imported from elsewhere.
>
> Macros to generate struct option array fields for libsubcmd's
> parse_args() are used to preserve the consolidation of argument parsing
> code across different RTLA tools. Kernel and user threads are, as
> an exception, treated as common, although they are currently implemented
> for timerlat only, in line with earlier consolidation changes.
>
> The test suite is expanded to include two levels of unit tests, one testing
> the already existing tool_parse_args() functions, one tests option callbacks,
> which are a new level of the CLI parser added in this patchset. This helps
> to verify that no regressions are caused by this refactoring.
>
> I expect more improvements to the code being possible in the future,
> like creating macros for option groups to further deduplicate the code,
> reducing the amount of extra code in the _parse_args() functions, or
> implementing support for unsetting options (which is currently only
> supported for those that do not use a custom callback).
>
> v2 changes:
> - Two unit test suites are added to cover regressions, after several
> options were broken by the v1. The test suites cover all parsing
> issues reported in the v1 as well as those found during additional
> testing.
> - Return value of all paths that print help, including those that
> are handled in RTLA, are set to libsubcmd's help exit code of 129.
> Previously, only the tool help returned 129. While some other tools
> (e.g. bpftool) do that, RTLA unifies those for consistency. The return
> value is also added to the corresponding section in documentation.
> - Incorrect parsing of --no-irq and --no-thread is fixed using a newly
> added libsubcmd option flag.
> - Incorrect parsing of -n and -u timerlat options (which erroneously
> required an argument in v1) is fixed.
> - A now stale declaration of removed function common_parse_options()
> is removed from common.h.
> - Segmentation fault on abbreviated --help (e.g. --he) is fixed.
> - Incorrect formatting of OPT_END macro (spurious tab) is fixed.
> - All opt_* callbacks now reject unimplemented unset (--no-) correctly.
> - opt_trigger_cb() and opt_filter_cb() now take only the required events
> field, not the whole params structure.
> - Spurious opt_osnoise_threshold_cb() which is actually just a wrapper for
> opt_llong_callback() is removed.
> - Old off-by-one typo is fixed in --dma-latency and -E/--entries error
> messages, to make it consistent with the newly added unit tests.
> - Fix a bug in Makefile that defined LIBSUBCMD_INCLUDES as -I... and then
> used it as a target name: define it as the path only, and then add -I$(...)
> to CFLAGS, as this is the only include path that is generated during the build
> itself.
>
> v1: https://lore.kernel.org/linux-trace-kernel/20260320150651.51057-1-tglozar@redhat.com/T
>
> Tomas Glozar (6):
> rtla: Add libsubcmd dependency
> tools subcmd: support optarg as separate argument
> tools subcmd: allow parsing distinct --opt and --no-opt
> rtla: Parse cmdline using libsubcmd
> rtla/tests: Add unit tests for _parse_args() functions
> rtla/tests: Add unit tests for CLI option callbacks
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
^ permalink raw reply
* Re: [PATCH v3 0/6] rtla: Migrate to libsubcmd for command line option parsing
From: Wander Lairson Costa @ 2026-06-09 12:12 UTC (permalink / raw)
To: Tomas Glozar
Cc: Steven Rostedt, John Kacur, Luis Goncalves, Crystal Wood,
Costa Shulyupin, Ivan Pravdin, Namhyung Kim, Ian Rogers,
Arnaldo Carvalho de Melo, LKML, linux-trace-kernel,
linux-perf-users
In-Reply-To: <20260528103254.2990068-1-tglozar@redhat.com>
On Thu, May 28, 2026 at 12:32:48PM +0200, Tomas Glozar wrote:
> [ CC to linux-perf-users for the libsubcmd code changes ]
>
> rtla currently uses its own implementation that uses getopt_long() to
> parse command-line arguments.
>
> Migrate rtla to use libsubcmd for command line argument parsing,
> similarly to what is already done by other tools like perf, bpftool,
> and objtool. Among other benefits, this allows help messages to be
> generated automatically rather than having to be typed out manually
> for each tool.
>
> libsubcmd is extended with a flag to parse optarg from separate
> argument if a new flag is turned on. Without the flag, the old behavior
> is preserved. That keeps the parsing working for tools that use
> positional arguments, and allows RTLA to keep its flexible syntax for -C
> and -t options and their long variants, --cgroup and --trace-output.
> Another flag is added to disable automatic definition of --no-xy for
> every option --xy and vice versa, which overlaps for RTLA's --irq and
> --thread options.
>
> The new implementation is moved into a separate file, cli.c, together
> with a tiny header counterpart, cli.h. This helps separate the parsing
> logic, which has little in common with the rest of RTLA, in a separate
> module. Another new file, cli_p.h, is used as a private header to contain
> macros and static function declarations that are also used by unit tests
> next to cli.c, but should not be imported from elsewhere.
>
> Macros to generate struct option array fields for libsubcmd's
> parse_args() are used to preserve the consolidation of argument parsing
> code across different RTLA tools. Kernel and user threads are, as
> an exception, treated as common, although they are currently implemented
> for timerlat only, in line with earlier consolidation changes.
>
> The test suite is expanded to include two levels of unit tests, one testing
> the already existing tool_parse_args() functions, one tests option callbacks,
> which are a new level of the CLI parser added in this patchset. This helps
> to verify that no regressions are caused by this refactoring.
>
> I expect more improvements to the code being possible in the future,
> like creating macros for option groups to further deduplicate the code,
> reducing the amount of extra code in the _parse_args() functions, or
> implementing support for unsetting options (which is currently only
> supported for those that do not use a custom callback).
>
> Base commit:
> - https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git/commit/?h=rtla-for-next&id=f03a59f949176ce4312cb466245d1243aaf40389
>
> Dependencies:
> - https://lore.kernel.org/linux-trace-kernel/20260423130558.882022-1-tglozar@redhat.com/T/
> - https://lore.kernel.org/linux-trace-kernel/20260424140244.958495-1-tglozar@redhat.com/
> - https://lore.kernel.org/linux-trace-kernel/20260414185223.65353-1-costa.shul@redhat.com/
> (apply in reverse order, alternatively, use base commit above)
>
> v3 changes (all in cover letter or first commit):
> - Add FORCE to all targets that feature a make subcommand to ensure changes
> in the dependency will trigger rebuild correctly.
> - Convert all dependencies on directory targets LIBSUBCMD_OUTPUT and
> LIB_OUTPUT into order-only prerequisites to prevent modifications of
> the directory content triggering rebuild of targets inside it.
> - Properly mention depedencies of the patchset in the cover letter, as well
> as the base commit.
>
> v2: https://lore.kernel.org/linux-trace-kernel/20260521141833.2353025-1-tglozar@redhat.com/T/
>
Oops. I mistakenly added the reviewed-by tag to the v2 patch series.
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
^ permalink raw reply
* Re: [PATCH v3 05/13] verification/rvgen: Convert __fill_setup_invariants_func() to Lark
From: Gabriele Monaco @ 2026-06-09 12:39 UTC (permalink / raw)
To: Nam Cao
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <e5f62eab0210df74ba5dea5c99e218b39d331274.1780908661.git.namcao@linutronix.de>
On Mon, 2026-06-08 at 10:57 +0200, Nam Cao wrote:
> + def __parse_invariant(self, inv):
> + # by default assume the timer has ns expiration
> + clock_type = "ns"
> + if inv.unit == "j":
> + clock_type = "jiffy"
> +
> + env = inv.env + self.enum_suffix
> + val = inv.val.replace("()", "(ha_mon)")
> +
> + match inv.unit:
> + case "us":
> + val *= 10**3
> + case "ms":
> + val *= 10**6
> + case "s":
> + val *= 10**9
> +
I need to add a test case for this.. You can check by using something like 5 us
in stall.dot in place of threshold_jiffies.
val is a string, doing val * 1000 repeats the string 1000 times in python!
We need to convert it like we do in __adjust_value(). Something like:
diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index a080b92334..4d39f229c9 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -244,7 +244,11 @@ class ha2k(dot2k):
clock_type = "jiffy"
env = inv.env + self.enum_suffix
- val = inv.val.replace("()", "(ha_mon)")
+ try:
+ val = int(inv.val)
+ except ValueError:
+ # it's a constant, a parameter or a function
+ val = inv.val.replace("()", "(ha_mon)")
match inv.unit:
case "us":
Thanks,
Gabriele
^ permalink raw reply related
* Re: [PATCH v3 03/13] verification/rvgen: Implement state and transition parser based on Lark
From: Gabriele Monaco @ 2026-06-09 12:54 UTC (permalink / raw)
To: Nam Cao
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <8964bf8b7c20fa1a5b8ef2a7081cf8ba11d70526.1780908661.git.namcao@linutronix.de>
On Mon, 2026-06-08 at 10:56 +0200, Nam Cao wrote:
> +class ConstraintRule:
> + grammar = r'''
> + CMP_OP: "==" | "<=" | "<" | ">=" | ">"
Just realised (well, a local version of sashiko did) that we're missing !=
That's technically supported and present in the docs.
Thanks,
Gabriele
^ permalink raw reply
* Re: [PATCH v3 03/13] verification/rvgen: Implement state and transition parser based on Lark
From: Gabriele Monaco @ 2026-06-09 13:23 UTC (permalink / raw)
To: Nam Cao
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <8964bf8b7c20fa1a5b8ef2a7081cf8ba11d70526.1780908661.git.namcao@linutronix.de>
On Mon, 2026-06-08 at 10:56 +0200, Nam Cao wrote:
> +class ConstraintRule:
> + grammar = r'''
> + rule: condition (OP condition)*
> +
> + OP: "&&" | "||"
> +
> + condition: ENV CMP_OP VAL UNIT?
> +
> + ENV: CNAME
> +
> + CMP_OP: "==" | "<=" | "<" | ">=" | ">"
> +
> + VAL: /[0-9]+/
> + | /[A-Z_]+\(\)/
> + | /[A-Z_]+/
> + | /[a-z_]+\(\)/
> + | /[a-z_]+/
> +
> + UNIT: "ns" | "us" | "ms" | "s"
> + '''
One more (that sashiko couldn't find), we're talking about "j" as a
unit, it should be allowed also on literals (so we need to add it as a
valid UNIT).
Thanks,
Gabriele
^ permalink raw reply
* Re: [PATCH v2 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Steven Rostedt @ 2026-06-09 14:21 UTC (permalink / raw)
To: Shuai Xue
Cc: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Masami Hiramatsu, Mark Rutland, Catalin Marinas,
Chen Pei, Andy Chiu, Björn Töpel, Deepak Gupta,
Puranjay Mohan, Conor Dooley, Josh Poimboeuf, Jiri Kosina,
Miroslav Benes, Petr Mladek, Joe Lawrence, Shuah Khan,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
linux-perf-users
In-Reply-To: <6753385b-c2ab-4ed0-9400-78ce967450ba@linux.alibaba.com>
On Wed, 3 Jun 2026 10:10:44 +0800
Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> It's a pure comment cosmetic, not worth a respin on its own. But for the
I don't know. I've respinned for better comments before.
> rest of the feedback on this series (the frame-record metadata contract
> in patch 2 and the dead state->regs field / Call Trace output change in
> patch 6) are the ones actually worth a new version.
>
> Just to get the routing straight: are you planning to pick this one up
> through the tracing tree on its own?
>
> It feels like a good candidate for that -- it's an independent
> regression fix (Fixes: 0ca1724b56af) that breaks *all* RISC-V dynamic
> ftrace, not just livepatch, so it shouldn't have to wait on the rest of
> the livepatch series.
As it affects RISCV and I don't have a RISCV to test, I would feel more
comfortable with it going through a RISCV tree that can test the patch.
Feel free to add:
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
^ permalink raw reply
* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Steven Rostedt @ 2026-06-09 14:29 UTC (permalink / raw)
To: Jiri Olsa
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, Menglong Dong
In-Reply-To: <20260603110554.29590-4-jolsa@kernel.org>
On Wed, 3 Jun 2026 13:05:27 +0200
Jiri Olsa <jolsa@kernel.org> wrote:
> Renaming __add_hash_entry to add_ftrace_hash_entry and making it global,
> it will be used in following changes outside ftrace.c object.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> include/linux/ftrace.h | 1 +
> kernel/trace/ftrace.c | 9 ++++-----
> 2 files changed, 5 insertions(+), 5 deletions(-)
It's getting late in the RC release and I'm guessing this series will
likely not make the next merge window. The first three patches are
rather trivial and since the rest of the series depends on them, I'm
happy to take them and apply them for the next merge window. That way
you can continue development in the next rotation without needing these
ftrace patches.
Would that work for you?
-- Steve
^ permalink raw reply
* Re: [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-09 14:41 UTC (permalink / raw)
To: Breno Leitao, Miaohe Lin, Andrew Morton, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-2-432a74002e74@debian.org>
On 6/9/26 12:56, Breno Leitao wrote:
> get_any_page() collapses every HWPoisonHandlable() rejection into a
> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
> -> retry path. That is correct for the transient case (a userspace
> folio briefly off LRU during migration or compaction, which a later
> shake can drag back), but wrong for stable kernel-owned pages: slab,
> page-table, large-kmalloc and PG_reserved pages will never become
> HWPoisonHandlable(), so the retry loop is wasted work and the final
> -EIO loses the "this is structurally unrecoverable" information.
> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
> panic-on-unrecoverable sysctl deliberately does not act on.
>
> Introduce HWPoisonKernelOwned(), a small predicate that positively
> identifies pages the hwpoison handler cannot recover from:
>
> HWPoisonKernelOwned(p, flags) :=
> !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
> (PageReserved(p) ||
> PageSlab(head) || PageTable(head) || PageLargeKmalloc(head))
>
> where head = compound_head(p).
>
> PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
> page directly. The slab, page-table and large-kmalloc page-type bits
> are only stored on the head page, so those tests resolve the compound
> head first, then re-read compound_head(page) afterwards: a concurrent
> split or compound free that moves head invalidates the just-read flags
> and the loop retries. The lookup still takes no refcount, mirroring
> the rest of get_any_page(); the recheck closes the common split race,
> and a residual free->alloc->free in the same window can only mis-tag
> a genuinely poisoned page, never reclassify a handlable one.
>
> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
> same exception in HWPoisonHandlable(): soft-offline is allowed to
> migrate movable_ops pages even though they are not on the LRU, and
> we must not pre-empt that with an unrecoverable verdict.
>
> The list is intentionally not exhaustive. vmalloc and kernel-stack
> pages, for example, do not carry a page_type bit and would need a
> different oracle; they keep going through the existing retry path
> unchanged. This is the smallest set we can identify with certainty
> by page type.
>
> Wire the helper into the top of get_any_page() to short-circuit
> those pages before the retry loop runs. On a hit, drop the caller's
> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
> straight away. Pages outside the helper's positive list still take
> the existing retry path and return -EIO, leaving operator-visible
> behaviour for those cases unchanged.
>
> Extend the unhandlable-page pr_err() to fire for either errno and
> update the get_hwpoison_page() kerneldoc to document the new return.
>
> memory_failure() still folds every negative return into
> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> this patch on its own only changes the errno that soft_offline_page()
> can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
> through memory_failure() and reports MF_MSG_KERNEL for the
> unrecoverable cases, which is what the
> panic_on_unrecoverable_memory_failure sysctl observes.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Suggested-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> mm/memory-failure.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f4d3e6e20e13..eed9de387694 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1325,6 +1325,46 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
> return PageLRU(page) || is_free_buddy_page(page);
> }
>
> +/*
> + * Positive identification of pages the hwpoison handler cannot recover.
> + * These page types are owned by kernel internals (no userspace mapping
> + * to unmap, no file mapping to invalidate, no migration target), so the
> + * shake_page() / retry loop in get_any_page() can never turn them into
> + * something HWPoisonHandlable() will accept. Short-circuit them to
> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
> + * spinning through retries that exit as a transient-looking -EIO.
> + *
> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
> + * pages even though they are not on the LRU.
> + */
> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
> +{
> + struct page *head;
> +
> + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
> + return false;
> +
On a second look: Do we really need that? The page types below never support
migration. So I guess that check is not required?
Apart from that, looks good with two comments:
a) HWPoisonKernelOwned: this is not the common style for us to name functions.
is_kernel_owned_page() or sth like that would do.
b) The function doc can likely be simplified a bit. No need to mention the
short-circuit stuff, for example, IMHO.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Kumar Kartikeya Dwivedi @ 2026-06-09 14:43 UTC (permalink / raw)
To: Steven Rostedt, Jiri Olsa
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, Menglong Dong
In-Reply-To: <20260609102922.3c489f76@fedora>
On Tue Jun 9, 2026 at 4:29 PM CEST, Steven Rostedt wrote:
> On Wed, 3 Jun 2026 13:05:27 +0200
> Jiri Olsa <jolsa@kernel.org> wrote:
>
>> Renaming __add_hash_entry to add_ftrace_hash_entry and making it global,
>> it will be used in following changes outside ftrace.c object.
>>
>> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
>> ---
>> include/linux/ftrace.h | 1 +
>> kernel/trace/ftrace.c | 9 ++++-----
>> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> It's getting late in the RC release and I'm guessing this series will
> likely not make the next merge window. The first three patches are
> rather trivial and since the rest of the series depends on them, I'm
> happy to take them and apply them for the next merge window. That way
> you can continue development in the next rotation without needing these
> ftrace patches.
>
> Would that work for you?
Hi Steven,
Version 8 of this set was already applied to bpf-next.
https://lore.kernel.org/bpf/178085644764.273544.8250000589480262551.git-patchwork-notify@kernel.org
>
> -- Steve
^ permalink raw reply
* [ANNOUNCE/CFP] Tracing Microconference at Linux Plumbers
From: Steven Rostedt @ 2026-06-09 15:02 UTC (permalink / raw)
To: LKML, linux-trace-users@vger.kernel.org, Linux trace kernel,
Linux Trace Devel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland,
Arnaldo Carvalho de Melo, Namhyung Kim, Peter Zijlstra,
Ian Rogers, Dylan Hatch, Josh Poimboeuf, Jens Remus, Indu Bhagat,
Beau Belgrave, Tomas Glozar, Gabriele Monaco
The Tracing Microconference is accepting new topic proposals.
https://lpc.events/event/20/contributions/2316/
All Linux kernel tracing related topics are appropriate to discuss. If
there's a Linux kernel tracing issue that you are having or a feature
you would like to see, please submit your problem statement at:
https://lpc.events/event/20/abstracts/
Make sure to select "Tracing MC" in the "Track" pulldown.
Topics may include but not limited to;
o Stack tracing features
o Better lock statistic tracing
o Better system call tracepoint recording
o Updates to the persistent ring buffer
o More features for dynamic events
o Better integration between perf, ftrace and BPF
o Integration with BTF
The above is just ideas for topics, but is not an exhausted list. Feel
free to submit anything you think is relevant.
For what is expected for a microconference, please read the blog on
"The Ideal Microconference Topic Session":
https://lpc.events/blog/current/index.php/2023/06/26/the-ideal-microconference-topic-session/
Note, those that submit topics before July 10th will become eligible to
pre-register for the conference.
The CFP itself closes on August 7th.
Looking forward to seeing you at Linux Plumbers in Prague!
-- Steve
^ permalink raw reply
* Re: [PATCH 1/3] tracing/user_events: Simplify data output in user_seq_show()
From: Steven Rostedt @ 2026-06-09 16:13 UTC (permalink / raw)
To: Markus Elfring
Cc: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers, LKML,
kernel-janitors
In-Reply-To: <b95f18bb-6603-4c22-b224-0226a49d26b3@web.de>
On Thu, 4 Jun 2026 14:07:45 +0200
Markus Elfring <Markus.Elfring@web.de> wrote:
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -2800,8 +2800,7 @@ static int user_seq_show(struct seq_file *m, void *p)
>
> mutex_unlock(&group->reg_mutex);
>
> - seq_puts(m, "\n");
> - seq_printf(m, "Active: %d\n", active);
> + seq_printf(m, "\nActive: %d\n", active);
> seq_printf(m, "Busy: %d\n", busy);
This isn't a critical section and I find the original way easier to
read. But the other two patches are fine.
-- Steve
^ permalink raw reply
* Re: [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-09 16:15 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Miaohe Lin, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <174b8d76-5514-4942-af5d-c975ff95ee03@kernel.org>
On Tue, Jun 09, 2026 at 04:41:01PM +0200, David Hildenbrand (Arm) wrote:
> On 6/9/26 12:56, Breno Leitao wrote:
> > get_any_page() collapses every HWPoisonHandlable() rejection into a
> > single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
> > -> retry path. That is correct for the transient case (a userspace
> > folio briefly off LRU during migration or compaction, which a later
> > shake can drag back), but wrong for stable kernel-owned pages: slab,
> > page-table, large-kmalloc and PG_reserved pages will never become
> > HWPoisonHandlable(), so the retry loop is wasted work and the final
> > -EIO loses the "this is structurally unrecoverable" information.
> > memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
> > panic-on-unrecoverable sysctl deliberately does not act on.
> >
> > Introduce HWPoisonKernelOwned(), a small predicate that positively
> > identifies pages the hwpoison handler cannot recover from:
> >
> > HWPoisonKernelOwned(p, flags) :=
> > !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
> > (PageReserved(p) ||
> > PageSlab(head) || PageTable(head) || PageLargeKmalloc(head))
> >
> > where head = compound_head(p).
> >
> > PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
> > page directly. The slab, page-table and large-kmalloc page-type bits
> > are only stored on the head page, so those tests resolve the compound
> > head first, then re-read compound_head(page) afterwards: a concurrent
> > split or compound free that moves head invalidates the just-read flags
> > and the loop retries. The lookup still takes no refcount, mirroring
> > the rest of get_any_page(); the recheck closes the common split race,
> > and a residual free->alloc->free in the same window can only mis-tag
> > a genuinely poisoned page, never reclassify a handlable one.
> >
> > The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
> > same exception in HWPoisonHandlable(): soft-offline is allowed to
> > migrate movable_ops pages even though they are not on the LRU, and
> > we must not pre-empt that with an unrecoverable verdict.
> >
> > The list is intentionally not exhaustive. vmalloc and kernel-stack
> > pages, for example, do not carry a page_type bit and would need a
> > different oracle; they keep going through the existing retry path
> > unchanged. This is the smallest set we can identify with certainty
> > by page type.
> >
> > Wire the helper into the top of get_any_page() to short-circuit
> > those pages before the retry loop runs. On a hit, drop the caller's
> > MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
> > straight away. Pages outside the helper's positive list still take
> > the existing retry path and return -EIO, leaving operator-visible
> > behaviour for those cases unchanged.
> >
> > Extend the unhandlable-page pr_err() to fire for either errno and
> > update the get_hwpoison_page() kerneldoc to document the new return.
> >
> > memory_failure() still folds every negative return into
> > MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> > this patch on its own only changes the errno that soft_offline_page()
> > can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
> > through memory_failure() and reports MF_MSG_KERNEL for the
> > unrecoverable cases, which is what the
> > panic_on_unrecoverable_memory_failure sysctl observes.
> >
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Suggested-by: Lance Yang <lance.yang@linux.dev>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> > mm/memory-failure.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 58 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index f4d3e6e20e13..eed9de387694 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1325,6 +1325,46 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
> > return PageLRU(page) || is_free_buddy_page(page);
> > }
> >
> > +/*
> > + * Positive identification of pages the hwpoison handler cannot recover.
> > + * These page types are owned by kernel internals (no userspace mapping
> > + * to unmap, no file mapping to invalidate, no migration target), so the
> > + * shake_page() / retry loop in get_any_page() can never turn them into
> > + * something HWPoisonHandlable() will accept. Short-circuit them to
> > + * -ENOTRECOVERABLE so callers can panic on operator request instead of
> > + * spinning through retries that exit as a transient-looking -EIO.
> > + *
> > + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
> > + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
> > + * pages even though they are not on the LRU.
> > + */
> > +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
> > +{
> > + struct page *head;
> > +
> > + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
> > + return false;
> > +
>
> On a second look: Do we really need that? The page types below never support
> migration. So I guess that check is not required?
>
> Apart from that, looks good with two comments:
>
> a) HWPoisonKernelOwned: this is not the common style for us to name functions.
>
> is_kernel_owned_page() or sth like that would do.
Ack, I will rename it is_kernel_owned_page()
In my defence, most of the functions similar to HWPoisonKernelOwned()
has this name format, and I got this discussion earlier (with Lance?
I think). Here are the similar function names in that file:
* HWPoisonHandlable
* PageHWPoisonTakenOff()
* SetPageHWPoisonTakenOff
I will update in the new version.
> b) The function doc can likely be simplified a bit. No need to mention the
> short-circuit stuff, for example, IMHO.
Ack
Thanks for the review,
--breno
^ permalink raw reply
* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-06-09 16:43 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <aif8jkUPst-tTskE@krava>
On Tue, Jun 9, 2026 at 4:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> > On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > redzone area with call instruction storing return address on stack
> > > where user code may keep temporary data without adjusting rsp.
> > >
> > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > instruction, so we can squeeze another instruction to escape the
> > > redzone area before doing the call, like:
> > >
> > > lea -0x80(%rsp), %rsp
> > > call tramp
> > >
> > > Note the lea instruction is used to adjust the rsp register without
> > > changing the flags.
> > >
> > > We use nop10 and following transformation to optimized instructions
> > > above and back as suggested by Peterz [2].
> > >
> > > Optimize path (int3_update_optimize):
> > >
> > > 1) Initial state after set_swbp() installed the uprobe:
> > > cc 2e 0f 1f 84 00 00 00 00 00
> > >
> > > From offset 0 this is INT3 followed by the tail of the original
> > > 10-byte NOP.
> > >
> > > After a previous unoptimization bytes 5..9 may still contain the
> > > old call instruction, which remains valid for threads already there.
> > >
> > > 2) Rewrite the LEA tail and call displacement:
> > > cc [8d 64 24 80 e8 d0 d1 d2 d3]
> > >
> > > From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
> > > executable entry points while byte 0 is trapped.
> > >
> > > 3) Publish the first LEA byte:
> > > [48] 8d 64 24 80 e8 d0 d1 d2 d3
> > >
> > > From offset 0 this is:
> > > lea -0x80(%rsp), %rsp
> > > call <uprobe-trampoline>
> > >
> > > Unoptimize path (int3_update_unoptimize):
> > >
> > > 1) Initial optimized state:
> > > 48 8d 64 24 80 e8 d0 d1 d2 d3
> > > Same as 3) above.
> > >
> > > 2) Trap new entries before restoring the NOP bytes:
> > > [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> > >
> > > From offset 0 this traps. A thread that had already executed the
> > > LEA can still reach the intact CALL at offset 5.
> > >
> > > 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > and byte 5 as CALL.
> > > cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > >
> > > From offset 0 this still traps. Offset 5 is still the CALL for any
> > > thread that was already past the first LEA byte.
> > >
> > > 4) Publish the first byte of the original NOP:
> > > [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> > >
> > > From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > > displacement are now only NOP operands. Offset 5 still decodes as
> > > CALL for a thread that was already there.
> > >
> > > Tthere is only a single target uprobe-trampoline for the given nop10
> > > instruction address, so the CALL instruction will not be changed across
> > > unoptimization/optimization cycles.
> > > Therefore, any task that is preempted at the CALL instruction is guaranteed
> > > to observe that CALL and not anything else.
> > >
> > > Note as explained in [2] we need to use following nop10:
> > > PF1 PF2 ESC NOPL MOD SIB DISP32
> > > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> > >
> > > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > > attribute in is_prefix_bad function.
> > >
> > > Also changing the uprobe syscall error when called out of uprobe
> > > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> > >
> > > The optimized uprobe performance stays the same:
> > >
> > > uprobe-nop : 3.129 ± 0.013M/s
> > > uprobe-push : 3.045 ± 0.006M/s
> > > uprobe-ret : 1.095 ± 0.004M/s
> > > --> uprobe-nop10 : 7.170 ± 0.020M/s
> > > uretprobe-nop : 2.143 ± 0.021M/s
> > > uretprobe-push : 2.090 ± 0.000M/s
> > > uretprobe-ret : 0.942 ± 0.000M/s
> > > --> uretprobe-nop10: 3.381 ± 0.003M/s
> > > usdt-nop : 3.245 ± 0.004M/s
> > > --> usdt-nop10 : 7.256 ± 0.023M/s
> > >
> > > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > > Assisted-by: Codex:GPT-5.5
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > > arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > > 1 file changed, 190 insertions(+), 65 deletions(-)
> > >
> >
> > [...]
> >
> > > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > > smp_text_poke_sync_each_cpu();
> > >
> > > /*
> > > - * Write first byte.
> > > + * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > + * and byte 5 as CALL:
> > > + * cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > + */
> > > + ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > > + err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > > + LEA_INSN_SIZE - 1, verify_insn,
> > > + true /* is_register */, false /* do_update_ref_ctr */,
> >
> > tbh, it's quite subtle and non-obvious why is_register should be set
> > to true first two times (and especially that is_register and
> > do_update_ref_ctr are implicitly connected), not sure how to make it
> > cleaner, but maybe leave a short comment explaining this twice
> > register, once unregister sequence?
>
> ok, I came up with comment below
>
> thanks,
> jirka
>
>
> ---
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index de544516ea70..92449f34c005 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> int err;
>
> /*
> + * Note the first two uprobe_write calls use is_register=true, because they
> + * are intermediate patching states while the probe is still active.
this doesn't really explain why is_register=true is the right one. It
actually doesn't matter as long as do_update_ref_ctr=true, isn't that
right? So maybe just to avoid a bit of confusion let's pass
is_register=false and do_update_ref_ctr=false, and in the comment
explain as you said that it's intermediate update and we don't want to
update refctr just yet until the very last step?
> + *
> + * The last uprobe_write to nop10 instruction is called with is_register=false
> + * and do_update_ref_ctr=true to trigger the refctr update.
> + *
> * 1) Initial optimized state:
> * 48 8d 64 24 80 e8 d0 d1 d2 d3
> *
^ permalink raw reply
* Re: [PATCH 1/3] tracing/user_events: Simplify data output in user_seq_show()
From: Markus Elfring @ 2026-06-09 16:44 UTC (permalink / raw)
To: Steven Rostedt, linux-trace-kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, LKML, kernel-janitors
In-Reply-To: <20260609121348.303ca675@fedora>
>> @@ -2800,8 +2800,7 @@ static int user_seq_show(struct seq_file *m, void *p)
>>
>> mutex_unlock(&group->reg_mutex);
>>
>> - seq_puts(m, "\n");
>> - seq_printf(m, "Active: %d\n", active);
>> + seq_printf(m, "\nActive: %d\n", active);
>> seq_printf(m, "Busy: %d\n", busy);
>
> This isn't a critical section and I find the original way easier to read.
Would you prefer to use a seq_putc() call instead at such a source code place?
https://elixir.bootlin.com/linux/v7.1-rc7/source/kernel/trace/trace_events_user.c#L2803
> But the other two patches are fine.
Thanks for another bit of positive feedback.
Regards,
Markus
^ permalink raw reply
* Re: [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-09 18:41 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <aig7jzwDHfVCxikl@gmail.com>
On 6/9/26 18:15, Breno Leitao wrote:
> On Tue, Jun 09, 2026 at 04:41:01PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/9/26 12:56, Breno Leitao wrote:
>>> get_any_page() collapses every HWPoisonHandlable() rejection into a
>>> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
>>> -> retry path. That is correct for the transient case (a userspace
>>> folio briefly off LRU during migration or compaction, which a later
>>> shake can drag back), but wrong for stable kernel-owned pages: slab,
>>> page-table, large-kmalloc and PG_reserved pages will never become
>>> HWPoisonHandlable(), so the retry loop is wasted work and the final
>>> -EIO loses the "this is structurally unrecoverable" information.
>>> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
>>> panic-on-unrecoverable sysctl deliberately does not act on.
>>>
>>> Introduce HWPoisonKernelOwned(), a small predicate that positively
>>> identifies pages the hwpoison handler cannot recover from:
>>>
>>> HWPoisonKernelOwned(p, flags) :=
>>> !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>>> (PageReserved(p) ||
>>> PageSlab(head) || PageTable(head) || PageLargeKmalloc(head))
>>>
>>> where head = compound_head(p).
>>>
>>> PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
>>> page directly. The slab, page-table and large-kmalloc page-type bits
>>> are only stored on the head page, so those tests resolve the compound
>>> head first, then re-read compound_head(page) afterwards: a concurrent
>>> split or compound free that moves head invalidates the just-read flags
>>> and the loop retries. The lookup still takes no refcount, mirroring
>>> the rest of get_any_page(); the recheck closes the common split race,
>>> and a residual free->alloc->free in the same window can only mis-tag
>>> a genuinely poisoned page, never reclassify a handlable one.
>>>
>>> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
>>> same exception in HWPoisonHandlable(): soft-offline is allowed to
>>> migrate movable_ops pages even though they are not on the LRU, and
>>> we must not pre-empt that with an unrecoverable verdict.
>>>
>>> The list is intentionally not exhaustive. vmalloc and kernel-stack
>>> pages, for example, do not carry a page_type bit and would need a
>>> different oracle; they keep going through the existing retry path
>>> unchanged. This is the smallest set we can identify with certainty
>>> by page type.
>>>
>>> Wire the helper into the top of get_any_page() to short-circuit
>>> those pages before the retry loop runs. On a hit, drop the caller's
>>> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
>>> straight away. Pages outside the helper's positive list still take
>>> the existing retry path and return -EIO, leaving operator-visible
>>> behaviour for those cases unchanged.
>>>
>>> Extend the unhandlable-page pr_err() to fire for either errno and
>>> update the get_hwpoison_page() kerneldoc to document the new return.
>>>
>>> memory_failure() still folds every negative return into
>>> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>>> this patch on its own only changes the errno that soft_offline_page()
>>> can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
>>> through memory_failure() and reports MF_MSG_KERNEL for the
>>> unrecoverable cases, which is what the
>>> panic_on_unrecoverable_memory_failure sysctl observes.
>>>
>>> Suggested-by: David Hildenbrand <david@kernel.org>
>>> Suggested-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 58 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index f4d3e6e20e13..eed9de387694 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1325,6 +1325,46 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>> return PageLRU(page) || is_free_buddy_page(page);
>>> }
>>>
>>> +/*
>>> + * Positive identification of pages the hwpoison handler cannot recover.
>>> + * These page types are owned by kernel internals (no userspace mapping
>>> + * to unmap, no file mapping to invalidate, no migration target), so the
>>> + * shake_page() / retry loop in get_any_page() can never turn them into
>>> + * something HWPoisonHandlable() will accept. Short-circuit them to
>>> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
>>> + * spinning through retries that exit as a transient-looking -EIO.
>>> + *
>>> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
>>> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
>>> + * pages even though they are not on the LRU.
>>> + */
>>> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
>>> +{
>>> + struct page *head;
>>> +
>>> + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
>>> + return false;
>>> +
>>
>> On a second look: Do we really need that? The page types below never support
>> migration. So I guess that check is not required?
>>
>> Apart from that, looks good with two comments:
>>
>> a) HWPoisonKernelOwned: this is not the common style for us to name functions.
>>
>> is_kernel_owned_page() or sth like that would do.
>
> Ack, I will rename it is_kernel_owned_page()
>
> In my defence, most of the functions similar to HWPoisonKernelOwned()
> has this name format, and I got this discussion earlier (with Lance?
> I think). Here are the similar function names in that file:
>
> * HWPoisonHandlable
> * PageHWPoisonTakenOff()
> * SetPageHWPoisonTakenOff
Some of these probably date back to our old way of handling page flags and
things, like PageLRU.
But we really should stop :)
>
> I will update in the new version.
Thanks! Probably best to wait a bit, the merge window is coming up either way,
so this will have to wait a bit either way.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH] tracing: reject invalid preemptirq_delay_test CPU affinity
From: Steven Rostedt @ 2026-06-09 22:16 UTC (permalink / raw)
To: Samuel Moelius
Cc: Masami Hiramatsu, Mathieu Desnoyers, open list:TRACING,
open list:TRACING
In-Reply-To: <20260605004006.2041148-1-sam.moelius@trailofbits.com>
On Fri, 5 Jun 2026 00:40:06 +0000
Samuel Moelius <sam.moelius@trailofbits.com> wrote:
> preemptirq_delay_test accepts cpu_affinity as a module parameter and,
> when it is non-negative, writes that CPU directly into a temporary
> cpumask from the worker thread. Values outside nr_cpu_ids can set a
> bit outside the allocated cpumask before the test reports a normal
> affinity error.
>
> Validate the requested CPU before starting the worker thread, and
> return -EINVAL for invalid affinity requests.
>
> Assisted-by: Codex:gpt-5.5-cyber-preview
> Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
> ---
> kernel/trace/preemptirq_delay_test.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
> index acb0c971a408..0f017799754a 100644
> --- a/kernel/trace/preemptirq_delay_test.c
> +++ b/kernel/trace/preemptirq_delay_test.c
> @@ -14,6 +14,7 @@
> #include <linux/kthread.h>
> #include <linux/module.h>
> #include <linux/printk.h>
> +#include <linux/cpumask.h>
> #include <linux/string.h>
> #include <linux/sysfs.h>
> #include <linux/completion.h>
> @@ -152,6 +153,15 @@ static int preemptirq_run_test(void)
> struct task_struct *task;
> char task_name[50];
>
> + if (cpu_affinity > -1) {
> + unsigned int cpu = cpu_affinity;
> +
> + if (cpu >= nr_cpu_ids || !cpu_possible(cpu)) {
> + pr_err("cpu_affinity:%d, invalid CPU\n", cpu_affinity);
> + return -EINVAL;
Just add the check to the preemptirq_delay_run() function where it
tests affinity. Who cares if it created the thread or not. It's just a
test.
-- Steve
> + }
> + }
> +
> init_completion(&done);
>
> snprintf(task_name, sizeof(task_name), "%s_test", test_mode);
^ permalink raw reply
* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10 0:07 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <CAGsJ_4xk2Q3xTkQYMPb7rnvnvVsjp24uXRcOja5RxqSbQpdoEg@mail.gmail.com>
On 6/9/26 12:44 AM, Barry Song wrote:
> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>
>> LRU add batches can be drained before they reach capacity. This can be a
>> source of LRU lock contention, but it is not currently possible to
>> attribute these drains to callers with existing tracepoints.
>>
>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>> lru_add batch is drained. This allows tracing to distinguish full drains
>> from partial drains and attribute them to the calling stack.
>>
>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>> per-CPU drain work. This captures the requester stack and target CPU for
>> remote drain work. The event is named as a drain-all queue event because
>> the queued work can be needed for batches other than lru_add.
>>
>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>> ---
>> include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>> mm/swap.c | 6 ++++-
>> 2 files changed, 45 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>> index 171524d3526d..ea8fc46bedb0 100644
>> --- a/include/trace/events/pagemap.h
>> +++ b/include/trace/events/pagemap.h
>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>> TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>> );
>>
>> +TRACE_EVENT(mm_lru_add_drain,
>> +
>> + TP_PROTO(int cpu, unsigned int nr),
>> +
>> + TP_ARGS(cpu, nr),
>> +
>> + TP_STRUCT__entry(
>> + __field(int, cpu )
>> + __field(unsigned int, nr )
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->cpu = cpu;
>> + __entry->nr = nr;
>> + ),
>> +
>> + TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>> +);
>> +
>> +TRACE_EVENT(mm_lru_drain_all_queue,
>> +
>> + TP_PROTO(int target_cpu, bool force_all_cpus),
>> +
>> + TP_ARGS(target_cpu, force_all_cpus),
>> +
>> + TP_STRUCT__entry(
>> + __field(int, target_cpu )
>> + __field(bool, force_all_cpus )
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->target_cpu = target_cpu;
>> + __entry->force_all_cpus = force_all_cpus;
>> + ),
>> +
>> + TP_printk("target_cpu=%d force_all_cpus=%s",
>> + __entry->target_cpu,
>> + __entry->force_all_cpus ? "true" : "false")
>> +);
>> +
>> #endif /* _TRACE_PAGEMAP_H */
>>
>> /* This part must be outside protection */
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 588f50d8f1a8..c385b93582eb 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>> {
>> struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>> struct folio_batch *fbatch = &fbatches->lru_add;
>> + unsigned int nr_folios_add = folio_batch_count(fbatch);
>>
>> - if (folio_batch_count(fbatch))
>> + if (nr_folios_add) {
>> folio_batch_move_lru(fbatch, lru_add);
>> + trace_mm_lru_add_drain(cpu, nr_folios_add);
>> + }
>>
>> fbatch = &fbatches->lru_move_tail;
>> /* Disabling interrupts below acts as a compiler barrier. */
>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>> if (cpu_needs_drain(cpu)) {
>> INIT_WORK(work, lru_add_drain_per_cpu);
>> queue_work_on(cpu, mm_percpu_wq, work);
>> + trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>
> Do you need tracing on each CPU individually, or is tracing the
> entire __lru_add_drain_all() invocation sufficient?
I think the latter would be fine. The remote work will invoke the
mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
the event already has the CPU, we could see where queued drains actually
ran.
>
> Do you also need this_gen and lru_drain_gen to be traced?
As trace parameters, I don't think they give meaningful info on who is
making requests to drain all. But since I'm going to move the trace call
from within the CPU loop to earlier in the function we can still see
requestors even if the function exits early because a parallel drain
generation already satisfied the request.
>
> By the way, I'm not sure drain_all_queue is the best name here.
> Why not simply use add_drain_all()? It would match the existing
> function name better.
The new tracepoint name can be mm_lru_add_drain_all.
^ permalink raw reply
* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10 0:16 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <3e55e520-b979-4b1c-874c-3b4e5ca629e2@linux.dev>
On 6/9/26 5:07 PM, JP Kobryn wrote:
> On 6/9/26 12:44 AM, Barry Song wrote:
>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>>
>>> LRU add batches can be drained before they reach capacity. This can be a
>>> source of LRU lock contention, but it is not currently possible to
>>> attribute these drains to callers with existing tracepoints.
>>>
>>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>>> lru_add batch is drained. This allows tracing to distinguish full drains
>>> from partial drains and attribute them to the calling stack.
>>>
>>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>>> per-CPU drain work. This captures the requester stack and target CPU for
>>> remote drain work. The event is named as a drain-all queue event because
>>> the queued work can be needed for batches other than lru_add.
>>>
>>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>>> ---
>>> include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>>> mm/swap.c | 6 ++++-
>>> 2 files changed, 45 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>>> index 171524d3526d..ea8fc46bedb0 100644
>>> --- a/include/trace/events/pagemap.h
>>> +++ b/include/trace/events/pagemap.h
>>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>>> TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>>> );
>>>
>>> +TRACE_EVENT(mm_lru_add_drain,
>>> +
>>> + TP_PROTO(int cpu, unsigned int nr),
>>> +
>>> + TP_ARGS(cpu, nr),
>>> +
>>> + TP_STRUCT__entry(
>>> + __field(int, cpu )
>>> + __field(unsigned int, nr )
>>> + ),
>>> +
>>> + TP_fast_assign(
>>> + __entry->cpu = cpu;
>>> + __entry->nr = nr;
>>> + ),
>>> +
>>> + TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>>> +);
>>> +
>>> +TRACE_EVENT(mm_lru_drain_all_queue,
>>> +
>>> + TP_PROTO(int target_cpu, bool force_all_cpus),
>>> +
>>> + TP_ARGS(target_cpu, force_all_cpus),
>>> +
>>> + TP_STRUCT__entry(
>>> + __field(int, target_cpu )
>>> + __field(bool, force_all_cpus )
>>> + ),
>>> +
>>> + TP_fast_assign(
>>> + __entry->target_cpu = target_cpu;
>>> + __entry->force_all_cpus = force_all_cpus;
>>> + ),
>>> +
>>> + TP_printk("target_cpu=%d force_all_cpus=%s",
>>> + __entry->target_cpu,
>>> + __entry->force_all_cpus ? "true" : "false")
>>> +);
>>> +
>>> #endif /* _TRACE_PAGEMAP_H */
>>>
>>> /* This part must be outside protection */
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 588f50d8f1a8..c385b93582eb 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>>> {
>>> struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>>> struct folio_batch *fbatch = &fbatches->lru_add;
>>> + unsigned int nr_folios_add = folio_batch_count(fbatch);
>>>
>>> - if (folio_batch_count(fbatch))
>>> + if (nr_folios_add) {
>>> folio_batch_move_lru(fbatch, lru_add);
>>> + trace_mm_lru_add_drain(cpu, nr_folios_add);
>>> + }
>>>
>>> fbatch = &fbatches->lru_move_tail;
>>> /* Disabling interrupts below acts as a compiler barrier. */
>>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>>> if (cpu_needs_drain(cpu)) {
>>> INIT_WORK(work, lru_add_drain_per_cpu);
>>> queue_work_on(cpu, mm_percpu_wq, work);
>>> + trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>>
>> Do you need tracing on each CPU individually, or is tracing the
>> entire __lru_add_drain_all() invocation sufficient?
>
> I think the latter would be fine. The remote work will invoke the
> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
> the event already has the CPU, we could see where queued drains actually
> ran.
Actually if it's just a single invocation and the only event data is the
force flag, a tracepoint may not even be needed. Other probes can be
installed on function invocation and read the single argument. I can
drop this from v2 and keep the single mm_lru_add_drain tracepoint.
^ permalink raw reply
* [PATCH v2] mm/lruvec: trace LRU add drains
From: JP Kobryn @ 2026-06-10 0:29 UTC (permalink / raw)
To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park
Cc: linux-kernel, linux-trace-kernel
LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.
Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.
Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
---
v2:
- removed mm_lru_drain_all tracepoint
v1: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
include/trace/events/pagemap.h | 19 +++++++++++++++++++
mm/swap.c | 5 ++++-
2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..fc28745507be 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,25 @@ TRACE_EVENT(mm_lru_activate,
TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
);
+TRACE_EVENT(mm_lru_add_drain,
+
+ TP_PROTO(int cpu, unsigned int nr),
+
+ TP_ARGS(cpu, nr),
+
+ TP_STRUCT__entry(
+ __field(int, cpu )
+ __field(unsigned int, nr )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->nr = nr;
+ ),
+
+ TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
+);
+
#endif /* _TRACE_PAGEMAP_H */
/* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..8fd6808bfc6e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
{
struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
struct folio_batch *fbatch = &fbatches->lru_add;
+ unsigned int nr_folios_add = folio_batch_count(fbatch);
- if (folio_batch_count(fbatch))
+ if (nr_folios_add) {
folio_batch_move_lru(fbatch, lru_add);
+ trace_mm_lru_add_drain(cpu, nr_folios_add);
+ }
fbatch = &fbatches->lru_move_tail;
/* Disabling interrupts below acts as a compiler barrier. */
--
2.54.0
^ permalink raw reply related
* [RFC PATCH v2 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-10 0:51 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
Hi,
Here is the 2nd version of series to introduce more typecast features
to probe events. The previous version is here:
https://lore.kernel.org/all/178092865666.163648.10457567771536160909.stgit@devnote2/
In this version, I fixed various problems Sashiko reviewed and add
a fix of sample code. Also drop +CPU/PCPU() and introduce this_cpu_read().
Steve introduced BTF typecast feature for eprobe[1].
This series extends it and add more options:
1. Expanding BTF typecast to kprobe and fprobe.
(currently only function entry/exit)
2. Introduce container_of like typecast. This adds a "assigned
member" option to the typecast.
(STRUCT,MEMBER)VAR->ANOTHER_MEMBER
This casts VAR to STRUCT type but the VAR is as the address
of STRUCT.MEMBER. In C, it is:
container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
3. Support nested typecast, e.g.
(STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
the nest level must be smaller than 3.
4. Add $current variable to point "current" task_struct.
This is useful with typecast, e.g.
(task_struct)$current->pid
5. per-cpu dereference support.
Intrdouce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
access per-cpu data on the current CPU (accessing other CPU
data is not stable, because it can be changed.)
You can access the member of per-cpu data structure using
typecast like:
(STRUCT)this_cpu_ptr(VAR)->MEMBER
And added a test script to test part of them.
[1] https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/
---
Masami Hiramatsu (Google) (7):
tracing/events: Fix to check the simple_tsk_fn creation
tracing/probes: Support typecast for various probe events
tracing/probes: Support nested typecast
tracing/probes: Support field specifier option for typecast
tracing/probes: Add $current variable support
tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
tracing/probes: Add a new testcase for BTF typecasts
Documentation/trace/eprobetrace.rst | 10
Documentation/trace/fprobetrace.rst | 10
Documentation/trace/kprobetrace.rst | 11 +
kernel/trace/trace.c | 6
kernel/trace/trace_probe.c | 404 +++++++++++++++-----
kernel/trace/trace_probe.h | 18 +
kernel/trace/trace_probe_tmpl.h | 33 +-
samples/trace_events/trace-events-sample.c | 56 ++-
samples/trace_events/trace-events-sample.h | 34 ++
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 +++
10 files changed, 509 insertions(+), 124 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox