* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-05-12 16:47 UTC
To: Jens Remus
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <43158d95-b4c2-44d2-a244-eb546fb2bfaa@linux.ibm.com>
On Fri, 8 May 2026 09:46:30 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> > STACKTRACE_REGISTER_SFRAME - This registers the sframe
> > STACKTRACE_UNREGISTER_SFRAME - This removes the sframe
> >
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
> LGTM. Some comments/questions below.
Note, after talking with people at LSF/MM/BPF, I plan on completely
changing this system call into two distinct ones, and only for sframes.
I'll be sending that later this week.
>
> > diff --git a/include/uapi/linux/stacktrace.h b/include/uapi/linux/stacktrace.h
>
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_STACKTRACE_H
> > +#define _UAPI_LINUX_STACKTRACE_H
> > +
> > +enum stacktrace_setup_types {
> > + STACKTRACE_REGISTER_SFRAME = 1,
> > + STACKTRACE_UNREGISTER_SFRAME = 2,
> > +};
> > +
> > +#endif /* _UAPI_LINUX_STACKTRACE_H */
>
> > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
>
> Having the syscall live in kernel/unwind/sframe.c means it is only
> available if config option HAVE_UNWIND_USER_SFRAME is selected (which
> triggers sframe.o to be built and linked into the kernel), which makes
> sense as long as it only implements sframe-specific functionality.
> I suppose it could be moved elsewhere if non-sframe use cases would
> arise in the future?
The new system calls will only be for sframes. Other unwinders will need to
implement their own system calls.
>
> Would Dylan need to guard it when introducing HAVE_UNWIND_KERNEL_SFRAME?
> Provided the syscall fails with -ENOSYS if not implemented (e.g. when
> HAVE_UNWIND_USER_SFRAME is not enabled) the dummy implementations of
> sframe_add_section() and sframe_remove_section() in linux/sframe.h also
> return -ENOSYS, so the user observable behavior would be the same and
> it would not matter. Do you agree?
I'll reply to that when Dylan's patches get closer to acceptance ;-)
>
> > @@ -12,8 +12,10 @@
> > #include <linux/mm.h>
> > #include <linux/string_helpers.h>
> > #include <linux/sframe.h>
> > +#include <linux/syscalls.h>
> > #include <asm/unwind_user_sframe.h>
> > #include <linux/unwind_user_types.h>
> > +#include <uapi/linux/stacktrace.h>
> >
> > #include "sframe.h"
> > #include "sframe_debug.h"
> > @@ -838,3 +840,38 @@ void sframe_free_mm(struct mm_struct *mm)
> >
> > mtree_destroy(&mm->sframe_mt);
> > }
> > +
> > +/**
> > + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> > + * @op: Type of operation to perform
> > + * @addr_start: The virtual address of the stacktrace information
> > + * @addr_length: The length of the stacktrace information
> > + * @text_start: The virtual address of the text that @addr_start represents
> > + * @text_length: The length of the text
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Currently only sframes are supported, but in the future, this may be used
> > + * to tell the kernel about JIT code which will most likely have a different
> > + * format.
> > + *
> > + * The type command may be extended and parameters may be used for other
> > + * purposes.
> > + *
> > + * Return: 0 if successful, otherwise a negative error.
> > + */
> > +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> > + unsigned long, addr_length, unsigned long, text_start,
> > + unsigned long, text_length)
>
> Would it make sense to keep the parameters generic from start, similar
> to how it is done in prctl()? Or can this be changed later, if the need
> arises?
Following the discussions at LSF/MM/BPF, I'll have the system call take a
pointer to a structure and the size of that structure. The entire API will
then be defined by the structure.
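As a rough illustration of the pointer-plus-size convention (a sketch only:
the struct layout and both names below are hypothetical, since the real
definition has not been posted), the usual way to keep such a structure
extensible is to zero-fill fields an older caller did not supply and to
reject unknown non-zero trailing bytes from a newer caller. That is the
behavior of the kernel's copy_struct_from_user(), modeled here in plain
user-space C:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument layout; the real structure is not in this thread. */
struct sframe_reg_args {
	uint64_t sframe_start;
	uint64_t sframe_length;
	uint64_t text_start;
	uint64_t text_length;
};

/*
 * User-space model of a size-versioned struct copy:
 *  - caller's struct smaller than ours: missing fields read as zero
 *  - caller's struct larger than ours: extra bytes must all be zero,
 *    otherwise the caller is asking for something we do not know
 */
static int copy_versioned_args(void *dst, size_t ksize,
			       const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	for (i = ksize; i < usize; i++)
		if (p[i])
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, n);
	return 0;
}
```

With this shape, new fields can be appended to the structure later without
adding another system call, as long as zero keeps meaning "not used".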
Thanks for reviewing,
-- Steve
>
> SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, arg2,
> unsigned long, arg3, unsigned long, arg4, unsigned long, arg5)
>
> > +{
> > + switch (op) {
> > + case STACKTRACE_REGISTER_SFRAME:
> > + return sframe_add_section(addr_start, addr_start + addr_length,
> > + text_start, text_start+text_length);
>
> Nit:
> text_start, text_start + text_length);
>
> > + case STACKTRACE_UNREGISTER_SFRAME:
> > + return sframe_remove_section(addr_start);
> > + }
> > + return -EINVAL;
> > +}
> Thanks and regards,
> Jens
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-12 16:47 UTC
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz,
mingo, mhiramat
In-Reply-To: <CAEf4Bza9PjbaVjFxYDmWPXXGV+Z-_Hn2Kz_KB2TOa5s-_UJ1xA@mail.gmail.com>
On Mon, May 11, 2026 at 06:41:06PM +0200, Andrii Nakryiko wrote:
> On Sun, May 10, 2026 at 2:25 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
> > >
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > >
> > > Probe site: jmp slot_N (5B, replaces nop5)
> > >
> > > Slot N: lea -128(%rsp), %rsp (5B) skip red zone
> > > push %rcx (1B) save (syscall clobbers)
> > > push %r11 (2B) save (syscall clobbers)
> > > push %rax (1B) save (syscall uses for nr)
> > > mov $336, %eax (5B) uprobe syscall number
> > > syscall (2B)
> > >
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > >
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > >
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > >
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > >
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > >
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > >
> > > Performance (usdt single-thread, M/s):
> > >
> > > usdt-nop usdt-nop5-base usdt-nop5-fix nop5-change iret%
> > > Skylake 3.149 6.422 4.865 -24.3% 39.1%
> > > Milan 2.910 3.443 3.820 +11.0% 24.3%
> > > Sapphire Rapids 1.896 4.023 3.693 -8.2% 24.9%
> > > Bergamo 3.393 3.895 3.849 -1.2% 24.5%
> > >
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > >
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> >
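For reference, the slot lookup described in the commit message above can be
sketched as plain arithmetic. This assumes, as the instruction sizes in the
listing imply (5+1+2+1+5+2 = 16), that syscall is the last instruction of a
16-byte slot, so regs->ip lands exactly on the next slot boundary. The
function and macro names are made up for the illustration and are not taken
from the patch:

```c
#include <stdint.h>

#define SLOT_SIZE	16	/* bytes per trampoline slot */
#define NR_SLOTS	256	/* 256 * 16 = one 4096-byte page */

/*
 * Return the slot index for a post-syscall IP, or -1 if the IP cannot
 * be the end of a slot in the page that starts at tramp_base.
 */
static int slot_index_from_ip(uint64_t tramp_base, uint64_t ip)
{
	uint64_t off;

	if (ip <= tramp_base)
		return -1;

	off = ip - tramp_base;
	/* Must sit on a slot boundary, at most one full page in. */
	if (off % SLOT_SIZE || off > (uint64_t)SLOT_SIZE * NR_SLOTS)
		return -1;

	/* ip points at the end of the slot, so step back by one. */
	return (int)(off / SLOT_SIZE) - 1;
}
```

Note that under this assumption the IP for the last slot is the first byte
after the page (tramp_base + 4096), which is why the bound is inclusive.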
> > hi,
> > thanks a lot for the fix
> >
> > FWIW we discussed also an option to have 10-bytes nop and do:
> > [rsp+0x80, call trampoline]
> >
> > we would not need the slots re-use logic, but not sure what other
> > surprises there are with 10-bytes nop
> >
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> > [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
> >
>
> Yeah, nop10 and this jump-over-nop10 approach is an alternative. I
> don't have strong feelings apart from the ridiculousness of a 10-byte
> nop :)
>
> did you get a chance to benchmark your nop10 approach? curious what the
> numbers look like
yes, it's the same as with the nop5
base:
usermode-count : 152.509 ± 0.044M/s
syscall-count : 15.177 ± 0.021M/s
uprobe-nop : 3.215 ± 0.002M/s
uprobe-push : 3.054 ± 0.003M/s
uprobe-ret : 1.100 ± 0.002M/s
uprobe-nop5 : 7.251 ± 0.034M/s
uretprobe-nop : 2.149 ± 0.012M/s
uretprobe-push : 2.088 ± 0.001M/s
uretprobe-ret : 0.960 ± 0.001M/s
uretprobe-nop5 : 3.402 ± 0.001M/s
usdt-nop : 3.185 ± 0.024M/s
usdt-nop5 : 7.378 ± 0.016M/s
nop10:
usermode-count : 152.503 ± 0.024M/s
syscall-count : 15.977 ± 0.047M/s
uprobe-nop : 3.174 ± 0.011M/s
uprobe-push : 3.030 ± 0.006M/s
uprobe-ret : 1.124 ± 0.004M/s
uprobe-nop5 : 7.201 ± 0.012M/s
uretprobe-nop : 2.141 ± 0.005M/s
uretprobe-push : 2.078 ± 0.007M/s
uretprobe-ret : 0.947 ± 0.003M/s
uretprobe-nop5 : 3.384 ± 0.014M/s
usdt-nop : 3.247 ± 0.002M/s
usdt-nop5 : 7.374 ± 0.027M/s
jirka
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-12 17:06 UTC
To: Masami Hiramatsu
Cc: Jiri Olsa, Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz,
mingo
In-Reply-To: <20260512141431.a70375744fdae263bda5b722@kernel.org>
On Tue, May 12, 2026 at 02:14:31PM +0900, Masami Hiramatsu wrote:
> On Sun, 10 May 2026 23:25:26 +0200
> Jiri Olsa <olsajiri@gmail.com> wrote:
>
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
> > >
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > >
> > > Probe site: jmp slot_N (5B, replaces nop5)
> > >
> > > Slot N: lea -128(%rsp), %rsp (5B) skip red zone
> > > push %rcx (1B) save (syscall clobbers)
> > > push %r11 (2B) save (syscall clobbers)
> > > push %rax (1B) save (syscall uses for nr)
> > > mov $336, %eax (5B) uprobe syscall number
> > > syscall (2B)
> > >
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > >
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > >
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > >
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > >
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > >
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > >
> > > Performance (usdt single-thread, M/s):
> > >
> > > usdt-nop usdt-nop5-base usdt-nop5-fix nop5-change iret%
> > > Skylake 3.149 6.422 4.865 -24.3% 39.1%
> > > Milan 2.910 3.443 3.820 +11.0% 24.3%
> > > Sapphire Rapids 1.896 4.023 3.693 -8.2% 24.9%
> > > Bergamo 3.393 3.895 3.849 -1.2% 24.5%
> > >
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > >
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> >
> > hi,
> > thanks a lot for the fix
> >
> > FWIW we discussed also an option to have 10-bytes nop and do:
> > [rsp+0x80, call trampoline]
> >
> > we would not need the slots re-use logic, but not sure what other
> > surprises there are with 10-bytes nop
>
> Does this mean we have to update the USDT implementation?
it's the optimized uprobe code, which is used for usdt probes that emit nop5
instead of a single nop
>
> >
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> > [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
>
> Yeah, but at that moment, we know where the modified code is.
> Maybe memory dump shows different code, but that is also true
> if uprobe is active. So I think it is OK.
hum, I'm not sure what you mean.. I attached the kernel part of my changes
below, if you want to comment on top of that
the whole change, including the user space changes, is here:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=redzone_fix
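To make the byte layout easier to follow, here is a small user-space sketch
of the two sequences the patch below writes over the 10-byte nop: the
optimized form (lea to skip the red zone, then call rel32 to the trampoline)
and the unoptimized form (jmp rel8 to the end of the 10 bytes). The lea bytes
and sizes are copied from the patch; the two function names are only for this
illustration, and the reachability check is a simplified stand-in for the
kernel one:

```c
#include <stdint.h>
#include <string.h>

#define LEA_INSN_SIZE		5
#define CALL_INSN_SIZE		5
#define UPROBE_OPT_INSN_SIZE	(LEA_INSN_SIZE + CALL_INSN_SIZE)

/* lea -0x80(%rsp), %rsp (disp8 0x80 is -128) */
static const uint8_t lea_rsp[LEA_INSN_SIZE] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/* Build "lea; call tramp" for a probe at vaddr; -1 if out of rel32 range. */
static int encode_optimized(uint8_t out[UPROBE_OPT_INSN_SIZE],
			    uint64_t vaddr, uint64_t tramp)
{
	/* rel32 is relative to the end of the call, i.e. vaddr + 10. */
	int64_t delta = (int64_t)(tramp - (vaddr + UPROBE_OPT_INSN_SIZE));
	uint32_t rel;

	if (delta < INT32_MIN || delta > INT32_MAX)
		return -1;

	rel = (uint32_t)(int32_t)delta;
	memcpy(out, lea_rsp, LEA_INSN_SIZE);
	out[5] = 0xe8;			/* call rel32 opcode */
	out[6] = rel & 0xff;		/* little-endian displacement */
	out[7] = (rel >> 8) & 0xff;
	out[8] = (rel >> 16) & 0xff;
	out[9] = (rel >> 24) & 0xff;
	return 0;
}

/* Build "jmp rel8" that skips the remaining 8 bytes of the 10-byte area. */
static void encode_unoptimized(uint8_t out[2])
{
	out[0] = 0xeb;				/* jmp rel8 opcode */
	out[1] = UPROBE_OPT_INSN_SIZE - 2;	/* from vaddr + 2 to vaddr + 10 */
}
```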
jirka
---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..a6db7b76cb49 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -636,9 +636,20 @@ struct uprobe_trampoline {
unsigned long vaddr;
};
+static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
+
+#define LEA_INSN_SIZE 5
+#define UPROBE_OPT_INSN_SIZE (LEA_INSN_SIZE + CALL_INSN_SIZE)
+#define REDZONE_SIZE 0x80
+
+static bool is_lea_insn(const uprobe_opcode_t *insn)
+{
+ return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
+}
+
static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
{
- long delta = (long)(vaddr + 5 - vtramp);
+ long delta = (long)(vaddr + UPROBE_OPT_INSN_SIZE - vtramp);
return delta >= INT_MIN && delta <= INT_MAX;
}
@@ -651,7 +662,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
};
unsigned long low_limit, high_limit;
unsigned long low_tramp, high_tramp;
- unsigned long call_end = vaddr + 5;
+ unsigned long call_end = vaddr + UPROBE_OPT_INSN_SIZE;
if (check_add_overflow(call_end, INT_MIN, &low_limit))
low_limit = PAGE_SIZE;
@@ -826,8 +837,8 @@ SYSCALL_DEFINE0(uprobe)
regs->ax = args.ax;
regs->r11 = args.r11;
regs->cx = args.cx;
- regs->ip = args.retaddr - 5;
- regs->sp += sizeof(args);
+ regs->ip = args.retaddr - UPROBE_OPT_INSN_SIZE;
+ regs->sp += sizeof(args) + REDZONE_SIZE;
regs->orig_ax = -1;
sp = regs->sp;
@@ -844,12 +855,12 @@ SYSCALL_DEFINE0(uprobe)
*/
if (regs->sp != sp) {
/* skip the trampoline call */
- if (args.retaddr - 5 == regs->ip)
- regs->ip += 5;
+ if (args.retaddr - UPROBE_OPT_INSN_SIZE == regs->ip)
+ regs->ip += UPROBE_OPT_INSN_SIZE;
return regs->ax;
}
- regs->sp -= sizeof(args);
+ regs->sp -= sizeof(args) + REDZONE_SIZE;
/* for the case uprobe_consumer has changed ax/r11/cx */
args.ax = regs->ax;
@@ -857,7 +868,7 @@ SYSCALL_DEFINE0(uprobe)
args.cx = regs->cx;
/* keep return address unless we are instructed otherwise */
- if (args.retaddr - 5 != regs->ip)
+ if (args.retaddr - UPROBE_OPT_INSN_SIZE != regs->ip)
args.retaddr = regs->ip;
if (shstk_push(args.retaddr) == -EFAULT)
@@ -891,7 +902,7 @@ asm (
"pop %rax\n"
"pop %r11\n"
"pop %rcx\n"
- "ret\n"
+ "ret $0x80\n"
"int3\n"
".balign " __stringify(PAGE_SIZE) "\n"
".popsection\n"
@@ -930,9 +941,9 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
int nbytes, void *data)
{
struct write_opcode_ctx *ctx = data;
- uprobe_opcode_t old_opcode[5];
+ uprobe_opcode_t old_opcode[UPROBE_OPT_INSN_SIZE];
- uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+ uprobe_copy_from_page(page, ctx->base, old_opcode, UPROBE_OPT_INSN_SIZE);
switch (ctx->expect) {
case EXPECT_SWBP:
@@ -940,7 +951,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
return 1;
break;
case EXPECT_CALL:
- if (is_call_insn(&old_opcode[0]))
+ if (is_lea_insn(old_opcode))
return 1;
break;
}
@@ -963,7 +974,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
* - SMP sync all CPUs
*/
static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
- unsigned long vaddr, char *insn, bool optimize)
+ unsigned long vaddr, char *insn, int size, bool optimize)
{
uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
struct write_opcode_ctx ctx = {
@@ -990,7 +1001,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
/* Write all but the first byte of the patched range. */
ctx.expect = EXPECT_SWBP;
- err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+ err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, size - 1, verify_insn,
true /* is_register */, false /* do_update_ref_ctr */,
&ctx);
if (err)
@@ -1017,17 +1028,35 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr, unsigned long tramp)
{
- u8 call[5];
+ u8 insn[UPROBE_OPT_INSN_SIZE];
- __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
- (const void *) tramp, CALL_INSN_SIZE);
- return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+ /*
+ * We have nop10 (with first byte overwritten to int3),
+ * change it to:
+ * lea -0x80(%rsp), %rsp
+ * call tramp
+ *
+ * The first lea instruction skips the stack redzone so the call
+ * instruction can safely push return address on stack.
+ */
+ memcpy(insn, lea_rsp, LEA_INSN_SIZE);
+ __text_gen_insn(insn + LEA_INSN_SIZE, CALL_INSN_OPCODE,
+ (const void *)(vaddr + LEA_INSN_SIZE),
+ (const void *)tramp, CALL_INSN_SIZE);
+ return int3_update(auprobe, vma, vaddr, insn, UPROBE_OPT_INSN_SIZE, true /* optimize */);
}
static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr)
{
- return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+ /*
+ * Write JMP rel8 to end of the 10-byte slot instead of restoring the
+ * original nop10, because a thread could already be inside the lea
+ * instruction.
+ */
+ u8 jmp[UPROBE_OPT_INSN_SIZE] = { 0xeb, UPROBE_OPT_INSN_SIZE - 2 };
+
+ return int3_update(auprobe, vma, vaddr, jmp, 2, false /* optimize */);
}
static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
@@ -1049,19 +1078,21 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
struct __packed __arch_relative_insn {
u8 op;
s32 raddr;
- } *call = (struct __arch_relative_insn *) insn;
+ } *call = (struct __arch_relative_insn *)(insn + LEA_INSN_SIZE);
- if (!is_call_insn(insn))
+ if (!is_lea_insn(insn))
return false;
- return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+ if (!is_call_insn((uprobe_opcode_t *) call))
+ return false;
+ return __in_uprobe_trampoline(vaddr + UPROBE_OPT_INSN_SIZE + call->raddr);
}
static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
{
- uprobe_opcode_t insn[5];
+ uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
int err;
- err = copy_from_vaddr(mm, vaddr, &insn, 5);
+ err = copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE);
if (err)
return err;
return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
@@ -1131,7 +1162,7 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
{
struct mm_struct *mm = current->mm;
- uprobe_opcode_t insn[5];
+ uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
if (!should_optimize(auprobe))
return;
@@ -1142,7 +1173,7 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
* Check if some other thread already optimized the uprobe for us,
* if it's the case just go away silently.
*/
- if (copy_from_vaddr(mm, vaddr, &insn, 5))
+ if (copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE))
goto unlock;
if (!is_swbp_insn((uprobe_opcode_t*) &insn))
goto unlock;
@@ -1160,14 +1191,23 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
static bool can_optimize(struct insn *insn, unsigned long vaddr)
{
- if (!insn->x86_64 || insn->length != 5)
+ if (!insn->x86_64)
return false;
- if (!insn_is_nop(insn))
+ /* We can't do cross page atomic writes yet. */
+ if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < UPROBE_OPT_INSN_SIZE)
return false;
- /* We can't do cross page atomic writes yet. */
- return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+ if (insn->length == UPROBE_OPT_INSN_SIZE && insn_is_nop(insn))
+ return true;
+
+ /* JMP rel8 to end of slot — written by swbp_unoptimize. */
+ if (insn->length == 2 &&
+ insn->opcode.bytes[0] == 0xEB &&
+ insn->immediate.value == UPROBE_OPT_INSN_SIZE - 2)
+ return true;
+
+ return false;
}
#else /* 32-bit: */
/*
* [PATCH v2 0/2] Add tracepoints support for Qualcomm GENI Serial drivers
From: Praveen Talari @ 2026-05-12 17:14 UTC
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
Add tracepoints to the Qualcomm GENI (Generic Interface) serial driver.
These trace events enable runtime debugging and performance analysis of
UART operations.
The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and the actual transmitted/received data
in hexadecimal format.
Usage examples:
Enable all serial traces:
echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
cat /sys/kernel/debug/tracing/trace_pipe
Example trace output:
2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
uart_manual_rfr=0x80000002
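The data= field in the tx_data/rx_data events is a space-separated hex dump
of the raw bytes, so the payload in the output above (61 62 ... 68) is the
ASCII string "abcdefgh". A small user-space sketch of that formatting follows;
the helper name is made up for the illustration and this is not the kernel's
__print_hex implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format len bytes from buf as lowercase hex pairs separated by single
 * spaces. Returns the string length, or -1 if out is too small.
 */
static int format_hex(const uint8_t *buf, size_t len, char *out, size_t outsz)
{
	size_t i, pos = 0;

	if (outsz == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; i < len; i++) {
		/* No leading space before the first byte. */
		int n = snprintf(out + pos, outsz - pos,
				 i ? " %02x" : "%02x", buf[i]);

		if (n < 0 || (size_t)n >= outsz - pos)
			return -1;
		pos += (size_t)n;
	}
	return (int)pos;
}
```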
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v2:
- removed multiple trace events for TX/RX events, instead used
DECLARE_EVENT_CLASS and DEFINE_EVENT.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-serial-v1-0-544b22612e08@oss.qualcomm.com
---
Praveen Talari (2):
serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
drivers/tty/serial/qcom_geni_serial.c | 27 ++++-
include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
2 files changed, 195 insertions(+), 4 deletions(-)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260427-add-tracepoints-for-qcom-geni-serial-948777218b7b
Best regards,
--
Praveen Talari <praveen.talari@oss.qualcomm.com>
* [PATCH v2 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Praveen Talari @ 2026-05-12 17:14 UTC
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com>
Add tracepoint support to the Qualcomm GENI serial driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.
The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and TX/RX data, making it easier to
diagnose communication issues in the field.
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed multiple TX/RX trace events, instead used
DECLARE_EVENT_CLASS and DEFINE_EVENT.
---
include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
1 file changed, 172 insertions(+)
diff --git a/include/trace/events/qcom_geni_serial.h b/include/trace/events/qcom_geni_serial.h
new file mode 100644
index 000000000000..5e23827881d0
--- /dev/null
+++ b/include/trace/events/qcom_geni_serial.h
@@ -0,0 +1,172 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_serial
+
+#if !defined(_TRACE_QCOM_GENI_SERIAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SERIAL_H
+
+#include <linux/device.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_serial_set_termios,
+ TP_PROTO(struct device *dev, unsigned int baud,
+ unsigned int bits_per_char, u32 tx_trans_cfg,
+ u32 tx_parity_cfg, u32 rx_trans_cfg,
+ u32 rx_parity_cfg, u32 stop_bit_len),
+ TP_ARGS(dev, baud, bits_per_char, tx_trans_cfg, tx_parity_cfg,
+ rx_trans_cfg, rx_parity_cfg, stop_bit_len),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned int, baud)
+ __field(unsigned int, bits_per_char)
+ __field(u32, tx_trans_cfg)
+ __field(u32, tx_parity_cfg)
+ __field(u32, rx_trans_cfg)
+ __field(u32, rx_parity_cfg)
+ __field(u32, stop_bit_len)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->baud = baud;
+ __entry->bits_per_char = bits_per_char;
+ __entry->tx_trans_cfg = tx_trans_cfg;
+ __entry->tx_parity_cfg = tx_parity_cfg;
+ __entry->rx_trans_cfg = rx_trans_cfg;
+ __entry->rx_parity_cfg = rx_parity_cfg;
+ __entry->stop_bit_len = stop_bit_len;
+ ),
+
+ TP_printk("%s: baud=%u bpc=%u tx_trans=0x%08x tx_par=0x%08x rx_trans=0x%08x rx_par=0x%08x stop=%u",
+ __get_str(name), __entry->baud, __entry->bits_per_char,
+ __entry->tx_trans_cfg, __entry->tx_parity_cfg,
+ __entry->rx_trans_cfg, __entry->rx_parity_cfg,
+ __entry->stop_bit_len)
+);
+
+TRACE_EVENT(geni_serial_clk_cfg,
+ TP_PROTO(struct device *dev, unsigned int desired_rate,
+ unsigned long clk_rate, unsigned int clk_div,
+ unsigned int clk_idx),
+ TP_ARGS(dev, desired_rate, clk_rate, clk_div, clk_idx),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned int, desired_rate)
+ __field(unsigned long, clk_rate)
+ __field(unsigned int, clk_div)
+ __field(unsigned int, clk_idx)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->desired_rate = desired_rate;
+ __entry->clk_rate = clk_rate;
+ __entry->clk_div = clk_div;
+ __entry->clk_idx = clk_idx;
+ ),
+
+ TP_printk("%s: desired_rate=%u clk_rate=%lu clk_div=%u clk_idx=%u",
+ __get_str(name), __entry->desired_rate, __entry->clk_rate,
+ __entry->clk_div, __entry->clk_idx)
+);
+
+TRACE_EVENT(geni_serial_irq,
+ TP_PROTO(struct device *dev, u32 m_irq, u32 s_irq,
+ u32 dma_tx, u32 dma_rx),
+ TP_ARGS(dev, m_irq, s_irq, dma_tx, dma_rx),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(u32, m_irq)
+ __field(u32, s_irq)
+ __field(u32, dma_tx)
+ __field(u32, dma_rx)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->m_irq = m_irq;
+ __entry->s_irq = s_irq;
+ __entry->dma_tx = dma_tx;
+ __entry->dma_rx = dma_rx;
+ ),
+
+ TP_printk("%s: m_irq=0x%08x s_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+ __get_str(name), __entry->m_irq, __entry->s_irq,
+ __entry->dma_tx, __entry->dma_rx)
+);
+
+DECLARE_EVENT_CLASS(geni_serial_data,
+
+ TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+ TP_ARGS(dev, buf, len),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned int, len)
+ __dynamic_array(u8, data, len)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->len = len;
+ memcpy(__get_dynamic_array(data), buf, len);
+ ),
+
+ TP_printk("%s: len=%u data=%s",
+ __get_str(name), __entry->len,
+ __print_hex(__get_dynamic_array(data), __entry->len))
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_tx_data,
+
+ TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+ TP_ARGS(dev, buf, len)
+
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_rx_data,
+
+ TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+ TP_ARGS(dev, buf, len)
+
+);
+
+TRACE_EVENT(geni_serial_set_mctrl,
+ TP_PROTO(struct device *dev, unsigned int mctrl,
+ u32 uart_manual_rfr),
+ TP_ARGS(dev, mctrl, uart_manual_rfr),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned int, mctrl)
+ __field(u32, uart_manual_rfr)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->mctrl = mctrl;
+ __entry->uart_manual_rfr = uart_manual_rfr;
+ ),
+
+ TP_printk("%s: mctrl=0x%04x uart_manual_rfr=0x%08x",
+ __get_str(name), __entry->mctrl, __entry->uart_manual_rfr)
+);
+
+TRACE_EVENT(geni_serial_get_mctrl,
+ TP_PROTO(struct device *dev, unsigned int mctrl, u32 geni_ios),
+ TP_ARGS(dev, mctrl, geni_ios),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned int, mctrl)
+ __field(u32, geni_ios)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->mctrl = mctrl;
+ __entry->geni_ios = geni_ios;
+ ),
+
+ TP_printk("%s: mctrl=0x%04x geni_ios=0x%08x",
+ __get_str(name), __entry->mctrl, __entry->geni_ios)
+);
+
+#endif /* _TRACE_QCOM_GENI_SERIAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.34.1
* [PATCH v2 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Praveen Talari @ 2026-05-12 17:14 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com>
Add tracing to the Qualcomm GENI serial driver to improve runtime
observability.
Trace hooks are added at key points including termios and clock
configuration, manual control get/set, interrupt handling, and data
TX/RX paths.
Usage examples:
Enable all serial traces:
echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
cat /sys/kernel/debug/tracing/trace_pipe
Example trace output:
2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: len=8 data=61 62 63
64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: len=8 data=61 62 63
64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
uart_manual_rfr=0x80000002
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index e6b0a55f0cfb..9e2de074d799 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -7,6 +7,9 @@
/* Disable MMIO tracing to prevent excessive logging of unwanted MMIO traces */
#define __DISABLE_TRACE_MMIO__
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_serial.h>
+
#include <linux/clk.h>
#include <linux/console.h>
#include <linux/io.h>
@@ -225,7 +228,7 @@ static void qcom_geni_serial_config_port(struct uart_port *uport, int cfg_flags)
static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
{
unsigned int mctrl = TIOCM_DSR | TIOCM_CAR;
- u32 geni_ios;
+ u32 geni_ios = 0;
if (uart_console(uport)) {
mctrl |= TIOCM_CTS;
@@ -235,6 +238,8 @@ static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
mctrl |= TIOCM_CTS;
}
+ trace_geni_serial_get_mctrl(uport->dev, mctrl, geni_ios);
+
return mctrl;
}
@@ -253,6 +258,8 @@ static void qcom_geni_serial_set_mctrl(struct uart_port *uport,
if (!(mctrl & TIOCM_RTS) && !uport->suspended)
uart_manual_rfr = UART_MANUAL_RFR_EN | UART_RFR_NOT_READY;
writel(uart_manual_rfr, uport->membase + SE_UART_MANUAL_RFR);
+
+ trace_geni_serial_set_mctrl(uport->dev, mctrl, uart_manual_rfr);
}
static const char *qcom_geni_serial_get_type(struct uart_port *uport)
@@ -683,6 +690,8 @@ static void qcom_geni_serial_start_tx_dma(struct uart_port *uport)
xmit_size = kfifo_out_linear_ptr(&tport->xmit_fifo, &tail,
UART_XMIT_SIZE);
+ trace_geni_serial_tx_data(uport->dev, tail, xmit_size);
+
qcom_geni_set_rs485_mode(uport, SER_RS485_RTS_ON_SEND);
qcom_geni_serial_setup_tx(uport, xmit_size);
@@ -909,8 +918,10 @@ static void qcom_geni_serial_handle_rx_dma(struct uart_port *uport, bool drop)
return;
}
- if (!drop)
+ if (!drop) {
+ trace_geni_serial_rx_data(uport->dev, port->rx_buf, rx_in);
handle_rx_uart(uport, rx_in);
+ }
ret = geni_se_rx_dma_prep(&port->se, port->rx_buf,
DMA_RX_BUF_SIZE,
@@ -1069,6 +1080,10 @@ static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
geni_status = readl(uport->membase + SE_GENI_STATUS);
dma = readl(uport->membase + SE_GENI_DMA_MODE_EN);
m_irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
+
+ trace_geni_serial_irq(uport->dev, m_irq_status, s_irq_status,
+ dma_tx_status, dma_rx_status);
+
writel(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
writel(s_irq_status, uport->membase + SE_GENI_S_IRQ_CLEAR);
writel(dma_tx_status, uport->membase + SE_DMA_TX_IRQ_CLR);
@@ -1281,8 +1296,8 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
return -EINVAL;
}
- dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u\n, clk_idx = %u\n",
- baud * sampling_rate, clk_rate, clk_div, clk_idx);
+ trace_geni_serial_clk_cfg(uport->dev, baud * sampling_rate, clk_rate,
+ clk_div, clk_idx);
uport->uartclk = clk_rate;
port->clk_rate = clk_rate;
@@ -1432,6 +1447,10 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
writel(bits_per_char, uport->membase + SE_UART_TX_WORD_LEN);
writel(bits_per_char, uport->membase + SE_UART_RX_WORD_LEN);
writel(stop_bit_len, uport->membase + SE_UART_TX_STOP_BIT_LEN);
+
+ trace_geni_serial_set_termios(uport->dev, baud, bits_per_char,
+ tx_trans_cfg, tx_parity_cfg, rx_trans_cfg,
+ rx_parity_cfg, stop_bit_len);
}
#ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE
--
2.34.1
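For readers comparing the example output with the geni_serial_data event class, the data= field can be approximated in userspace. This is only a sketch of the formatting (two hex digits per byte, separated by spaces), not the kernel's __print_hex() implementation; format_hex_bytes() is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace approximation of how the data= field is rendered: each byte
 * as two lowercase hex digits, separated by single spaces.
 * Returns the number of characters written (excluding the NUL), or -1
 * if the output buffer is too small.
 */
static int format_hex_bytes(char *out, size_t out_len,
			    const unsigned char *buf, size_t len)
{
	size_t pos = 0;
	size_t i;

	if (out_len == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; i < len; i++) {
		int n = snprintf(out + pos, out_len - pos, "%s%02x",
				 i ? " " : "", (unsigned int)buf[i]);

		/* snprintf() reports the length it wanted to write */
		if (n < 0 || (size_t)n >= out_len - pos)
			return -1;
		pos += (size_t)n;
	}
	return (int)pos;
}
```

Feeding it the eight bytes "abcdefgh" gives the same byte string as the tx/rx lines in the example trace output.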
* [PATCH v2] rtla: Stop the record trace on interrupt
From: Crystal Wood @ 2026-05-12 17:37 UTC (permalink / raw)
To: Tomas Glozar
Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
Wander Lairson Costa, Crystal Wood
Before, when rtla got a signal, it stopped the main trace but not the
record trace. With "--on-end trace", this can lead to
save_trace_to_file() failing to keep up, especially on a debug kernel.
Plus, it adds post-stoppage noise to the trace file.
Signed-off-by: Crystal Wood <crwood@redhat.com>
---
v2: clarify that this matters for --on-end trace
tools/tracing/rtla/src/common.c | 19 +++++++++++--------
tools/tracing/rtla/src/common.h | 1 -
tools/tracing/rtla/src/timerlat.c | 2 +-
3 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..effad523e8cf 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -10,7 +10,7 @@
#include "common.h"
-struct trace_instance *trace_inst;
+struct osnoise_tool *trace_tool;
volatile int stop_tracing;
int nr_cpus;
@@ -21,12 +21,16 @@ static void stop_trace(int sig)
* Stop requested twice in a row; abort event processing and
* exit immediately
*/
- tracefs_iterate_stop(trace_inst->inst);
+ if (trace_tool)
+ tracefs_iterate_stop(trace_tool->trace.inst);
return;
}
stop_tracing = 1;
- if (trace_inst)
- trace_instance_stop(trace_inst);
+ if (trace_tool) {
+ trace_instance_stop(&trace_tool->trace);
+ if (trace_tool->record)
+ trace_instance_stop(&trace_tool->record->trace);
+ }
}
/*
@@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
tool->params = params;
/*
- * Save trace instance into global variable so that SIGINT can stop
- * the timerlat tracer.
+ * Expose the tool to signal handlers so they can stop the trace.
* Otherwise, rtla could loop indefinitely when overloaded.
*/
- trace_inst = &tool->trace;
+ trace_tool = tool;
retval = ops->apply_config(tool);
if (retval) {
@@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
goto out_free;
}
- retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
+ retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
if (retval) {
err_msg("Failed to enable %s tracer\n", ops->tracer);
goto out_free;
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..eba40b6d9504 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -54,7 +54,6 @@ struct osnoise_context {
int opt_workload;
};
-extern struct trace_instance *trace_inst;
extern volatile int stop_tracing;
struct hist_params {
diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
index f8c057518d22..637f68d684f5 100644
--- a/tools/tracing/rtla/src/timerlat.c
+++ b/tools/tracing/rtla/src/timerlat.c
@@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
* If the trace did not stop with --aa-only, at least print
* the max known latency.
*/
- max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
+ max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);
if (max_lat) {
printf(" Max latency was %s\n", max_lat);
free(max_lat);
--
2.54.0
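The first-signal path of the patched handler can be modelled in userspace roughly as follows. This is a sketch only: struct tool_model, struct trace_instance_model and instance_stop() are hypothetical stand-ins for rtla's real structures and trace_instance_stop(), reduced to the fields needed to show the NULL checks and the extra stop of the record instance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for rtla's structures; only what is needed here. */
struct trace_instance_model {
	int running;
};

struct tool_model {
	struct trace_instance_model trace;
	struct tool_model *record;	/* NULL when there is no record instance */
};

/* Set once the tool exists, so the handler can reach it. */
static struct tool_model *trace_tool_model;
static volatile int stop_requested;

static void instance_stop(struct trace_instance_model *inst)
{
	inst->running = 0;
}

/* Models the first-signal path: stop the main trace, then the record trace. */
static void stop_trace_model(void)
{
	stop_requested = 1;
	if (trace_tool_model) {
		instance_stop(&trace_tool_model->trace);
		if (trace_tool_model->record)
			instance_stop(&trace_tool_model->record->trace);
	}
}
```

Calling stop_trace_model() before the global is set only records the stop request, which matches the NULL check the patch adds.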
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: jane.chu @ 2026-05-12 17:58 UTC (permalink / raw)
To: David Hildenbrand (Arm), Breno Leitao, Miaohe Lin,
Naoya Horiguchi, Andrew Morton, Jonathan Corbet, Shuah Khan,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>
On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
> On 5/11/26 17:38, Breno Leitao wrote:
>> When get_hwpoison_page() returns a negative value, distinguish
>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>> and should be classified accordingly for proper handling.
>>
>> Sample PG_reserved before the get_hwpoison_page() call. In the
>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>> reference before returning -EIO, after which the underlying page may
>> have been freed and reallocated with page->flags reset; reading
>> PageReserved(p) at that point would observe stale or unrelated state.
>> The pre-call snapshot reflects what the page actually was at the
>> time of the failure event.
>>
>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>> mm/memory-failure.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 866c4428ac7ef..f112fb27a8ff6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>> unsigned long page_flags;
>> bool retry = true;
>> int hugetlb = 0;
>> + bool is_reserved;
>>
>> if (!sysctl_memory_failure_recovery)
>> panic("Memory failure on page %lx", pfn);
>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>> * In fact it's dangerous to directly bump up page count from 0,
>> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>> */
>> + /*
>> + * Pages with PG_reserved set are not currently managed by the
>> + * page allocator (memblock-reserved memory, driver reservations,
>> + * etc.), so classify them as kernel-owned for reporting.
>> + *
>> + * Sample the flag before get_hwpoison_page(): in the
>> + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>> + * reference before returning -EIO, after which page->flags may
>> + * have been reset by the allocator.
>> + */
>> + is_reserved = PageReserved(p);
>> +
>> res = get_hwpoison_page(p, flags);
>> if (!res) {
>> if (is_free_buddy_page(p)) {
>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>> }
>> goto unlock_mutex;
>> } else if (res < 0) {
>> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>> + if (is_reserved)
>> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>> + else
>> + res = action_result(pfn, MF_MSG_GET_HWPOISON,
>> + MF_IGNORED);
>> goto unlock_mutex;
>> }
>>
>>
>
> It's a bit odd that we need this handling when we already have handling for
> reserved pages in error_states[].
>
> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
> __get_hwpoison_page() ... would always fail? Making
> get_hwpoison_page()->get_any_page() always fail?
>
> But then, we never call identify_page_state()? And never call me_kernel()?
>
> This all looks very odd.
>
> Why would you even want to call get_hwpoison_page() in the first place if you
> find PageReserved?
>
Ah, good point!
It seems to me that all unhandlable pages should head out to
identify_page_state:
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
+
+ if (!HWPoisonHandlable(p, flags))
+ goto identify_page_state;
+
res = get_hwpoison_page(p, flags);
if (!res) {
if (is_free_buddy_page(p)) {
thanks,
-jane
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Alexei Starovoitov @ 2026-05-12 19:27 UTC (permalink / raw)
To: Jiri Olsa
Cc: Masami Hiramatsu, Andrii Nakryiko, bpf, linux-trace-kernel,
Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <agNeEzjiThzmJHiP@krava>
On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> + /*
> + * We have nop10 (with first byte overwritten to int3),
> + * change it to:
> + * lea 0x80(%rsp), %rsp
> + * call tramp
> + *
> + * The first lea instruction skips the stack redzone so the call
> + * instruction can safely push return address on stack.
> + */
typo: lea -128(%rsp), %rsp
you can also do:
add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.
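The byte counts can be checked against the usual encodings. A small sketch, assuming the standard x86-64 encodings an assembler emits for these instructions (they are not taken from the patch, the call displacement is a zeroed placeholder, and the array and function names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* lea -128(%rsp), %rsp: REX.W 8D /r, ModRM=0x64, SIB=0x24, disp8=0x80 */
static const unsigned char lea_skip_redzone[] = {
	0x48, 0x8d, 0x64, 0x24, 0x80
};

/* add $-128, %rsp: REX.W 83 /0 ib, ModRM=0xc4, imm8=0x80 */
static const unsigned char add_skip_redzone[] = {
	0x48, 0x83, 0xc4, 0x80
};

/* call rel32: opcode E8 followed by a 32-bit displacement */
static const unsigned char call_rel32[] = {
	0xe8, 0x00, 0x00, 0x00, 0x00
};

/* Total length of a two-instruction sequence. */
static size_t seq_len(size_t first, size_t second)
{
	return first + second;
}
```

With these encodings the lea variant fills the 10-byte nop exactly, and the add variant comes out one byte shorter.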
Initially I didn't like this approach, since we just introduced
usdt nop5 and now need to recompile everything again,
but looking at the fix it's definitely simpler than alternatives
and doesn't have annoying limitations.
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Andrii Nakryiko @ 2026-05-12 19:38 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Jiri Olsa, Masami Hiramatsu, Andrii Nakryiko, bpf,
linux-trace-kernel, Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <CAADnVQLfgxEzpSTLhsxN2BYWpSxJ+RYku03UMfrSTi4Abu5SBw@mail.gmail.com>
On Tue, May 12, 2026 at 12:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > + /*
> > + * We have nop10 (with first byte overwritten to int3),
> > + * change it to:
> > + * lea 0x80(%rsp), %rsp
> > + * call tramp
> > + *
> > + * The first lea instruction skips the stack redzone so the call
> > + * instruction can safely push return address on stack.
> > + */
>
> typo: lea -128(%rsp), %rsp
>
> you can also do:
>
> add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.
When I asked AI about this it explained that the add instruction modifies
flags, so it's not a good fit here. lea doesn't touch flags.
>
> Initially I didn't like this approach, since we just introduced
> usdt nop5 and now need to recompile everything again,
> but looking at the fix it's definitely simpler than alternatives
> and doesn't have annoying limitations.
yeah, limitations are annoying, especially with those global "DO NOT
OPTIMIZE" flags... Jiri, let's polish your version and land it?
* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Steven Rostedt @ 2026-05-12 19:41 UTC (permalink / raw)
To: Chen Jun; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260508122623.74290-1-chenjun102@huawei.com>
On Fri, 8 May 2026 20:26:23 +0800
Chen Jun <chenjun102@huawei.com> wrote:
> Low-level functions have many call paths, and sometimes
> we only care about the calls on a specific call path.
> Add a new filter to filter based on the call stack.
>
> Usage:
> 1. echo 'caller=="$function_name"' > events/../filter
>
> Only support OP_EQ and OP_NE
Cute.
>
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
> include/linux/trace_events.h | 1 +
> kernel/trace/trace.h | 3 ++-
> kernel/trace/trace_events.c | 1 +
> kernel/trace/trace_events_filter.c | 40 ++++++++++++++++++++++++++++--
> 4 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 40a43a4c7caf..1f109669a391 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -851,6 +851,7 @@ enum {
> FILTER_COMM,
> FILTER_CPU,
> FILTER_STACKTRACE,
> + FILTER_CALLER,
> };
>
> extern int trace_event_raw_init(struct trace_event_call *call);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 80fe152af1dd..4e4b92ce264f 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
> field->filter_type == FILTER_RDYN_STRING ||
> field->filter_type == FILTER_STATIC_STRING ||
> field->filter_type == FILTER_PTR_STRING ||
> - field->filter_type == FILTER_COMM;
> + field->filter_type == FILTER_COMM ||
> + field->filter_type == FILTER_CALLER;
> }
>
> static inline bool is_function_field(struct ftrace_event_field *field)
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index c46e623e7e0d..6d220d7eec73 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -199,6 +199,7 @@ static int trace_define_generic_fields(void)
> __generic_field(char *, comm, FILTER_COMM);
> __generic_field(char *, stacktrace, FILTER_STACKTRACE);
> __generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
> + __generic_field(char *, caller, FILTER_CALLER);
>
> return ret;
> }
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 609325f57942..1cf040065abe 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -72,6 +72,7 @@ enum filter_pred_fn {
> FILTER_PRED_FN_CPUMASK,
> FILTER_PRED_FN_CPUMASK_CPU,
> FILTER_PRED_FN_FUNCTION,
> + FILTER_PRED_FN_CALLER,
> FILTER_PRED_FN_,
> FILTER_PRED_TEST_VISITED,
> };
> @@ -1009,6 +1010,21 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
> return pred->op == OP_EQ ? ret : !ret;
> }
>
> +/* Filter predicate for caller. */
> +static int filter_pred_caller(struct filter_pred *pred, void *event)
> +{
> + unsigned long entries[32];
Let's make that only 16 in size. Having 256 bytes added to the stack in
random places may cause an overflow. 128 bytes isn't as bad. Either that,
or we need to preallocate per-cpu memory and use that. But that makes the
patch much more complex. I'd rather just use 16 entries instead for now. If
we need more, then we can add the extra complexity.
Also, you need to update Documentation/trace/events.rst.
Thanks,
-- Steve
> + unsigned int nr_entries;
> + int i;
> +
> + nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
> + for (i = 0; i < nr_entries ; i++)
> + if (pred->val <= entries[i] && entries[i] < pred->val2)
> + return !pred->not;
> +
> + return pred->not;
> +}
> +
> /*
> * regex_match_foo - Basic regex callbacks
> *
> @@ -1617,6 +1633,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
> return filter_pred_cpumask_cpu(pred, event);
> case FILTER_PRED_FN_FUNCTION:
> return filter_pred_function(pred, event);
> + case FILTER_PRED_FN_CALLER:
> + return filter_pred_caller(pred, event);
> case FILTER_PRED_TEST_VISITED:
> return test_pred_visited_fn(pred, event);
> default:
> @@ -2002,10 +2020,28 @@ static int parse_pred(const char *str, void *data,
>
> } else if (field->filter_type == FILTER_DYN_STRING) {
> pred->fn_num = FILTER_PRED_FN_STRLOC;
> - } else if (field->filter_type == FILTER_RDYN_STRING)
> + } else if (field->filter_type == FILTER_RDYN_STRING) {
> pred->fn_num = FILTER_PRED_FN_STRRELLOC;
> - else {
> + } else if (field->filter_type == FILTER_CALLER) {
> + unsigned long caller;
> +
> + if (op == OP_GLOB)
> + goto err_free;
>
> + pred->fn_num = FILTER_PRED_FN_CALLER;
> + caller = kallsyms_lookup_name(pred->regex->pattern);
> + if (!caller) {
> + parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> + goto err_free;
> + }
> + /* Now find the function start and end address */
> + if (!kallsyms_lookup_size_offset(caller, &size, &offset)) {
> + parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> + goto err_free;
> + }
> + pred->val = caller - offset;
> + pred->val2 = pred->val + size;
> + } else {
> if (!ustring_per_cpu) {
> /* Once allocated, keep it around for good */
> ustring_per_cpu = alloc_percpu(struct ustring_buffer);
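The address-range test at the heart of the proposed predicate can be tried out in userspace. This is a sketch under assumptions: stack_has_caller() and caller_filter() are hypothetical names, the stack entries are passed in rather than captured with stack_trace_save(), and the 16-entry cap follows the size suggested in the review above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16	/* the smaller on-stack size suggested in the review */

/*
 * Return 1 if any of the first nr return addresses lies inside
 * [func_start, func_end), i.e. the function is somewhere on the stack.
 */
static int stack_has_caller(const unsigned long *entries, size_t nr,
			    unsigned long func_start, unsigned long func_end)
{
	size_t i;

	if (nr > MAX_ENTRIES)
		nr = MAX_ENTRIES;

	for (i = 0; i < nr; i++)
		if (func_start <= entries[i] && entries[i] < func_end)
			return 1;
	return 0;
}

/* Apply the == / != operator on top of the match result. */
static int caller_filter(const unsigned long *entries, size_t nr,
			 unsigned long func_start, unsigned long func_size,
			 int op_is_eq)
{
	int match = stack_has_caller(entries, nr, func_start,
				     func_start + func_size);

	return op_is_eq ? match : !match;
}
```

The end of the range is exclusive, so a return address equal to start + size belongs to the next symbol and does not match.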
* Re: [PATCH RFC v5 10/53] KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-05-12 22:30 UTC (permalink / raw)
To: Liam R. Howlett
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <1DAB05E2-7F30-45D7-B155-B66C59D31AFF@infradead.org>
"Liam R. Howlett" <liam@infradead.org> writes:
>
> [...snip...]
>
>>
>>The invariant in this maple tree is that contiguous ranges with the same
>>attribute are stored as a single range.
>>
>>The goal of this first part is to get the entry at the index just after
>>the requested range, and see what the attribute there is. If that
>>attribute is what we're about to set, extend the requested range for
>>storing to the end of that range.
>>
>>If there is another range higher than end + 1, with the invariant
>>maintained, that attribute has to be different than the attribute stored
>>at end. Hence, we only want to extend this requested range up till end.
>>
>
> mas_find() will look for an entry at the given address for the first search, and if it is not found it will continue to search upwards. Since you limit the search to end, it will work as you want and there isn't a bug as I was thinking in my sleep deprived state.
>
> Since you are searching for exactly one address (end), it might serve you better to walk there. Maybe walking is a better API for what you are doing here?
>
Thanks again for this tip! I'll try the walk API in the next revision
after v6 [1]
[1] https://lore.kernel.org/all/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com/T/
>
>>> Do you have testing of these functions somewhere?
>>>
>>
>>GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4) tests setting
>>attributes in ranges. If test_page is 2,
>>
>>1. [0, 4) starts off shared (4 is the number of pages in the guest_memfd)
>>2. [2, 3) is converted to private
>> => so the ranges should now be [0, 2), [2, 3), [3, 4)
>>3. [2, 3) is converted back to shared
>> => so the ranges should now be [0, 4)
>>
>>I verified this by inserting some trace_printk()s and inspecting manually.
>>
>
> Thanks. I find the exclusive ranges a bit odd to think about in the maple tree context, but this test case makes sense. This is especially odd to look at a single index entry, at least for me.
>
> I generally have a set of test cases and append any bug reproduces to that list so they are unlikely to reoccur. My testing is certainly different from what you'll be doing, but this method has done well with the quality of code improving over time, and limited (if any) regressions.
>
I've not worked directly with the maple tree tests but the xarray tests
(similarly set up, I believe) are a joy to work with.
> I actually insist that any fix has a test before I accept them. There are two reasons for this: 1. Avoiding the regression. 2. People really understand the bug if they can create a reproducer.
>
> I hope this helps.
>
>
The maple tree tests are set up to directly test maple tree code, but
KVM selftests test from the userspace interface, and it's hard to test
this invariant from userspace.
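One way to reason about the invariant outside the kernel is a small userspace model. The sketch below is an assumption-laden stand-in: it uses a flat sorted array instead of a maple tree, none of the names (struct attr_range, set_attr()) exist in the series, and it only mirrors the "absorb equal-attribute neighbours" idea described above.

```c
#include <assert.h>

#define MAX_RANGES 16

/* One stored range [start, end) carrying a single attribute value. */
struct attr_range {
	unsigned long start;
	unsigned long end;
	unsigned long attr;
};

static struct attr_range ranges[MAX_RANGES];
static int nr_ranges;

/*
 * Set attr on [start, end). Ranges stay sorted and non-overlapping, and
 * touching ranges with the same attribute are absorbed, so contiguous
 * equal ranges are always stored as one.
 */
static void set_attr(unsigned long start, unsigned long end,
		     unsigned long attr)
{
	struct attr_range out[MAX_RANGES + 2];
	int n = 0;
	int i;

	/* Grow the new range over touching or overlapping equal ranges. */
	for (i = 0; i < nr_ranges; i++) {
		const struct attr_range *r = &ranges[i];

		if (r->attr != attr || r->end < start || r->start > end)
			continue;
		if (r->start < start)
			start = r->start;
		if (r->end > end)
			end = r->end;
	}

	/* Keep whatever lies to the left of the new range. */
	for (i = 0; i < nr_ranges; i++) {
		const struct attr_range *r = &ranges[i];

		if (r->start < start) {
			out[n].start = r->start;
			out[n].end = r->end < start ? r->end : start;
			out[n].attr = r->attr;
			n++;
		}
	}

	out[n].start = start;
	out[n].end = end;
	out[n].attr = attr;
	n++;

	/* Keep whatever lies to the right of the new range. */
	for (i = 0; i < nr_ranges; i++) {
		const struct attr_range *r = &ranges[i];

		if (r->end > end) {
			out[n].start = r->start > end ? r->start : end;
			out[n].end = r->end;
			out[n].attr = r->attr;
			n++;
		}
	}

	assert(n <= MAX_RANGES);
	for (i = 0; i < n; i++)
		ranges[i] = out[i];
	nr_ranges = n;
}
```

Replaying the indexing test (4 pages, test_page 2) against this model gives one range, then three, then one again.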
>>>> + if (entry && xa_to_value(entry) == attributes)
>>>> + last = mas->last;
>>>> +
>>>> + if (start > 0) {
>>>> + mas_set_range(mas, start - 1, start - 1);
>>>> + entry = mas_find(mas, start - 1);
>>>> + if (entry && xa_to_value(entry) == attributes)
>>>> + start = mas->index;
>>>> + }
>>>> +
>>>> + mas_set_range(mas, start, last);
>>>> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>>>> +}
>>>> +
>>>>
>>>> [...snip...]
>>>>
* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13 0:45 UTC (permalink / raw)
To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>
Hi qiwu.chen,
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base: linus/master
patch link: https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: sh-defconfig (https://download.01.org/0day-ci/archive/20260513/202605130842.zWTTtyaL-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605130842.zWTTtyaL-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605130842.zWTTtyaL-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from include/trace/define_trace.h:132,
from include/trace/events/vmscan.h:602,
from mm/vmscan.c:72:
include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:358:19: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
358 | TP_printk("folio=%p lru=%s",
| ^~~~~~~~~~~~~~~~~
include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
219 | trace_event_printf(iter, print); \
| ^~~~~
include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
45 | PARAMS(print)); \
| ^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
358 | TP_printk("folio=%p lru=%s",
| ^~~~~~~~~
In file included from include/trace/trace_events.h:256:
include/trace/events/vmscan.h:358:27: note: format string is defined here
358 | TP_printk("folio=%p lru=%s",
| ~^
| |
| void *
| %ld
include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
354 | __entry->folio = folio;
| ^
include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
427 | { assign; } \
| ^~~~~~
include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
435 | PARAMS(assign), PARAMS(print)) \
| ^~~~~~
include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
40 | DECLARE_EVENT_CLASS(name, \
| ^~~~~~~~~~~~~~~~~~~
include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
44 | PARAMS(assign), \
| ^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
353 | TP_fast_assign(
| ^~~~~~~~~~~~~~
In file included from include/trace/define_trace.h:133:
include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
354 | __entry->folio = folio;
| ^
include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
51 | { assign; } \
| ^~~~~~
include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
67 | PARAMS(assign), PARAMS(print)) \
| ^~~~~~
include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
40 | DECLARE_EVENT_CLASS(name, \
| ^~~~~~~~~~~~~~~~~~~
include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
44 | PARAMS(assign), \
| ^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
353 | TP_fast_assign(
| ^~~~~~~~~~~~~~
vim +358 include/trace/events/vmscan.h
343
344 TP_PROTO(struct folio *folio),
345
346 TP_ARGS(folio),
347
348 TP_STRUCT__entry(
349 __field(unsigned long, folio)
350 __field(int, lru)
351 ),
352
353 TP_fast_assign(
354 __entry->folio = folio;
355 __entry->lru = folio_lru_list(folio);
356 ),
357
> 358 TP_printk("folio=%p lru=%s",
359 __entry->folio,
360 __print_symbolic(__entry->lru, LRU_NAMES))
361 );
362
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13 1:19 UTC (permalink / raw)
To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>
Hi qiwu.chen,
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base: linus/master
patch link: https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: xtensa-randconfig-001-20260513 (https://download.01.org/0day-ci/archive/20260513/202605130942.9wJFWm9M-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605130942.9wJFWm9M-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605130942.9wJFWm9M-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from include/trace/define_trace.h:132,
from include/trace/events/vmscan.h:602,
from mm/vmscan.c:72:
include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
include/trace/events/vmscan.h:358:12: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
TP_printk("folio=%p lru=%s",
^~~~~~~~~~~~~~~~~
include/trace/trace_events.h:219:27: note: in definition of macro 'DECLARE_EVENT_CLASS'
trace_event_printf(iter, print); \
^~~~~
include/trace/trace_events.h:45:9: note: in expansion of macro 'PARAMS'
PARAMS(print)); \
^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
TRACE_EVENT(mm_vmscan_write_folio,
^~~~~~~~~~~
include/trace/events/vmscan.h:358:2: note: in expansion of macro 'TP_printk'
TP_printk("folio=%p lru=%s",
^~~~~~~~~
In file included from include/trace/trace_events.h:256,
from include/trace/define_trace.h:132,
from include/trace/events/vmscan.h:602,
from mm/vmscan.c:72:
include/trace/events/vmscan.h:358:20: note: format string is defined here
TP_printk("folio=%p lru=%s",
~^
%ld
In file included from include/trace/define_trace.h:132,
from include/trace/events/vmscan.h:602,
from mm/vmscan.c:72:
include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:18: warning: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
__entry->folio = folio;
^
include/trace/trace_events.h:427:4: note: in definition of macro '__DECLARE_EVENT_CLASS'
{ assign; } \
^~~~~~
include/trace/trace_events.h:435:9: note: in expansion of macro 'PARAMS'
PARAMS(assign), PARAMS(print)) \
^~~~~~
include/trace/trace_events.h:40:2: note: in expansion of macro 'DECLARE_EVENT_CLASS'
DECLARE_EVENT_CLASS(name, \
^~~~~~~~~~~~~~~~~~~
include/trace/trace_events.h:44:9: note: in expansion of macro 'PARAMS'
PARAMS(assign), \
^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
TRACE_EVENT(mm_vmscan_write_folio,
^~~~~~~~~~~
include/trace/events/vmscan.h:353:2: note: in expansion of macro 'TP_fast_assign'
TP_fast_assign(
^~~~~~~~~~~~~~
vim +354 include/trace/events/vmscan.h
343
344 TP_PROTO(struct folio *folio),
345
346 TP_ARGS(folio),
347
348 TP_STRUCT__entry(
349 __field(unsigned long, folio)
350 __field(int, lru)
351 ),
352
353 TP_fast_assign(
> 354 __entry->folio = folio;
355 __entry->lru = folio_lru_list(folio);
356 ),
357
358 TP_printk("folio=%p lru=%s",
359 __entry->folio,
360 __print_symbolic(__entry->lru, LRU_NAMES))
361 );
362
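For illustration only, a userspace C sketch of the two type-consistent ways both diagnostics above could be resolved: either keep the field as `unsigned long`, cast explicitly on assignment and print with `%lx`, or declare the field as a pointer and keep `%p`. The struct and function names are hypothetical and this is not the patch author's fix; it also assumes an LP64-style ABI where `unsigned long` can hold a pointer, as in the kernel. Note that the kernel's `%p` prints a hashed value by default, which may influence the choice.

```c
#include <assert.h>
#include <stdio.h>

struct folio { int dummy; };	/* stand-in for the kernel's struct folio */

/* Option A: integer field, explicit cast on assignment, %lx in the format. */
struct entry_ulong { unsigned long folio; };

/* Option B: pointer field, plain assignment, %p in the format. */
struct entry_ptr { struct folio *folio; };

static int format_as_ulong(char *buf, size_t len, struct folio *folio)
{
	struct entry_ulong e;

	assert(buf && len > 0);
	e.folio = (unsigned long)folio;	/* no -Wint-conversion with the cast */
	return snprintf(buf, len, "folio=%lx", e.folio);
}

static int format_as_ptr(char *buf, size_t len, struct folio *folio)
{
	struct entry_ptr e;

	assert(buf && len > 0);
	e.folio = folio;		/* types already match */
	return snprintf(buf, len, "folio=%p", (void *)e.folio);
}
```

In both variants the field type, the assignment and the format specifier agree, which is what -Wformat and -Wint-conversion are checking for.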
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13 3:04 UTC (permalink / raw)
To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>
Hi qiwu.chen,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base: linus/master
patch link: https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: sh-defconfig (https://download.01.org/0day-ci/archive/20260513/202605131057.E7FZbuAc-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605131057.E7FZbuAc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605131057.E7FZbuAc-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/trace/define_trace.h:132,
from include/trace/events/vmscan.h:602,
from mm/vmscan.c:72:
include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
include/trace/events/vmscan.h:358:19: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
358 | TP_printk("folio=%p lru=%s",
| ^~~~~~~~~~~~~~~~~
include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
219 | trace_event_printf(iter, print); \
| ^~~~~
include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
45 | PARAMS(print)); \
| ^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
358 | TP_printk("folio=%p lru=%s",
| ^~~~~~~~~
In file included from include/trace/trace_events.h:256:
include/trace/events/vmscan.h:358:27: note: format string is defined here
358 | TP_printk("folio=%p lru=%s",
| ~^
| |
| void *
| %ld
include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
354 | __entry->folio = folio;
| ^
include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
427 | { assign; } \
| ^~~~~~
include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
435 | PARAMS(assign), PARAMS(print)) \
| ^~~~~~
include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
40 | DECLARE_EVENT_CLASS(name, \
| ^~~~~~~~~~~~~~~~~~~
include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
44 | PARAMS(assign), \
| ^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
353 | TP_fast_assign(
| ^~~~~~~~~~~~~~
In file included from include/trace/define_trace.h:133:
include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
354 | __entry->folio = folio;
| ^
include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
51 | { assign; } \
| ^~~~~~
include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
67 | PARAMS(assign), PARAMS(print)) \
| ^~~~~~
include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
40 | DECLARE_EVENT_CLASS(name, \
| ^~~~~~~~~~~~~~~~~~~
include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
44 | PARAMS(assign), \
| ^~~~~~
include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
342 | TRACE_EVENT(mm_vmscan_write_folio,
| ^~~~~~~~~~~
include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
353 | TP_fast_assign(
| ^~~~~~~~~~~~~~
vim +354 include/trace/events/vmscan.h
343
344 TP_PROTO(struct folio *folio),
345
346 TP_ARGS(folio),
347
348 TP_STRUCT__entry(
349 __field(unsigned long, folio)
350 __field(int, lru)
351 ),
352
353 TP_fast_assign(
> 354 __entry->folio = folio;
355 __entry->lru = folio_lru_list(folio);
356 ),
357
358 TP_printk("folio=%p lru=%s",
359 __entry->folio,
360 __print_symbolic(__entry->lru, LRU_NAMES))
361 );
362
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Wen Yang @ 2026-05-13 5:32 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8e80cbcf739304de95356f1fac677261628977fa.camel@redhat.com>
On 5/12/26 17:09, Gabriele Monaco wrote:
> On Tue, 2026-05-12 at 10:27 +0200, Gabriele Monaco wrote:
>> On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
>>> From: Wen Yang <wen.yang@linux.dev>
>>>
>>> The following two paths race:
>>>
>>> CPU 0 (disable_stall/__rv_disable_monitor) CPU 1 (wwnr probe handler)
>> ^ did you mean stall?
>
> Ok I got it now, so essentially you'd reproduce it like:
>
> * start a DA per-task monitor (no timer)
> * stop it, a handler is still running after reset, it sets monitoring back to 1
> * start an HA per-task monitor
>
> that would use the same slot that is now looking like:
>
> { monitoring = 1, timer.function = NULL }
>
> because it was not initialised as HA but monitoring was reset in the race.
>
> Thinking about this again, it isn't just an issue with per-task monitors, all
> monitors reusing slots would suffer from it.
> Besides, relying on monitoring can be fragile when using LTL monitors on the
> same task (those don't even have monitoring).
>
> Perhaps the solution isn't that trivial, I'm going to give one more thought on
> it, but thanks again for bringing this up!
>
> Gabriele
>
>>> ------------------------------------------ -----------------------------
>>> disable_stall()
>>> da_monitor_destroy()
>>> da_monitor_reset_all() <------ [task T: monitoring=0]
>>> da_monitor_start(&T->rv[n])
>>> /* no timer_setup */
>>> monitoring=1 <----
>>> tracepoint_synchronize_unregister()
>>> // CPU 1 probe has already returned; sync returns
>>>
>>> Later, enable_stall() acquires the same slot and calls da_monitor_init():
>>>
>>> da_monitor_reset_all()
>>> da_monitor_reset(&T->rv[slot]) // monitoring=1, timer.function==0
>>> ha_monitor_reset_env()
>>> ha_cancel_timer()
>>> timer_delete(&ha_mon->timer) // ODEBUG: timer never initialised
>>>
>>> ODEBUG: assert_init not available (active state 0)
>>> object type: timer_list
>>> Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
>>>
>>> Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
>>> before da_monitor_reset_all(). The unregister_trace_xxx() calls in the
>>> monitor's disable() have already disconnected the tracepoints; the sync
>>> here drains any handler still in flight, so no new monitoring=1 can
>>> appear after da_monitor_reset_all() clears the slot.
>>>
>>> Also fix the slot release ordering: release the slot only after
>>> reset_all() to avoid accessing rv[] with an out-of-bounds index.
>>>
>>> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
>>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>>> ---
>>
>> Thanks for the fix, I have a similar one waiting for submission.
>>
>> These are technically 2 separate fixes though: the ordering with unset
>> task_mon_slot (independent on HA) and the synchronisation with pending
>> tracepoints. They probably deserve separate patches and visibility, the first
>> has always been around and we're technically overwriting who knows what.
>>
>>
>> The explanation above is a bit hard to follow though, are you talking about a
>> handler for the same (stall) monitor running after the reset, effectively
>> undoing it by setting the monitoring flag?
>>
>> Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
>> environment.
>>
>> So that's basically what you'd see now much more often because in fact we don't
>> reset the right slot (though, again, that's a different issue).
>>
>>
>> Calling tracepoint_synchronize_unregister() there too would surely fix, but it
>> used to be kinda slow. But it's probably gotten faster since now tracepoints use
>> SRCU, so we can wait for a dedicated grace period.
>>
>> I liked the idea to wait cumulatively in the end, but that's just making things
>> harder.. Let's do like this:
>>
>> Prepare 2 separate patches as fixes, put the task slot one first (would ease
>> backporting), mention this issue with the race condition only in the second.
>> You can send them independently and I'll add them to the tree as urgent.
>>
>>
>> I'm soon going to send my set of fixes that will also include the task slot
>> patch (not removing to ease my life with conflicts).
>>
Hi Gabriele,
Thanks for both messages. Two patches are ready; let me address
your follow-up concerns before sending.
1. "all monitors reusing slots would suffer from it"
Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
pool. RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
storage (a single static variable and a per-cpu variable) and
never share slots across monitor types. The race is exclusive
to PER_TASK, so fixing that variant's da_monitor_destroy() is
the correct scope.
2. "LTL monitors don't even have monitoring"
tracepoint_synchronize_unregister() does not rely on the
monitoring flag at all. It is a system-wide barrier — it
calls synchronize_rcu_tasks_trace() followed by
synchronize_srcu(&tracepoint_srcu) — draining every in-flight
tracepoint handler on every CPU regardless of which monitor
dispatched it. LTL handlers are covered without any special
treatment.
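To make the ordering argument concrete, here is a rough userspace sketch of the interleaving the barrier is meant to close. It is not kernel code: `struct slot`, `da_handler_start()` and the other names are made up for illustration, and the "drain" is modelled simply by running the late handler before the reset instead of after it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one per-task monitor slot. */
struct slot {
	bool monitoring;
	void (*timer_fn)(void);	/* NULL: timer was never set up */
};

/* A plain DA handler starts monitoring without any timer setup. */
static void da_handler_start(struct slot *s)
{
	s->monitoring = true;
}

static void slot_reset(struct slot *s)
{
	s->monitoring = false;
}

/* An HA-style reset would delete the timer whenever monitoring is set. */
static bool ha_reset_is_safe(const struct slot *s)
{
	return !(s->monitoring && s->timer_fn == NULL);
}

/* No drain: an in-flight handler finishes after the reset and undoes it. */
static bool destroy_without_drain(struct slot *s)
{
	slot_reset(s);
	da_handler_start(s);
	return ha_reset_is_safe(s);
}

/* Drain first: every handler has finished before the slot is reset. */
static bool destroy_with_drain(struct slot *s)
{
	da_handler_start(s);
	slot_reset(s);
	return ha_reset_is_safe(s);
}
```

The first variant leaves the slot as { monitoring = 1, timer.function = NULL }, which is the state described earlier in the thread; the second leaves it clean for whichever monitor takes the slot next.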
The slot-ordering issue (patch 1) affects all per-task DA monitors,
not only HA ones — "independent on HA" — because
RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
past the end of rv[]), so da_monitor_reset_all() overwrites whatever
follows rv[] in task_struct whenever any per-task monitor is
disabled.
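The slot-ordering problem can be sketched in userspace as well. This is a simplified model under stated assumptions, not the kernel implementation: `NR_SLOTS` stands in for CONFIG_RV_PER_TASK_MONITORS, `SLOT_UNSET` for RV_PER_TASK_MONITOR_INIT, and the bounds check in `reset_slot()` exists only so the sketch can report the bad index instead of writing past the array.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS	2		/* stand-in for CONFIG_RV_PER_TASK_MONITORS */
#define SLOT_UNSET	NR_SLOTS	/* stand-in for RV_PER_TASK_MONITOR_INIT */

struct mon { int monitoring; };

struct task {
	struct mon rv[NR_SLOTS];
	int canary;		/* whatever follows rv[] in the real struct */
};

static size_t task_mon_slot = SLOT_UNSET;

/* Reset the monitor in the current slot; -1 if the index is out of range. */
static int reset_slot(struct task *t)
{
	if (task_mon_slot >= NR_SLOTS)
		return -1;	/* hypothetical guard, added for the sketch */
	t->rv[task_mon_slot].monitoring = 0;
	return 0;
}

/* Old ordering: the slot is released before the reset uses it. */
static int destroy_release_first(struct task *t)
{
	task_mon_slot = SLOT_UNSET;
	return reset_slot(t);
}

/* Fixed ordering: reset while the index is still valid, then release. */
static int destroy_reset_first(struct task *t)
{
	int ret = reset_slot(t);

	task_mon_slot = SLOT_UNSET;
	return ret;
}
```

With the old ordering the reset runs with index NR_SLOTS, one past the end of rv[], and never touches the monitor that was actually in use; with the fixed ordering it clears the right entry and leaves the neighbouring field alone.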
Also corrected "wwnr probe handler" to "stall probe handler" in
patch 2 per your annotation.
Please let me know if the above reasoning addresses your concerns.
--
Best wishes,
Wen
>>
>>> include/rv/da_monitor.h | 18 ++++++++++++++++--
>>> 1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
>>> index 00ded3d5ab3f..d04bb3229c75 100644
>>> --- a/include/rv/da_monitor.h
>>> +++ b/include/rv/da_monitor.h
>>> @@ -304,6 +304,20 @@ static int da_monitor_init(void)
>>>
>>> /*
>>> * da_monitor_destroy - return the allocated slot
>>> + *
>>> + * Call tracepoint_synchronize_unregister() before reset_all() to close
>>> + * the race where an in-flight non-HA probe handler sets monitoring=1
>>> + * (without calling timer_setup()) after da_monitor_reset_all() has
>>> + * already cleared the slot but before the caller's own sync completes.
>>> + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
>>> + * the same slot would call timer_delete() on a never-initialised
>>> + * timer_list, triggering ODEBUG warnings.
>>> + *
>>> + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
>>> + * that waits for all CPUs to finish any in-flight tracepoint handlers.
>>> + * The caller's own __rv_disable_monitor() issues a second sync after
>>> + * returning from disable(); that redundant call is harmless on the
>>> + * infrequent admin (enable/disable) path.
>>> */
>>> static inline void da_monitor_destroy(void)
>>> {
>>> @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
>>> WARN_ONCE(1, "Disabling a disabled monitor: "
>>> __stringify(MONITOR_NAME));
>>> return;
>>> }
>>> + tracepoint_synchronize_unregister();
>>> + da_monitor_reset_all();
>>> rv_put_task_monitor_slot(task_mon_slot);
>>> task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>>> -
>>> - da_monitor_reset_all();
>>> }
>>>
>>> #elif RV_MON_TYPE == RV_MON_PER_OBJ
>
* Re: [RFC PATCH v2 10/10] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-05-13 7:46 UTC (permalink / raw)
To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8148267505ef90175b6b69e1ffb3aa560ff42d35.1778522945.git.wen.yang@linux.dev>
On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add selftest coverage for the tlob RV monitor in
> tools/testing/selftests/verification/.
>
> Two helper binaries are built by tlob/Makefile: tlob_helper for the
> ioctl interface (/dev/rv) and tlob_uprobe_target for the uprobe tests.
> The top-level Makefile delegates to tlob/ via a generic MONITOR_SUBDIRS
> pattern so monitor-specific build details stay within each monitor's
> own subdirectory.
>
> Eight test files cover the tracefs control interface (tracefs.tc), the
> ioctl self-instrumentation interface (ioctl.tc, 8 scenarios), and the
> uprobe external monitoring interface (uprobe_bind.tc, uprobe_violation.tc,
> uprobe_no_event.tc, uprobe_multi.tc, uprobe_detail_sleeping.tc,
> uprobe_detail_waiting.tc).
Thanks for the deep test suite!
I ran it in a VM (virtme-ng on my 16-core x86 Fedora box) and it hangs at
step 9 (step 8 passes, and right after it I get an RCU splat):
$ sudo vng -v -- make -C tools/testing/selftests/verification run_tests
...
# ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
# ok 6 Test tlob monitor tracefs interface (enable/disable and files)
# ok 7 Test uprobe binding (visible in monitor file, removable, duplicate rejected)
# ok 8 Test uprobe detail sleeping (sleeping_ns dominates when task blocks between probes)
[ 53.989561] tlob_target (1756) used greatest stack depth: 11792 bytes left
[ 75.100818] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 75.100825] rcu: 0-...!: (26082 ticks this GP) idle=a8e4/1/0x4000000000000000 softirq=0/0 fqs=13 rcuc=26078 jiffies(starved)
[ 75.100833] rcu: (t=26000 jiffies g=17333 q=146 ncpus=16)
[ 75.100836] rcu: rcu_preempt kthread timer wakeup didn't happen for 24040 jiffies! g17333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 75.100839] rcu: Possible timer handling issue on cpu=7 timer-softirq=317
[ 75.100840] rcu: rcu_preempt kthread starved for 24043 jiffies! g17333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=7
[ 75.100843] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 75.100843] rcu: RCU grace-period kthread stack dump:
[ 75.100845] task:rcu_preempt state:I stack:14104 pid:17 tgid:17 ppid:2 task_flags:0x208040 flags:0x00080000
[ 75.100856] Call Trace:
[ 75.100859] <TASK>
[ 75.100870] __schedule+0x4f1/0x1490
[ 75.100890] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 75.100898] schedule+0x5b/0x210
[ 75.100901] ? schedule_timeout+0xae/0x130
[ 75.100905] schedule_timeout+0xae/0x130
[ 75.100911] ? __pfx_process_timeout+0x10/0x10
[ 75.100925] rcu_gp_fqs_loop+0x114/0x880
[ 75.100933] ? lock_release+0x2ea/0x4a0
[ 75.100945] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 75.100948] rcu_gp_kthread+0x26b/0x320
[ 75.100951] ? preempt_count_sub+0x5f/0x80
[ 75.100963] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 75.100966] kthread+0xf3/0x130
[ 75.100970] ? __pfx_kthread+0x10/0x10
[ 75.100978] ret_from_fork+0x3b4/0x420
[ 75.100984] ? __pfx_kthread+0x10/0x10
[ 75.100989] ret_from_fork_asm+0x1a/0x30
[ 75.101018] </TASK>
[ 75.101019] rcu: Stack dump where RCU GP kthread last ran:
[ 75.101021] Sending NMI from CPU 0 to CPUs 7:
[ 75.101106] NMI backtrace for cpu 7
[ 75.101118] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 7.1.0-rc2+ #160 PREEMPT_{RT,(lazy)}
[ 75.101124] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 75.101128] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 75.101139] Code: 75 70 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 25 6e 1c 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 75.101142] RSP: 0018:ffffd22ec0103eb8 EFLAGS: 00000296
[ 75.101147] RAX: 00000000000529f3 RBX: 0000000000000000 RCX: ffffffff8ca56131
[ 75.101170] RDX: ffff8de4c185c280 RSI: 0000000000000000 RDI: ffffffff8ca56131
[ 75.101172] RBP: ffff8de4c185c280 R08: 0000000000000000 R09: 0000000000000000
[ 75.101174] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000007
[ 75.101176] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 75.101373] FS: 0000000000000000(0000) GS:ffff8de56a091000(0000) knlGS:0000000000000000
[ 75.101379] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 75.101381] CR2: 00007fb886a53f98 CR3: 000000003be5c002 CR4: 0000000000770ef0
[ 75.101383] PKRU: 55555554
[ 75.101384] Call Trace:
[ 75.101388] <TASK>
[ 75.101389] default_idle+0x9/0x10
[ 75.101397] default_idle_call+0x85/0x240
[ 75.101404] do_idle+0x291/0x300
[ 75.101412] ? schedule_idle+0x22/0x40
[ 75.101415] cpu_startup_entry+0x29/0x30
[ 75.101418] start_secondary+0xf8/0x100
[ 75.101424] common_startup_64+0x12c/0x138
[ 75.101435] </TASK>
[ 75.102036] CPU: 0 UID: 0 PID: 1758 Comm: sh Not tainted 7.1.0-rc2+ #160 PREEMPT_{RT,(lazy)}
[ 75.102040] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 75.102042] RIP: 0033:0x556458604e3f
[ 75.102049] Code: 3c 18 4e 8d 04 3f 42 c6 04 21 00 0f b6 01 4c 89 7d b0 4c 89 c3 e9 bf ed ff ff 90 41 0f b6 c1 48 8d 15 c5 3f 11 00 80 3c 02 00 <0f> 84 a9 f0 ff ff 48 8b 45 80 f6 40 08 50 0f 85 9b f0 ff ff e9 78
[ 75.102051] RSP: 002b:00007ffc7ac46e30 EFLAGS: 00000246
[ 75.102054] RAX: 0000000000000074 RBX: 0000000000000074 RCX: 000055646adb8a60
[ 75.102056] RDX: 0000556458718e00 RSI: 0000000000000018 RDI: 0000000000000000
[ 75.102057] RBP: 00007ffc7ac46f20 R08: 000055646adc3100 R09: 0000000000000074
[ 75.102058] R10: 0000000000000021 R11: 0000000000000001 R12: 0000000000000000
[ 75.102059] R13: 0000000000000070 R14: 000055646adb9cf0 R15: 0000000000000000
[ 75.102061] FS: 00007f832822b740 GS: 0000000000000000
Did you see that? Am I doing something wrong?
Thanks,
Gabriele
>
> Tested on x86_64 with vng (virtme-ng):
>
> TAP version 13
> 1..12
> ok 1 Test monitor enable/disable
> ok 2 Test monitor reactor setting
> ok 3 Check available monitors
> ok 4 Test wwnr monitor with printk reactor
> ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
> ok 6 Test tlob monitor tracefs interface (enable/disable and files)
> ok 7 uprobe binding: visible in monitor file, removable, duplicate offset rejected
> ok 8 uprobe detail sleeping: sleeping_ns dominates when task blocks between probes
> ok 9 uprobe detail waiting: waiting_ns dominates when task is preempted between probes
> ok 10 Two bindings on same binary with different offsets and budgets fire independently
> ok 11 Verify no spurious error_env_tlob events without an active uprobe binding
> ok 12 uprobe violation: error_env_tlob and detail_env_tlob fire with correct fields
> # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
> tools/testing/selftests/verification/Makefile | 21 +-
> .../verification/test.d/tlob/ioctl.tc | 36 +
> .../verification/test.d/tlob/tracefs.tc | 17 +
> .../verification/test.d/tlob/uprobe_bind.tc | 34 +
> .../test.d/tlob/uprobe_detail_sleeping.tc | 47 ++
> .../test.d/tlob/uprobe_detail_waiting.tc | 60 ++
> .../verification/test.d/tlob/uprobe_multi.tc | 60 ++
> .../test.d/tlob/uprobe_no_event.tc | 19 +
> .../test.d/tlob/uprobe_violation.tc | 60 ++
> .../selftests/verification/tlob/Makefile | 21 +
> .../selftests/verification/tlob/tlob_ioctl.c | 626 ++++++++++++++++++
> .../selftests/verification/tlob/tlob_target.c | 138 ++++
> 12 files changed, 1138 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> create mode 100644 tools/testing/selftests/verification/tlob/Makefile
> create mode 100644 tools/testing/selftests/verification/tlob/tlob_ioctl.c
> create mode 100644 tools/testing/selftests/verification/tlob/tlob_target.c
>
> diff --git a/tools/testing/selftests/verification/Makefile b/tools/testing/selftests/verification/Makefile
> index aa8790c22a71..b5584fd3762d 100644
> --- a/tools/testing/selftests/verification/Makefile
> +++ b/tools/testing/selftests/verification/Makefile
> @@ -1,8 +1,27 @@
> # SPDX-License-Identifier: GPL-2.0
> -all:
>
> TEST_PROGS := verificationtest-ktap
> TEST_FILES := test.d settings
> EXTRA_CLEAN := $(OUTPUT)/logs/*
>
> +# Subdirectories that provide helper binaries for the test runner.
> +# Each entry must contain a Makefile that accepts OUTDIR= and deposits
> +# its binaries there; verificationtest-ktap adds OUTDIR to PATH so
> +# the ftracetest require-checks resolve the binaries by name.
> +MONITOR_SUBDIRS := tlob
> +
> include ../lib.mk
> +
> +# Build and clean each monitor subdirectory.
> +all: $(patsubst %,_build_%,$(MONITOR_SUBDIRS))
> +
> +clean: $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +.PHONY: $(patsubst %,_build_%,$(MONITOR_SUBDIRS)) \
> + $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +$(patsubst %,_build_%,$(MONITOR_SUBDIRS)): _build_%:
> + $(MAKE) -C $* OUTDIR="$(OUTPUT)" TOOLS_INCLUDES="$(TOOLS_INCLUDES)"
> +
> +$(patsubst %,_clean_%,$(MONITOR_SUBDIRS)): _clean_%:
> + $(MAKE) -C $* OUTDIR="$(OUTPUT)" clean
> diff --git a/tools/testing/selftests/verification/test.d/tlob/ioctl.tc b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> new file mode 100644
> index 000000000000..54ae249af9a6
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> @@ -0,0 +1,36 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob ioctl self-instrumentation (within/over-budget, error paths)
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +
> +[ -c /dev/rv ] || exit_unsupported
> +
> +echo 1 > monitors/tlob/enable
> +
> +# within budget: 50 ms threshold, 10 ms workload
> +"$TLOB_HELPER" within_budget
> +
> +# over budget in running state: 1 ms threshold, 100 ms busy-spin
> +"$TLOB_HELPER" over_budget_running
> +
> +# over budget in sleeping state: 3 ms threshold, 50 ms sleep
> +"$TLOB_HELPER" over_budget_sleeping
> +
> +# over budget in waiting state: 1 us threshold, sched_yield
> +"$TLOB_HELPER" over_budget_waiting
> +
> +# error paths
> +"$TLOB_HELPER" double_start
> +"$TLOB_HELPER" stop_no_start
> +
> +# per-thread isolation
> +"$TLOB_HELPER" multi_thread
> +
> +# bind against disabled monitor must return ENODEV, not crash
> +echo 0 > monitors/tlob/enable
> +"$TLOB_HELPER" not_enabled
> +echo 1 > monitors/tlob/enable
> +
> +echo 0 > monitors/tlob/enable
> diff --git a/tools/testing/selftests/verification/test.d/tlob/tracefs.tc b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> new file mode 100644
> index 000000000000..5d1e7cc02498
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> @@ -0,0 +1,17 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob monitor tracefs interface (enable/disable and files)
> +# requires: tlob:monitor
> +
> +check_requires monitors/tlob/enable monitors/tlob/desc monitors/tlob/monitor
> +
> +# enable / disable via the enable file
> +echo 1 > monitors/tlob/enable
> +grep -q 1 monitors/tlob/enable
> +echo "tlob" >> enabled_monitors
> +grep -q tlob enabled_monitors
> +
> +echo 0 > monitors/tlob/enable
> +grep -q 0 monitors/tlob/enable
> +echo "!tlob" >> enabled_monitors
> +! grep -q "^tlob$" enabled_monitors
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> new file mode 100644
> index 000000000000..41e20d593855
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> @@ -0,0 +1,34 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe binding (visible in monitor file, removable, duplicate rejected)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > monitors/tlob/enable
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=5000000" > "$TLOB_MONITOR"
> +
> +# Binding must appear in monitor file with canonical hex-offset format.
> +grep -qE "^p ${UPROBE_TARGET}:0x[0-9a-f]+ 0x[0-9a-f]+ threshold=[0-9]+$" "$TLOB_MONITOR"
> +grep -q "threshold=5000000" "$TLOB_MONITOR"
> +
> +# Duplicate offset_start must be rejected.
> +! echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=9999" >
> "$TLOB_MONITOR" 2>/dev/null
> +
> +# Remove the binding; it must no longer appear.
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR"
> +! grep -q "^p .*:0x${busy_offset#0x} " "$TLOB_MONITOR"
> +
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > monitors/tlob/enable
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> new file mode 100644
> index 000000000000..2b8656e0fef1
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> @@ -0,0 +1,47 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail sleeping (sleeping_ns dominates when task blocks between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 5000 sleep &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 50 ms budget; task sleeps 200 ms per iteration -> sleeping_ns dominates.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=50000" > "$TLOB_MONITOR"
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> + sleep 0.1
> + grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> + i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$sleeping" -gt "$((running + waiting))" ]
> +
> +echo > /sys/kernel/tracing/trace
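
A side note on the sed extraction above: when a field is absent from the
trace line, sed's s command leaves the line unchanged and prints it in full,
so the following numeric comparison fails on a non-numeric operand rather
than with a clear message. A minimal sketch of a stricter extraction
follows. tlob_field is a hypothetical helper name, not something the patch
or ftracetest provides, and the FIELD=<digits> layout is assumed from the
sed expressions in the test.

```sh
# Hypothetical helper, not part of the patch or of ftracetest.
# Usage: tlob_field LINE FIELD
# Prints the digits following "FIELD=" in LINE; returns 1 if FIELD is absent.
tlob_field() {
	# -n plus the p flag: sed prints only when the substitution matched.
	_tf_value=$(printf '%s\n' "$1" | sed -n "s/.*$2=\([0-9][0-9]*\).*/\1/p")
	[ -n "$_tf_value" ] || return 1
	printf '%s\n' "$_tf_value"
}
```

With this, sleeping=$(tlob_field "$line" sleeping_ns) makes the assignment
itself return non-zero when the field is missing, so the failure shows up at
the extraction step instead of at the comparison.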
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> new file mode 100644
> index 000000000000..0705854f24df
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail waiting (waiting_ns dominates when task is preempted between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +command -v chrt > /dev/null || exit_unsupported
> +command -v taskset > /dev/null || exit_unsupported
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_preempt_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_preempt_work_done 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +cpu=0
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Register probe before the target starts so the start uprobe fires on the
> +# first entry to tlob_preempt_work. Budget: 500 ms.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=500000" > "$TLOB_MONITOR"
> +
> +# Target starts; start probe fires on tlob_preempt_work entry.
> +taskset -c "$cpu" "$UPROBE_TARGET" 5000 preempt &
> +busy_pid=$!
> +sleep 0.05
> +
> +# RT hog on the same CPU preempts the target; target stays in waiting state
> +# (runnable, off-CPU) until the budget expires -> waiting_ns dominates.
> +chrt -f 99 taskset -c "$cpu" sh -c 'while true; do :; done' 2>/dev/null &
> +hog_pid=$!
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> + sleep 0.1
> + grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> + i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$hog_pid" 2>/dev/null; wait "$hog_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +[ "$waiting" -gt "$((running + sleeping))" ]
> +
> +echo > /sys/kernel/tracing/trace
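
The poll-then-break loop recurs in several of these test cases with only the
pattern and the iteration count changing. One way to factor it out is
sketched below. wait_for_pattern is a hypothetical helper name, not part of
the patch, and the 0.1 s step is copied from the scripts above (fractional
sleep is not guaranteed by POSIX, though the patch already relies on it).
Unlike the quoted loops, this sketch checks the file before the first sleep.

```sh
# Hypothetical helper, not part of the patch.
# Usage: wait_for_pattern PATTERN FILE TRIES
# Polls FILE up to TRIES times, 0.1 s apart. Returns 0 as soon as PATTERN
# appears, 1 if it never does within TRIES attempts.
wait_for_pattern() {
	_wp_i=0
	while [ "$_wp_i" -lt "$3" ]; do
		grep -q -- "$1" "$2" 2>/dev/null && return 0
		sleep 0.1
		_wp_i=$((_wp_i + 1))
	done
	return 1
}
```

A call such as
wait_for_pattern detail_env_tlob /sys/kernel/tracing/trace 30 && found=1
would then stand in for the open-coded loop, leaving the cleanup and the
final check on $found where they are.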
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> new file mode 100644
> index 000000000000..c4b8f7108ae9
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test two uprobe bindings on same binary (different offsets fire independently)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
> +busy_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
> +sleep_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work 2>/dev/null)
> +sleep_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$busy_stop" ] || exit_unsupported
> +[ -n "$sleep_offset" ] || exit_unsupported
> +[ -n "$sleep_stop" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 & # busy mode: tlob_busy_work fires every 200 ms
> +busy_pid=$!
> +"$UPROBE_TARGET" 30000 sleep & # sleep mode: tlob_sleep_work fires every 200 ms
> +sleep_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Binding A: 5 s budget on the busy probe - must not fire in 200 ms loops.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${busy_stop} threshold=5000000" > "$TLOB_MONITOR"
> +# Binding B: 10 us budget on the sleep probe - fires on first invocation.
> +echo "p ${UPROBE_TARGET}:${sleep_offset} ${sleep_stop} threshold=10" > "$TLOB_MONITOR"
> +
> +# Wait up to 2 s for error_env_tlob from binding B.
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> + sleep 0.1
> + grep -q "error_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> + i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +echo "-${UPROBE_TARGET}:${sleep_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$sleep_pid" 2>/dev/null; wait "$sleep_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +
> +[ "$found" = "1" ]
> +# error_env_tlob payload: label and clock variable must be present.
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "budget_exceeded"
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "clk_elapsed="
> +# detail_env_tlob must appear alongside the error.
> +grep -q "detail_env_tlob" /sys/kernel/tracing/trace
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> new file mode 100644
> index 000000000000..4a74853346e3
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> @@ -0,0 +1,19 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test no spurious error_env_tlob events without an active uprobe binding
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +sleep 0.5
> +
> +! grep -q "error_env_tlob" /sys/kernel/tracing/trace
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> new file mode 100644
> index 000000000000..624fdb950f6b
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe violation (error_env_tlob and detail_env_tlob fire with correct fields)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 10 us budget - fires almost immediately; task is busy-spinning on-CPU.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=10" > "$TLOB_MONITOR"
> +
> +# wait up to 2 s for detail_env_tlob
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> + sleep 0.1
> + grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> + i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +# error_env_tlob event label must be budget_exceeded
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "budget_exceeded"
> +
> +# detail_env_tlob must have all five fields with the correct threshold
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +echo "$line" | grep -q "pid="
> +echo "$line" | grep -q "threshold_us=10"
> +echo "$line" | grep -q "running_ns="
> +echo "$line" | grep -q "waiting_ns="
> +echo "$line" | grep -q "sleeping_ns="
> +
> +# Busy-spin keeps the task on-CPU: running_ns must exceed sleeping_ns.
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$running" -gt "$sleeping" ]
> +
> +echo > /sys/kernel/tracing/trace
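
The five grep -q checks on the detail_env_tlob line could also be written as
one loop that names the field that is missing. A minimal sketch, assuming
the name=value layout the greps above already expect. require_fields is a
hypothetical helper name, not part of the patch.

```sh
# Hypothetical helper, not part of the patch.
# Usage: require_fields LINE FIELD...
# Returns 0 if every "FIELD=" occurs in LINE; otherwise names the first
# missing field on stderr and returns 1.
require_fields() {
	_rf_line=$1
	shift
	for _rf_field in "$@"; do
		case $_rf_line in
		*"$_rf_field="*) ;;
		*)
			echo "missing field: $_rf_field" >&2
			return 1
			;;
		esac
	done
	return 0
}
```

For example,
require_fields "$line" pid threshold_us running_ns waiting_ns sleeping_ns
covers the presence checks. The exact threshold_us=10 match would still need
its own grep, since this sketch only tests that the field name is present.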
> diff --git a/tools/testing/selftests/verification/tlob/Makefile b/tools/testing/selftests/verification/tlob/Makefile
> new file mode 100644
> index 000000000000..1bedf946cb34
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/Makefile
> @@ -0,0 +1,21 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Builds tlob selftest helper binaries.
> +#
> +# Invoked by ../Makefile; pass OUTDIR to control the output directory
> +# and TOOLS_INCLUDES for the in-tree UAPI -isystem flag.
> +
> +OUTDIR ?= $(CURDIR)/..
> +CFLAGS += $(TOOLS_INCLUDES)
> +
> +.PHONY: all
> +all: $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> +
> +$(OUTDIR)/tlob_ioctl: tlob_ioctl.c
> + $(CC) $(CFLAGS) -o $@ $< -lpthread
> +
> +$(OUTDIR)/tlob_target: tlob_target.c
> + $(CC) $(CFLAGS) -o $@ $<
> +
> +.PHONY: clean
> +clean:
> + $(RM) $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> diff --git a/tools/testing/selftests/verification/tlob/tlob_ioctl.c b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> new file mode 100644
> index 000000000000..abb4e2e80a2c
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> @@ -0,0 +1,626 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_ioctl.c - ioctl test driver and ELF utility for tlob selftests
> + *
> + * Usage: tlob_ioctl <subcommand> [args...]
> + *
> + * not_enabled - bind to a disabled monitor -> ENODEV
> + * within_budget - sleep within budget -> 0
> + * over_budget_running - busy-spin past budget -> EOVERFLOW
> + * over_budget_sleeping - sleep past budget -> EOVERFLOW
> + * over_budget_waiting - sched_yield into waiting state -> EOVERFLOW
> + * double_start - two starts without stop -> EALREADY
> + * stop_no_start - stop without start -> EINVAL
> + * multi_thread - two fds: thread A within budget, thread B over
> + * bench - TRACE_START/STOP latency (TAP output, always passes)
> + * sym_offset <binary> <symbol> - print ELF file offset of symbol
> + *
> + * Exit: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <sched.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> + struct rv_bind_args bind = { .monitor_name = "tlob" };
> +
> + rv_fd = open("/dev/rv", O_RDWR);
> + if (rv_fd < 0) {
> + fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> + return -1;
> + }
> + if (ioctl(rv_fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> + fprintf(stderr, "bind tlob: %s\n", strerror(errno));
> + close(rv_fd);
> + rv_fd = -1;
> + return -1;
> + }
> + return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> + struct timespec start, now;
> + unsigned long elapsed;
> +
> + clock_gettime(CLOCK_MONOTONIC, &start);
> + do {
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> + * 1000000000UL
> + + (unsigned long)(now.tv_nsec - start.tv_nsec);
> + } while (elapsed < us * 1000UL);
> +}
> +
> +static int trace_start(uint64_t threshold_us)
> +{
> + struct tlob_start_args args = {
> + .threshold_us = threshold_us,
> + };
> +
> + return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int trace_stop(void)
> +{
> + return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* Synchronous TRACE_START / TRACE_STOP tests */
> +
> +/* Bind to a disabled monitor must return ENODEV without crashing */
> +static int test_not_enabled(void)
> +{
> + struct rv_bind_args bind = { .monitor_name = "tlob" };
> + int fd;
> + int ret;
> +
> + fd = open("/dev/rv", O_RDWR);
> + if (fd < 0) {
> + fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> + return 2; /* skip */
> + }
> +
> + ret = ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
> + close(fd);
> +
> + if (ret == 0) {
> + fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got success\n");
> + return 1;
> + }
> + if (errno != ENODEV) {
> + fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> + int ret;
> +
> + /* 50 ms budget */
> + if (trace_start(50000) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + usleep(10000); /* 10 ms */
> + ret = trace_stop();
> + if (ret != 0) {
> + fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> + ret, strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_over_budget_running(void)
> +{
> + int ret;
> +
> + /* 1 ms budget */
> + if (trace_start(1000) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + busy_spin_us(100000); /* 100 ms */
> + ret = trace_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> + return 1;
> + }
> + if (errno != EOVERFLOW) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_over_budget_sleeping(void)
> +{
> + int ret;
> +
> + /* 3 ms budget */
> + if (trace_start(3000) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + usleep(50000); /* 50 ms; sleeping time counts toward budget */
> + ret = trace_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> + return 1;
> + }
> + if (errno != EOVERFLOW) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_over_budget_waiting(void)
> +{
> + int ret;
> +
> + /* 1 us budget */
> + if (trace_start(1) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + sched_yield(); /* running -> waiting -> running */
> + busy_spin_us(10); /* 10 us >> 1 us budget; hrtimer fires during spin */
> + ret = trace_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> + return 1;
> + }
> + if (errno != EOVERFLOW) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Error-handling tests */
> +
> +static int test_double_start(void)
> +{
> + int ret;
> +
> + /* 10 s: large enough the hrtimer won't fire during the test */
> + if (trace_start(10000000ULL) < 0) {
> + fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + ret = trace_start(10000000);
> + if (ret == 0) {
> + fprintf(stderr, "second TRACE_START: expected EALREADY, got 0\n");
> + trace_stop();
> + return 1;
> + }
> + if (errno != EALREADY) {
> + fprintf(stderr, "second TRACE_START: expected EALREADY, got %s\n",
> + strerror(errno));
> + trace_stop();
> + return 1;
> + }
> + trace_stop();
> + return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> + int ret;
> +
> + /* Ensure clean state: ignore error from a stale entry */
> + trace_stop();
> +
> + ret = trace_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected EINVAL, got 0\n");
> + return 1;
> + }
> + if (errno != EINVAL) {
> + fprintf(stderr, "TRACE_STOP: expected EINVAL, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Two threads, each with its own fd: A within budget, B over budget. */
> +
> +struct mt_thread_args {
> + uint64_t threshold_us;
> + unsigned long workload_us;
> + int busy;
> + int expect_eoverflow;
> + int result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> + struct mt_thread_args *a = arg;
> + struct tlob_start_args args = { .threshold_us = a->threshold_us };
> + struct rv_bind_args bind = { .monitor_name = "tlob" };
> + int fd;
> + int ret;
> +
> + fd = open("/dev/rv", O_RDWR);
> + if (fd < 0) {
> + fprintf(stderr, "thread open /dev/rv: %s\n", strerror(errno));
> + a->result = 1;
> + return NULL;
> + }
> + if (ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> + fprintf(stderr, "thread bind tlob: %s\n", strerror(errno));
> + close(fd);
> + a->result = 1;
> + return NULL;
> + }
> +
> + ret = ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> + if (ret < 0) {
> + fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> + close(fd);
> + a->result = 1;
> + return NULL;
> + }
> +
> + if (a->busy)
> + busy_spin_us(a->workload_us);
> + else
> + usleep(a->workload_us);
> +
> + ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + if (a->expect_eoverflow) {
> + if (ret == 0 || errno != EOVERFLOW) {
> + fprintf(stderr, "thread: expected EOVERFLOW, got ret=%d errno=%s\n",
> + ret, strerror(errno));
> + close(fd);
> + a->result = 1;
> + return NULL;
> + }
> + } else {
> + if (ret != 0) {
> + fprintf(stderr, "thread: expected 0, got ret=%d errno=%s\n",
> + ret, strerror(errno));
> + close(fd);
> + a->result = 1;
> + return NULL;
> + }
> + }
> + close(fd);
> + a->result = 0;
> + return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> + pthread_t ta, tb;
> + struct mt_thread_args a = {
> + .threshold_us = 20000, /* 20 ms */
> + .workload_us = 5000, /* 5 ms sleep -> within budget */
> + .busy = 0,
> + .expect_eoverflow = 0,
> + };
> + struct mt_thread_args b = {
> + .threshold_us = 3000, /* 3 ms */
> + .workload_us = 30000, /* 30 ms spin -> over budget */
> + .busy = 1,
> + .expect_eoverflow = 1,
> + };
> +
> + pthread_create(&ta, NULL, mt_thread_fn, &a);
> + pthread_create(&tb, NULL, mt_thread_fn, &b);
> + pthread_join(ta, NULL);
> + pthread_join(tb, NULL);
> +
> + return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/*
> + * Benchmark TRACE_START, TRACE_STOP, and round-trip ioctls.
> + * Output uses TAP '#' prefix; always returns 0.
> + */
> +#define BENCH_WARMUP 32
> +#define BENCH_N 1000
> +
> +static long long timespec_diff_ns(const struct timespec *a,
> + const struct timespec *b)
> +{
> + return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
> + + (b->tv_nsec - a->tv_nsec);
> +}
> +
> +static int test_bench(void)
> +{
> + struct tlob_start_args args = {
> + .threshold_us = 10000000ULL, /* 10 s */
> + };
> + struct timespec t0, t1;
> + long long total_start_ns = 0, total_stop_ns = 0, total_rt_ns = 0;
> + int i;
> +
> + /* warm up */
> + for (i = 0; i < BENCH_WARMUP; i++) {
> + if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) == 0)
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + }
> +
> + /* start only */
> + for (i = 0; i < BENCH_N; i++) {
> + clock_gettime(CLOCK_MONOTONIC, &t0);
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> + clock_gettime(CLOCK_MONOTONIC, &t1);
> + total_start_ns += timespec_diff_ns(&t0, &t1);
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + }
> +
> + /* stop only */
> + for (i = 0; i < BENCH_N; i++) {
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> + clock_gettime(CLOCK_MONOTONIC, &t0);
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + clock_gettime(CLOCK_MONOTONIC, &t1);
> + total_stop_ns += timespec_diff_ns(&t0, &t1);
> + }
> +
> + /* round-trip */
> + clock_gettime(CLOCK_MONOTONIC, &t0);
> + for (i = 0; i < BENCH_N; i++) {
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> + ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + }
> + clock_gettime(CLOCK_MONOTONIC, &t1);
> + total_rt_ns = timespec_diff_ns(&t0, &t1);
> +
> + printf("# start ioctl only: %lld ns/iter (N=%d, includes syscall)\n",
> + total_start_ns / BENCH_N, BENCH_N);
> + printf("# stop ioctl only: %lld ns/iter (N=%d, includes syscall)\n",
> + total_stop_ns / BENCH_N, BENCH_N);
> + printf("# start+stop roundtrip: %lld ns/iter (N=%d, includes 2 syscalls)\n",
> + total_rt_ns / BENCH_N, BENCH_N);
> + return 0;
> +}
> +
> +/*
> + * Print the ELF file offset of <symname> in <binary>. Walks .symtab
> + * (falling back to .dynsym) and converts vaddr to file offset via PT_LOAD.
> + * Supports 32- and 64-bit ELF.
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> + int fd;
> + struct stat st;
> + void *map;
> + Elf64_Ehdr *ehdr;
> + Elf32_Ehdr *ehdr32;
> + int is64;
> + uint64_t sym_vaddr = 0;
> + int found = 0;
> + uint64_t file_offset = 0;
> +
> + fd = open(binary, O_RDONLY);
> + if (fd < 0) {
> + fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> + return 1;
> + }
> + if (fstat(fd, &st) < 0) {
> + close(fd);
> + return 1;
> + }
> + map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> + close(fd);
> + if (map == MAP_FAILED) {
> + fprintf(stderr, "mmap: %s\n", strerror(errno));
> + return 1;
> + }
> +
> + ehdr = (Elf64_Ehdr *)map;
> + ehdr32 = (Elf32_Ehdr *)map;
> + if (st.st_size < 4 ||
> + ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> + ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> + ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> + ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> + fprintf(stderr, "%s: not an ELF file\n", binary);
> + munmap(map, (size_t)st.st_size);
> + return 1;
> + }
> + is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> + if (is64) {
> + Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
> + Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> + const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> + int si;
> +
> + /* prefer .symtab; fall back to .dynsym */
> + for (int pass = 0; pass < 2 && !found; pass++) {
> + const char *target = pass ? ".dynsym" : ".symtab";
> +
> + for (si = 0; si < ehdr->e_shnum && !found; si++) {
> + Elf64_Shdr *sh = &shdrs[si];
> + const char *name = shstrtab + sh->sh_name;
> +
> + if (strcmp(name, target) != 0)
> + continue;
> +
> + Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> + const char *strtab = (char *)map + strtab_sh->sh_offset;
> + Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
> + uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
> + uint64_t j;
> +
> + for (j = 0; j < nsyms; j++) {
> + if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> + sym_vaddr = syms[j].st_value;
> + found = 1;
> + break;
> + }
> + }
> + }
> + }
> +
> + if (!found) {
> + fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> + munmap(map, (size_t)st.st_size);
> + return 1;
> + }
> +
> + /* Convert vaddr to file offset via PT_LOAD segments */
> + Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
> + int pi;
> +
> + for (pi = 0; pi < ehdr->e_phnum; pi++) {
> + Elf64_Phdr *ph = &phdrs[pi];
> +
> + if (ph->p_type != PT_LOAD)
> + continue;
> + if (sym_vaddr >= ph->p_vaddr &&
> + sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> + file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
> + break;
> + }
> + }
> + } else {
> + /* 32-bit ELF */
> + Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
> + Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> + const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> + int si;
> + uint32_t sym_vaddr32 = 0;
> +
> + for (int pass = 0; pass < 2 && !found; pass++) {
> + const char *target = pass ? ".dynsym" : ".symtab";
> +
> + for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> + Elf32_Shdr *sh = &shdrs[si];
> + const char *name = shstrtab + sh->sh_name;
> +
> + if (strcmp(name, target) != 0)
> + continue;
> +
> + Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> + const char *strtab = (char *)map + strtab_sh->sh_offset;
> + Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
> + uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
> + uint32_t j;
> +
> + for (j = 0; j < nsyms; j++) {
> + if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> + sym_vaddr32 = syms[j].st_value;
> + found = 1;
> + break;
> + }
> + }
> + }
> + }
> +
> + if (!found) {
> + fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> + munmap(map, (size_t)st.st_size);
> + return 1;
> + }
> +
> + Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
> + int pi;
> +
> + for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> + Elf32_Phdr *ph = &phdrs[pi];
> +
> + if (ph->p_type != PT_LOAD)
> + continue;
> + if (sym_vaddr32 >= ph->p_vaddr &&
> + sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> + file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
> + break;
> + }
> + }
> + sym_vaddr = sym_vaddr32;
> + }
> +
> + munmap(map, (size_t)st.st_size);
> +
> + if (!file_offset && sym_vaddr) {
> + fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> + (unsigned long)sym_vaddr);
> + return 1;
> + }
> +
> + printf("0x%lx\n", (unsigned long)file_offset);
> + return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + int rc;
> +
> + if (argc < 2) {
> + fprintf(stderr, "Usage: %s <subcommand> [args...]\n", argv[0]);
> + return 1;
> + }
> +
> + /* sym_offset does not need /dev/rv */
> + if (strcmp(argv[1], "sym_offset") == 0) {
> + if (argc < 4) {
> + fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n",
> + argv[0]);
> + return 1;
> + }
> + return sym_offset(argv[2], argv[3]);
> + }
> +
> + /* not_enabled: monitor is disabled; bind must return ENODEV without open_rv() */
> + if (strcmp(argv[1], "not_enabled") == 0)
> + return test_not_enabled();
> +
> + if (open_rv() < 0)
> + return 2; /* skip */
> +
> + if (strcmp(argv[1], "bench") == 0)
> + rc = test_bench();
> + else if (strcmp(argv[1], "within_budget") == 0)
> + rc = test_within_budget();
> + else if (strcmp(argv[1], "over_budget_running") == 0)
> + rc = test_over_budget_running();
> + else if (strcmp(argv[1], "over_budget_sleeping") == 0)
> + rc = test_over_budget_sleeping();
> + else if (strcmp(argv[1], "over_budget_waiting") == 0)
> + rc = test_over_budget_waiting();
> + else if (strcmp(argv[1], "double_start") == 0)
> + rc = test_double_start();
> + else if (strcmp(argv[1], "stop_no_start") == 0)
> + rc = test_stop_no_start();
> + else if (strcmp(argv[1], "multi_thread") == 0)
> + rc = test_multi_thread();
> + else {
> + fprintf(stderr, "Unknown test: %s\n", argv[1]);
> + rc = 1;
> + }
> +
> + close(rv_fd);
> + return rc;
> +}
> diff --git a/tools/testing/selftests/verification/tlob/tlob_target.c b/tools/testing/selftests/verification/tlob/tlob_target.c
> new file mode 100644
> index 000000000000..0fdbc575d71d
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_target.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides three start/stop probe pairs, each designed to exercise a
> + * different dominant component of the detail_env_tlob ns breakdown:
> + *
> + * tlob_busy_work / tlob_busy_work_done - busy-spin: running_ns dominates
> + * tlob_sleep_work / tlob_sleep_work_done - nanosleep: sleeping_ns dominates
> + * tlob_preempt_work / tlob_preempt_work_done - busy-spin: waiting_ns dominates
> + * (needs an RT competitor on the same CPU)
> + *
> + * Usage: tlob_target <duration_ms> [mode]
> + *
> + * mode is one of: busy (default), sleep, preempt.
> + * Loops in 200 ms iterations until <duration_ms> has elapsed
> + * (0 = run for ~24 hours).
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> + const struct timespec *b)
> +{
> + return a->tv_sec < b->tv_sec ||
> + (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> + ts->tv_sec += ms / 1000;
> + ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> + if (ts->tv_nsec >= 1000000000L) {
> + ts->tv_sec++;
> + ts->tv_nsec -= 1000000000L;
> + }
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_busy_work_done(void)
> +{
> + /* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; busy-spin so running_ns dominates */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> + struct timespec start, now;
> + unsigned long elapsed;
> +
> + clock_gettime(CLOCK_MONOTONIC, &start);
> + do {
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> + * 1000000000UL
> + + (unsigned long)(now.tv_nsec - start.tv_nsec);
> + } while (elapsed < duration_ns);
> +
> + tlob_busy_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_sleep_work_done(void)
> +{
> + /* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; nanosleep so sleeping_ns dominates */
> +noinline void tlob_sleep_work(unsigned long duration_ms)
> +{
> + struct timespec ts = {
> + .tv_sec = duration_ms / 1000,
> + .tv_nsec = (long)(duration_ms % 1000) * 1000000L,
> + };
> + nanosleep(&ts, NULL);
> + tlob_sleep_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_preempt_work_done(void)
> +{
> + /* empty: uprobe fires on entry */
> +}
> +
> +/*
> + * start probe; busy-spin so an RT competitor on the same CPU drives
> + * waiting_ns (prev_state==0 -> preempt event, task stays runnable off-CPU).
> + */
> +noinline void tlob_preempt_work(unsigned long duration_ms)
> +{
> + struct timespec start, now;
> + unsigned long elapsed;
> +
> + clock_gettime(CLOCK_MONOTONIC, &start);
> + do {
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> + * 1000000000UL
> + + (unsigned long)(now.tv_nsec - start.tv_nsec);
> + } while (elapsed < duration_ms * 1000000UL);
> +
> + tlob_preempt_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + unsigned long duration_ms = 0;
> + const char *mode = "busy";
> + struct timespec deadline, now;
> +
> + if (argc >= 2)
> + duration_ms = strtoul(argv[1], NULL, 10);
> + if (argc >= 3)
> + mode = argv[2];
> +
> + clock_gettime(CLOCK_MONOTONIC, &deadline);
> + timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> + do {
> + if (strcmp(mode, "sleep") == 0)
> + tlob_sleep_work(200);
> + else if (strcmp(mode, "preempt") == 0)
> + tlob_preempt_work(200);
> + else
> + tlob_busy_work(200 * 1000000UL);
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + } while (timespec_before(&now, &deadline));
> +
> + return 0;
> +}
^ permalink raw reply
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13 7:53 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
kernel-team, Lance Yang
In-Reply-To: <agMj4ukhj1PkXXrN@gmail.com>
On 5/12/26 15:04, Breno Leitao wrote:
> On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>> unsigned long page_flags;
>>> bool retry = true;
>>> int hugetlb = 0;
>>> + bool is_reserved;
>>>
>>> if (!sysctl_memory_failure_recovery)
>>> panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>> * In fact it's dangerous to directly bump up page count from 0,
>>> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>> */
>>> + /*
>>> + * Pages with PG_reserved set are not currently managed by the
>>> + * page allocator (memblock-reserved memory, driver reservations,
>>> + * etc.), so classify them as kernel-owned for reporting.
>>> + *
>>> + * Sample the flag before get_hwpoison_page(): in the
>>> + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> + * reference before returning -EIO, after which page->flags may
>>> + * have been reset by the allocator.
>>> + */
>>> + is_reserved = PageReserved(p);
>>> +
>>> res = get_hwpoison_page(p, flags);
>>> if (!res) {
>>> if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>> }
>>> goto unlock_mutex;
>>> } else if (res < 0) {
>>> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> + if (is_reserved)
>>> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> + else
>>> + res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> + MF_IGNORED);
>>> goto unlock_mutex;
>>> }
>>>
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
>
> From what I read, it seems that error_states[0] = { reserved, reserved, MF_MSG_KERNEL, me_kernel }
> has been effectively dead code on the hwpoison-from-MCE path for a
> while.
>
> My v6 patch relabels the failure-path output to match what me_kernel() would
> have reported anyway.
>
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
>
> Are you suggesting we should call the page action as soon as we detect the page
> is reserved and get out?
>
> Something as:
>
> if (PageReserved(p)) {
> res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> goto unlock_mutex;
> }
>
> res = get_hwpoison_page(p, flags);
Or you combine this patch with the other patch and simply let
get_hwpoison_page() check that, and return an appropriate error code for
unhandlable pages that you can process here?
Like, maybe, returning -EIO directly?
res = get_hwpoison_page(p, flags);
switch (res) {
case 0: /* Success */
...
break;
case -EIO: /* Unhandlable kernel page. */
...
break;
case -EBUSY: /* Race, try again? */
...
break;
case ...
}
You can add more return codes as you see fit.
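
As a rough userspace sketch of the callee side of that idea (not the kernel implementation): a toy get_hwpoison_page() that checks the reserved state itself and reports why it failed. `struct toy_page`, its fields and `toy_get_hwpoison_page()` are hypothetical names made up for illustration, and the mapping of return codes to conditions simply follows the labels in the pseudocode above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct toy_page {
	bool reserved;	/* models PG_reserved */
	bool in_flux;	/* models a transient lifecycle race */
	int refcount;
};

/*
 * Toy model: the callee classifies the page and tells the caller why
 * no reference was taken.
 *   0      - success, reference taken
 *   -EIO   - unhandlable kernel page (reserved, in this sketch)
 *   -EBUSY - raced with something, the caller may try again
 */
int toy_get_hwpoison_page(struct toy_page *p)
{
	if (p->reserved)
		return -EIO;
	if (p->in_flux)
		return -EBUSY;
	p->refcount++;
	return 0;
}
```

With the check inside the callee, the caller no longer has to sample the reserved flag before the call; it only has to dispatch on the return value.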
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13 7:53 UTC (permalink / raw)
To: jane.chu, Breno Leitao, Miaohe Lin, Naoya Horiguchi,
Andrew Morton, Jonathan Corbet, Shuah Khan, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <816e3d8e-22d2-49a4-92ae-981568f38792@oracle.com>
On 5/12/26 19:58, jane.chu@oracle.com wrote:
>
>
> On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 19 ++++++++++++++++++-
>>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>> unsigned long page_flags;
>>> bool retry = true;
>>> int hugetlb = 0;
>>> + bool is_reserved;
>>> if (!sysctl_memory_failure_recovery)
>>> panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>> * In fact it's dangerous to directly bump up page count from 0,
>>> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>> */
>>> + /*
>>> + * Pages with PG_reserved set are not currently managed by the
>>> + * page allocator (memblock-reserved memory, driver reservations,
>>> + * etc.), so classify them as kernel-owned for reporting.
>>> + *
>>> + * Sample the flag before get_hwpoison_page(): in the
>>> + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> + * reference before returning -EIO, after which page->flags may
>>> + * have been reset by the allocator.
>>> + */
>>> + is_reserved = PageReserved(p);
>>> +
>>> res = get_hwpoison_page(p, flags);
>>> if (!res) {
>>> if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>> }
>>> goto unlock_mutex;
>>> } else if (res < 0) {
>>> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> + if (is_reserved)
>>> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> + else
>>> + res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> + MF_IGNORED);
>>> goto unlock_mutex;
>>> }
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
>>
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
>>
>
> Ah, good point!
> It seems to me that all unhandlable pages should head out to identify_page_state:
>
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
> * In fact it's dangerous to directly bump up page count from 0,
> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
> */
> +
> + if (!HWPoisonHandlable(page, flags))
> + goto identify_page_state;
> +
> res = get_hwpoison_page(p, flags);
> if (!res) {
> if (is_free_buddy_page(p)) {
That's one option, or we just let get_hwpoison_page() return clearer error
codes, let it take care of checking PageReserved, and process the error codes
returned by get_hwpoison_page() in a better way.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13 7:54 UTC (permalink / raw)
To: Lance Yang
Cc: leitao, linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs,
vbabka, rppt, surenb, mhocko, shuah, rostedt, mhiramat,
mathieu.desnoyers, liam, linux-mm, linux-kernel, linux-doc,
linux-kselftest, linux-trace-kernel, kernel-team
In-Reply-To: <20260512124837.38883-1-lance.yang@linux.dev>
On 5/12/26 14:48, Lance Yang wrote:
>
> On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 19 ++++++++++++++++++-
>>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>> unsigned long page_flags;
>>> bool retry = true;
>>> int hugetlb = 0;
>>> + bool is_reserved;
>>>
>>> if (!sysctl_memory_failure_recovery)
>>> panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>> * In fact it's dangerous to directly bump up page count from 0,
>>> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>> */
>>> + /*
>>> + * Pages with PG_reserved set are not currently managed by the
>>> + * page allocator (memblock-reserved memory, driver reservations,
>>> + * etc.), so classify them as kernel-owned for reporting.
>>> + *
>>> + * Sample the flag before get_hwpoison_page(): in the
>>> + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> + * reference before returning -EIO, after which page->flags may
>>> + * have been reset by the allocator.
>>> + */
>>> + is_reserved = PageReserved(p);
>>> +
>>> res = get_hwpoison_page(p, flags);
>>> if (!res) {
>>> if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>> }
>>> goto unlock_mutex;
>>> } else if (res < 0) {
>>> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> + if (is_reserved)
>>> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> + else
>>> + res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> + MF_IGNORED);
>>> goto unlock_mutex;
>>> }
>>>
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
>
> Looks like we never get that far ...
Right, likely that should be removed+cleaned up then.
--
Cheers,
David
^ permalink raw reply
* Re: [RFC PATCH v2 03/10] selftests/verification: fix verificationtest-ktap for out-of-tree execution
From: Gabriele Monaco @ 2026-05-13 8:32 UTC (permalink / raw)
To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <7368ee25b1b45c92beb14c05be366b71da585ca4.1778522945.git.wen.yang@linux.dev>
On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> verificationtest-ktap used a CWD-relative path (../ftrace/ftracetest)
> and a relative argument (../verification) for --rv. This works when
> the shell changes into the verification directory first, but breaks
> when the script is invoked directly - e.g. by the kselftest runner or
> vng - because the working directory is the kernel source root, not the
> script's own directory.
>
> Fix this by computing the script's directory from $0 with cd/dirname/pwd
> and using absolute paths for both the ftracetest invocation and the --rv
> argument. Also export the directory to PATH so that check_requires in
> the ftracetest framework can locate helper binaries.
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
Just out of curiosity, how do you run the selftests?
Are you calling the script directly just to run /some/ of them?
The officially supported way is through make [1]:
make -C tools/testing/selftests TARGETS=verification run_tests
(though I find it faster to omit TARGETS and just do make -C
tools/testing/selftests/verification).
Calling with make should set up all paths as needed.
> ---
> tools/testing/selftests/verification/verificationtest-ktap | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/verification/verificationtest-ktap b/tools/testing/selftests/verification/verificationtest-ktap
> index 18f7fe324e2f..456b8578a307 100755
> --- a/tools/testing/selftests/verification/verificationtest-ktap
> +++ b/tools/testing/selftests/verification/verificationtest-ktap
> @@ -5,4 +5,6 @@
> #
> # Copyright (C) Arm Ltd., 2023
>
> -../ftrace/ftracetest -K -v --rv ../verification
> +dir=$(cd "$(dirname "$0")" && pwd)
> +export PATH="$dir:$PATH"
Then if you really really need to call it directly, do you need to override
PATH?
And isn't it clearer to do:
dir=$(realpath "$(dirname "$0")")
Thanks,
Gabriele
[1] - https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
> +"$dir/../ftrace/ftracetest" -K -v --rv "$dir"
^ permalink raw reply
* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-13 9:31 UTC (permalink / raw)
To: Wen Yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <cb929b8b-5bfb-4afe-ba50-45620c38ea96@linux.dev>
On Wed, 2026-05-13 at 13:32 +0800, Wen Yang wrote:
> Thanks for both messages. Two patches are ready; let me address
> your follow-up concerns before sending.
>
> 1. "all monitors reusing slots would suffer from it"
>
> Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
> pool. RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
> storage (a single static variable and a per-cpu variable) and
> never share slots across monitor types. The race is exclusive
> to PER_TASK, so fixing that variant's da_monitor_destroy() is
> the correct scope.
>
> 2. "LTL monitors don't even have monitoring"
>
> tracepoint_synchronize_unregister() does not rely on the
> monitoring flag at all. It is a system-wide barrier — it
> calls synchronize_rcu_tasks_trace() followed by
> synchronize_srcu(&tracepoint_srcu) — draining every in-flight
> tracepoint handler on every CPU regardless of which monitor
> dispatched it. LTL handlers are covered without any special
> treatment.
>
> The slot-ordering issue (patch 1) affects all per-task DA monitors,
> not only HA ones — "independent on HA" — because
> RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
> past the end of rv[]), so da_monitor_reset_all() overwrites whatever
> follows rv[] in task_struct whenever any per-task monitor is
> disabled.
Exactly, and since whatever follows .rv is randomised on a task_struct, this can
get quite nasty.
I included my version of the fix in the series in [1], but feel free to send
yours, you got there first ;)
>
> Also corrected "wwnr probe handler" to "stall probe handler" in
> patch 2 per your annotation.
>
While tracepoint_synchronize_unregister() does fix the race, I still see a time
bomb in the way we do ha_monitor_reset_env().
Since we reuse the same slots for per-task monitors (not for the others, you're
right, I was brainfarting), we essentially don't know what happened before we do
da_monitor_init(): the same slot could have been used by an LTL monitor, which
cannot even reliably clear the byte used by the monitoring flag.
Now, we either mandate all monitors to memset the entire slot (union
rv_task_monitor) or we don't assume anything about the slot's state during
initialisation. Any middle ground could reveal pesky bugs as soon as we refactor
the structs.
The latter idea is what I did in [1]. I believe that would make the
synchronisation superfluous.
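
A minimal userspace sketch of that second option, assuming a shared slot modelled as a union. The `toy_*` types and `toy_da_slot_init()` are hypothetical and only illustrate the principle; this is not necessarily how the series in [1] implements it.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-task monitor variants that share one slot. */
struct toy_da_monitor {
	bool monitoring;
	unsigned char curr_state;
};

struct toy_ltl_monitor {
	unsigned long states;	/* overlaps the DA fields in the union */
};

union toy_task_monitor {
	struct toy_da_monitor da;
	struct toy_ltl_monitor ltl;
};

/*
 * Initialise a slot for DA use without trusting whatever the previous
 * owner left behind: wipe the whole union first, then set every field
 * the DA code reads later.
 */
void toy_da_slot_init(union toy_task_monitor *slot,
		      unsigned char initial_state)
{
	memset(slot, 0, sizeof(*slot));
	slot->da.curr_state = initial_state;
	slot->da.monitoring = false;
}
```

The point of the sketch is that correctness depends only on the init path, so a previous user of the slot that leaves stale bytes behind cannot affect the new owner.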
What do you think?
Thanks,
Gabriele
[1] - https://lore.kernel.org/lkml/20260512140250.262190-8-gmonaco@redhat.com
> Please let me know if the above reasoning addresses your concerns.
>
>
> --
> Best wishes,
> Wen
>
> > >
> > > > include/rv/da_monitor.h | 18 ++++++++++++++++--
> > > > 1 file changed, 16 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > > > index 00ded3d5ab3f..d04bb3229c75 100644
> > > > --- a/include/rv/da_monitor.h
> > > > +++ b/include/rv/da_monitor.h
> > > > @@ -304,6 +304,20 @@ static int da_monitor_init(void)
> > > >
> > > > /*
> > > > * da_monitor_destroy - return the allocated slot
> > > > + *
> > > > + * Call tracepoint_synchronize_unregister() before reset_all() to close
> > > > + * the race where an in-flight non-HA probe handler sets monitoring=1
> > > > + * (without calling timer_setup()) after da_monitor_reset_all() has
> > > > + * already cleared the slot but before the caller's own sync completes.
> > > > + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> > > > + * the same slot would call timer_delete() on a never-initialised
> > > > + * timer_list, triggering ODEBUG warnings.
> > > > + *
> > > > + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> > > > + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> > > > + * The caller's own __rv_disable_monitor() issues a second sync after
> > > > + * returning from disable(); that redundant call is harmless on the
> > > > + * infrequent admin (enable/disable) path.
> > > > */
> > > > static inline void da_monitor_destroy(void)
> > > > {
> > > > @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> > > > WARN_ONCE(1, "Disabling a disabled monitor: "
> > > > __stringify(MONITOR_NAME));
> > > > return;
> > > > }
> > > > + tracepoint_synchronize_unregister();
> > > > + da_monitor_reset_all();
> > > > rv_put_task_monitor_slot(task_mon_slot);
> > > > task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > > > -
> > > > - da_monitor_reset_all();
> > > > }
> > > >
> > > > #elif RV_MON_TYPE == RV_MON_PER_OBJ
> >
^ permalink raw reply
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-13 9:35 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Jiri Olsa, Masami Hiramatsu, Andrii Nakryiko,
bpf, linux-trace-kernel, Oleg Nesterov, Peter Zijlstra,
Ingo Molnar
In-Reply-To: <CAEf4Bza4sqLw8GHoq+MFBgsnhiJ_s91UUrMcA-paYeBr7=bz0A@mail.gmail.com>
On Tue, May 12, 2026 at 12:38:34PM -0700, Andrii Nakryiko wrote:
> On Tue, May 12, 2026 at 12:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > + /*
> > > + * We have nop10 (with first byte overwritten to int3),
> > > + * change it to:
> > > + * lea 0x80(%rsp), %rsp
> > > + * call tramp
> > > + *
> > > + * The first lea instruction skips the stack redzone so the call
> > > + * instruction can safely push return address on stack.
> > > + */
> >
> > typo: lea -128(%rsp), %rsp
ugh, thanks
> >
> > you can also do:
> >
> > add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.
>
> When I asked AI about this it explained that add instruction modifies
> flags, so it's not a good fit here. lea doesn't touch flags.
>
> >
> > Initially I didn't like this approach, since we just introduced
> > usdt nop5 and now need to recompile everything again,
> > but looking at the fix it's definitely simpler than alternatives
> > and doesn't have annoying limitations.
>
>
> yeah, limitations are annoying, especially with those global "DO NOT
> OPTIMIZE" flags... Jiri, let's polish your version and land it?
ok, will send it out
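
For reference, the size arithmetic quoted above can be sketched in C. The byte sequences are written out by hand from the x86-64 instruction format as an assumption (worth checking against an assembler before relying on them), and `probe_site_len()` is a hypothetical helper, not something from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* lea -128(%rsp), %rsp: REX.W 8D /r with SIB and disp8 = -128; leaves flags alone */
const uint8_t lea_skip_redzone[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/* add $-128, %rsp: REX.W 83 /0 ib; one byte shorter, but it modifies flags */
const uint8_t add_skip_redzone[] = { 0x48, 0x83, 0xc4, 0x80 };

/* call rel32: E8 followed by a 32-bit displacement (zeroed in this sketch) */
const uint8_t call_rel32[] = { 0xe8, 0x00, 0x00, 0x00, 0x00 };

/* Bytes needed at the probe site for "skip the red zone, then call". */
size_t probe_site_len(size_t skip_len)
{
	return skip_len + sizeof(call_rel32);
}
```

With these encodings the lea variant comes to 5 + 5 = 10 bytes and the add variant to 4 + 5 = 9 bytes, which matches the numbers in the thread; the lea form spends the extra byte to avoid touching the flags.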
jirka
^ permalink raw reply
* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: David Hildenbrand (Arm) @ 2026-05-13 11:48 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
kernel-team, Lance Yang
In-Reply-To: <agMpqhpgmezqnaA_@gmail.com>
On 5/12/26 15:33, Breno Leitao wrote:
> On Tue, May 12, 2026 at 10:21:50AM +0200, David Hildenbrand (Arm) wrote:
>>
>>> }
>>> goto unlock_mutex;
>>> } else if (res < 0) {
>>> - if (is_reserved)
>>> + /*
>>> + * Promote a stable unhandlable kernel page diagnosed by
>>> + * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
>>> + * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
>>> + */
>>> + if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
>>> res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>
>>
> >> It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
> >> is unhandlable if it is PG_reserved?
>
> Are you saying that we should identify if the page is PG_reserved in
> get_hwpoison_page() instead of in memory_failure(), as done in the
> previous patch ("mm/memory-failure: report MF_MSG_KERNEL for reserved
> pages") ?
>
> >> Why can't we just return a special error code from get_hwpoison_page()? We have
> >> plenty of errno values to choose from.
>
> Something like:
>
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..0a6d83575833e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -878,7 +878,7 @@ static const char *action_name[] = {
> };
>
> static const char * const action_page_types[] = {
> - [MF_MSG_KERNEL] = "reserved kernel page",
> + [MF_MSG_KERNEL] = "unrecoverable kernel page",
> [MF_MSG_KERNEL_HIGH_ORDER] = "high-order kernel page",
> [MF_MSG_HUGE] = "huge page",
> [MF_MSG_FREE_HUGE] = "free huge page",
> @@ -1394,6 +1394,21 @@ static int get_any_page(struct page *p, unsigned long flags)
> int ret = 0, pass = 0;
> bool count_increased = false;
>
> + if (PageReserved(p)) {
> + ret = -ENOTRECOVERABLE;
> + goto out;
> + }
> +
> if (flags & MF_COUNT_INCREASED)
> count_increased = true;
>
> @@ -1422,7 +1437,7 @@ static int get_any_page(struct page *p, unsigned long flags)
> shake_page(p);
> goto try_again;
> }
> - ret = -EIO;
> + ret = -ENOTRECOVERABLE;
> goto out;
> }
> }
> @@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> goto try_again;
> }
> put_page(p);
> - ret = -EIO;
> + ret = -ENOTRECOVERABLE;
> }
> out:
> - if (ret == -EIO)
> + if (ret == -EIO || ret == -ENOTRECOVERABLE)
> pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
>
> return ret;
> @@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
> res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> }
> goto unlock_mutex;
> + } else if (res == -ENOTRECOVERABLE) {
> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> + goto unlock_mutex;
> } else if (res < 0) {
> res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> goto unlock_mutex;
That would probably read nicer as
switch (res) {
case 0: ...
case 1: ...
case -ENOTRECOVERABLE: ...
case ...
default:
}
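
A compilable userspace sketch of the caller side of that switch. `enum toy_outcome` and `toy_classify()` are hypothetical names; the meaning given to 0 and 1 is an assumption based on the quoted hunks (0: no reference taken, 1: reference held), and -ENOTRECOVERABLE is the code proposed in the quoted diff.

```c
#include <errno.h>

/* Hypothetical outcomes standing in for the branches of memory_failure(). */
enum toy_outcome {
	TOY_HANDLE_FREE_PAGE,		/* res == 0: no reference taken */
	TOY_CONTINUE,			/* res == 1: reference held, keep going */
	TOY_REPORT_KERNEL,		/* stable unhandlable kernel page */
	TOY_REPORT_GET_HWPOISON,	/* anything else, e.g. a transient race */
};

/* Map the return value of the page-grabbing helper to what the caller does. */
enum toy_outcome toy_classify(int res)
{
	switch (res) {
	case 0:
		return TOY_HANDLE_FREE_PAGE;
	case 1:
		return TOY_CONTINUE;
	case -ENOTRECOVERABLE:
		return TOY_REPORT_KERNEL;
	default:
		return TOY_REPORT_GET_HWPOISON;
	}
}
```

Each new return code then needs only one more case label, instead of another else-if branch and a flag sampled ahead of the call.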
>
>
> If that is what you are suggesting, maybe we can create another
> MF_MSG_RESERVED, and another return value for get_any_page() to track
> the reserved pages?
I guess "reserved" is really just like most other kernel pages. So I wouldn't
special-case them here.
Or would there be a good reason?
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Steven Rostedt @ 2026-05-13 12:34 UTC (permalink / raw)
To: Yash Suthar
Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
skhan, me
In-Reply-To: <CAPfzD4kiEMFdbK36X1_f1+Vn=v3nfvUODZJuJ40sNJ_fRr9zKA@mail.gmail.com>
On Tue, 12 May 2026 20:11:08 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:
> Gentle ping.
Hi,
What's the rush? This is just a cleanup change. There's no feature here
that you need, is there?
If you see it in patchwork[1], it's not lost. I just have other things
ahead of it. I usually process cleanup code last.
Thanks,
-- Steve
[1] https://patchwork.kernel.org/project/linux-trace-kernel/patch/20260502174741.39636-1-yashsuthar983@gmail.com/
^ permalink raw reply