Linux Trace Kernel
* Re: [PATCH] rtla/tests: Fix pgrep filter in get_workload_pids.sh
From: Wander Lairson Costa @ 2026-06-10 12:52 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Steven Rostedt, John Kacur, Luis Goncalves, Crystal Wood,
	Costa Shulyupin, LKML, linux-trace-kernel
In-Reply-To: <20260604140547.3616495-1-tglozar@redhat.com>

On Thu, Jun 04, 2026 at 04:05:47PM +0200, Tomas Glozar wrote:
> Multiple runtime tests in RTLA rely on the get_workload_pids() shell
> helper function to get the PIDs of both kernel and user workloads.
> 
> On some systems (e.g. Fedora 43), pgrep matches kernel thread names
> including square brackets: "[osnoise/0]"; on other systems (e.g.
> RHEL 9.8), brackets are not included: "osnoise/0".
> 
> Accept both as valid workload PIDs rather than just the non-bracket form
> to make the tests work on all systems.
> 
> Fixes: a98dad63cda3 ("rtla/tests: Add runtime test for -k and -u options")
> Reported-by: Crystal Wood <crwood@redhat.com>
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---
> 
> Note: the file touched by this commit is included by .gitignore, that is
> an error that will be fixed by [1].
> 
> [1] https://lore.kernel.org/linux-trace-kernel/20260601091835.3118094-1-tglozar@redhat.com/
> 

Reviewed-by: Wander Lairson Costa <wander@redhat.com>



* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: David Laight @ 2026-06-10 11:06 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Markus Schneider-Pargmann (The Capable Hub), Steven Rostedt,
	Mathieu Desnoyers, Heiko Carstens, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260610171740.c30c43c5faee0beac3ad7546@kernel.org>

On Wed, 10 Jun 2026 17:17:40 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Hi Markus,
> 
> Thanks for pinging me.
> 
> On Tue, 28 Apr 2026 10:30:29 +0200
> "Markus Schneider-Pargmann (The Capable Hub)" <msp@baylibre.com> wrote:
> 
> > fp pointer and unsigned long have the same size on all relevant
> > architectures that build Linux. Furthermore this struct is only used in
> > architectures that do not set ARCH_DEFINE_ENCODE_FPROBE_HEADER which is
> > set only for 64bit architectures (apart from LoongArch).
> > 
> > Both fields are aligned on these architectures so the struct with
> > __packed and without it are the same.
> > 
> > Remove the __packed as it is unnecessary.
> > 
> > Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")  
> 
> NOTE: This is not a fix, just a cleanup or minor update. Or do you have
> any problem with this __packed attribute?
> 
> Unless there is a problem (or any concern), I would like to keep this
> as it is.

There is likely to be a difference on architectures that fault on misaligned
accesses.
On those, gcc will use multiple byte-sized accesses (and a lot of shifts etc.)
for code that accesses those members, because it will assume that the
structure itself can be misaligned.

So you only want __packed on structures that might be misaligned and those
that contain misaligned members.

If the structure is only guaranteed to be 32bit aligned then use __packed
__aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
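
As a rough illustration of the point above, here is a minimal userspace
sketch. The struct names are hypothetical stand-ins for __fprobe_header, and
it assumes GCC or Clang on an ABI where pointers and unsigned long have the
same size (LP64 or ILP32). Packing removes no padding here, so the sizes
match, but the alignment the compiler is allowed to assume differs, and that
is what drives the choice of access instructions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct __fprobe_header; names are illustrative. */
struct hdr_natural {
	void *fp;
	unsigned long size_words;
};

struct hdr_packed {
	void *fp;
	unsigned long size_words;
} __attribute__((packed));

struct hdr_packed_a4 {
	void *fp;
	unsigned long size_words;
} __attribute__((packed, aligned(4)));

/* Assumption: LP64 or ILP32, so there is no padding for packed to remove. */
_Static_assert(sizeof(struct hdr_packed) == sizeof(struct hdr_natural),
	       "packed does not change the size");
_Static_assert(sizeof(struct hdr_packed_a4) == sizeof(struct hdr_natural),
	       "packed + aligned(4) does not change the size");

/* The alignment the compiler may assume is what differs. */
_Static_assert(_Alignof(struct hdr_packed) == 1,
	       "plain packed: only byte alignment may be assumed");
_Static_assert(_Alignof(struct hdr_packed_a4) == 4,
	       "packed + aligned(4): 32-bit alignment may be assumed");

/*
 * On a strict-alignment target the compiler has to load this member in
 * pieces, because h itself is only known to be byte aligned.
 */
unsigned long read_size_words(const struct hdr_packed *h)
{
	return h->size_words;
}
```

Comparing the generated code for read_size_words() with a variant taking
struct hdr_packed_a4 on a strict-alignment target should show the difference
in access width described above.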

-- David

> 
> Thank you,
> 
> > Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> > ---
> >  kernel/trace/fprobe.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > index cc49ebd2a773..21751dcdb7b9 100644
> > --- a/kernel/trace/fprobe.c
> > +++ b/kernel/trace/fprobe.c
> > @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
> >  struct __fprobe_header {
> >  	struct fprobe *fp;
> >  	unsigned long size_words;
> > -} __packed;
> > +};
> >  
> >  #define FPROBE_HEADER_SIZE_IN_LONG	SIZE_IN_LONG(sizeof(struct __fprobe_header))
> >  
> > 
> > ---
> > base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> > change-id: 20260427-topic-fprobe-packed-v7-1-f44f9bbdedf6
> > 
> > Best regards,
> > --  
> > Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> >   
> 
> 



* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 10:41 UTC (permalink / raw)
  To: Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ah-0CyZurn5D1ezY@parvat>

On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > 
> >    __GFP_THISNODE cannot be overloaded to do anything useful here.
> 
> Let me clarify, I meant to say, let's use a nodemask for allocation
> and __GFP_THISNODE gets us to the node we desire, if that is the only
> node. My earlier comment might not have been clear.
> 

I've been testing a stripped-back patch set where I drop all FALLBACK
entries for private nodes (including for itself) and only keep the
NOFALLBACK entry for private nodes.

This effectively isolates the nodes for any allocation without
__GFP_THISNODE.

This also precludes these nodes from ever using non-mbind mempolicies,
which I think is a completely reasonable compromise and something I was
already expecting we would do.

Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
which causes spillage into private nodes because slub allows private
nodes in its mask.  I think this is fixable.

I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
code, etc), but it seems like fully dropping the FALLBACK entries and
requiring __GFP_THISNODE might be sufficient.
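
To make the idea concrete, here is a toy userspace model, not kernel code.
Every name in it (toy_numa, toy_alloc, the node numbering) is hypothetical.
It only mirrors the rule that a request without __GFP_THISNODE walks a
node's FALLBACK list while a request with it uses the single-entry
NOFALLBACK list. Node 2 plays the private node and appears in no FALLBACK
list, including its own:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES     3
#define PRIVATE_NODE 2		/* hypothetical: node 2 is the private node */
#define NO_NODE      (-1)

/* Toy model: number of free pages on each node. */
struct toy_numa {
	int free_pages[NR_NODES];
};

/*
 * FALLBACK list per node. The private node is left out of every list,
 * including its own, so its row is empty.
 */
static const int fallback[NR_NODES][NR_NODES] = {
	{ 0, 1, NO_NODE },
	{ 1, 0, NO_NODE },
	{ NO_NODE, NO_NODE, NO_NODE },
};

/* Returns the node the page came from, or NO_NODE on failure. */
int toy_alloc(struct toy_numa *numa, int nid, bool thisnode)
{
	if (thisnode) {
		/* NOFALLBACK list: only the requested node is tried. */
		if (numa->free_pages[nid] > 0) {
			numa->free_pages[nid]--;
			return nid;
		}
		return NO_NODE;
	}

	for (int i = 0; i < NR_NODES && fallback[nid][i] != NO_NODE; i++) {
		int n = fallback[nid][i];

		if (numa->free_pages[n] > 0) {
			numa->free_pages[n]--;
			return n;
		}
	}
	return NO_NODE;
}
```

In this model an allocation never spills into node 2 even when nodes 0 and 1
are empty, and node 2 is reachable only with thisnode set, which is the
isolation property described above.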

~Gregory


* Re: [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-10  9:53 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <cf2bb24e-9341-4ded-b238-064dca442a92@kernel.org>

On Tue, Jun 09, 2026 at 08:41:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/9/26 18:15, Breno Leitao wrote:
> > On Tue, Jun 09, 2026 at 04:41:01PM +0200, David Hildenbrand (Arm) wrote:

> >> a) HWPoisonKernelOwned: this is not the common style for us to name functions.
> >>
> >> is_kernel_owned_page() or sth like that would do.
> > 
> > Ack, I will rename it to is_kernel_owned_page().
> > 
> > In my defence, most of the functions similar to HWPoisonKernelOwned()
> > have this name format, and I had this discussion earlier (with Lance,
> > I think). Here are the similar function names in that file:
> > 
> >  * HWPoisonHandlable
> >  * PageHWPoisonTakenOff()
> >  * SetPageHWPoisonTakenOff
> 
> Some of these probably date back to our old way of handling page flags and
> things, like PageLRU.
> 
> But we really should stop :)

Ack!

> > I will update in the new version.
> 
> Thanks! Probably best to wait a bit; the merge window is coming up,
> so this will have to wait either way.

no hurry at all,

Thanks for the review,
--breno


* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Markus Schneider-Pargmann @ 2026-06-10  9:20 UTC (permalink / raw)
  To: Masami Hiramatsu, Markus Schneider-Pargmann (The Capable Hub)
  Cc: Steven Rostedt, Mathieu Desnoyers, Heiko Carstens, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260610171740.c30c43c5faee0beac3ad7546@kernel.org>


Hi Masami,

On Wed Jun 10, 2026 at 10:17 AM CEST, Masami Hiramatsu wrote:
> Hi Markus,
>
> Thanks for pinging me.
>
> On Tue, 28 Apr 2026 10:30:29 +0200
> "Markus Schneider-Pargmann (The Capable Hub)" <msp@baylibre.com> wrote:
>
>> fp pointer and unsigned long have the same size on all relevant
>> architectures that build Linux. Furthermore this struct is only used in
>> architectures that do not set ARCH_DEFINE_ENCODE_FPROBE_HEADER which is
>> set only for 64bit architectures (apart from LoongArch).
>> 
>> Both fields are aligned on these architectures so the struct with
>> __packed and without it are the same.
>> 
>> Remove the __packed as it is unnecessary.
>> 
>> Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
>
> NOTE: This is not a fix, just a cleanup or minor update. Or do you have
> any problem with this __packed attribute?

Thanks. Yes, it is not fixing a bug; I can remove the Fixes tag.

>
> Unless there is a problem (or any concern), I would like to keep this
> as it is.

There is currently no problem with __packed in the upstream kernel. I
just thought this would be a good cleanup to remove the unnecessary
attribute. I am working on CHERI architectures where pointers have
capabilities. __packed breaks these capability tags and therefore
doesn't work on CHERI. When I looked into why this struct has a __packed
attribute I didn't see a reason, so I thought this would be a good patch
for upstream as well, even though CHERI is not yet relevant for upstream
Linux.

Best
Markus



* Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()
From: Masami Hiramatsu @ 2026-06-10  8:29 UTC (permalink / raw)
  To: Tengda Wu
  Cc: Petr Mladek, Masami Hiramatsu, Peter Zijlstra, Steven Rostedt,
	Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
	linux-kernel, live-patching
In-Reply-To: <e016536c-df99-4de8-a8fb-6ac50932fd2f@huaweicloud.com>

On Tue, 9 Jun 2026 19:12:41 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:

> 
> 
> On 2026/6/9 17:43, Petr Mladek wrote:
> > Added live-patching mailing list.
> > 
> > On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
> >> The current check in rethook_find_ret_addr() prevents obtaining a return
> >> address when the target task is marked as running. However, this condition
> >> is both insufficient for correctness and unnecessary for its intended
> >> purpose.
> >>
> >> The check is inherently racy: a task can begin running on another CPU
> >> immediately after task_is_running() returns false, potentially leading to
> >> concurrent modification of rethook data structures while the iteration is
> >> in progress.
> >>
> >> Rather than trying to fix this unreliable check deep in the unwinding
> >> path, simply remove it. The iteration is already safe from crashes because
> >> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
> >> even if the iteration goes off the rails and returns invalid information,
> >> it will not crash. Callers that require consistency must provide a safe
> >> context themselves.
> >>
> >> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> >> ---
> >> v3: Improve commit message: clarify safety semantics and document that RCU guarantees no crash.
> >> v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
> >> v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/
> >>
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> >>  	if (WARN_ON_ONCE(!cur))
> >>  		return 0;
> >>  
> >> -	if (tsk != current && task_is_running(tsk))
> >> -		return 0;
> >> -
> > 
> > The description of the function should be updated as well. It still
> > mentions:
> > 
> >  * The @tsk must be 'current' or a task which is not running.
> > 
> > Instead it should explain that it is safe to call the function even
> > on another running task, but the returned address is not reliable
> > then.
> > 
> 
> Oh, I forgot that. Thanks for pointing it out.

Yeah, but it should be updated to explain what the caller needs to do.
For example, the caller should hold RCU, or use it only for current.

Thanks,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-06-10  8:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Masami Hiramatsu, Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <CAEf4BzbsB-PRZRqkz37YK5wYNR=ajZv2N7tEznpZM0rdyYC_xA@mail.gmail.com>

On Tue, Jun 09, 2026 at 09:43:15AM -0700, Andrii Nakryiko wrote:
> On Tue, Jun 9, 2026 at 4:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> > > On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > > redzone area with call instruction storing return address on stack
> > > > where user code may keep temporary data without adjusting rsp.
> > > >
> > > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > > instruction, so we can squeeze another instruction to escape the
> > > > redzone area before doing the call, like:
> > > >
> > > >   lea -0x80(%rsp), %rsp
> > > >   call tramp
> > > >
> > > > Note the lea instruction is used to adjust the rsp register without
> > > > changing the flags.
> > > >
> > > > We use nop10 and following transformation to optimized instructions
> > > > above and back as suggested by Peterz [2].
> > > >
> > > > Optimize path (int3_update_optimize):
> > > >
> > > >   1) Initial state after set_swbp() installed the uprobe:
> > > >       cc 2e 0f 1f 84 00 00 00 00 00
> > > >
> > > >      From offset 0 this is INT3 followed by the tail of the original
> > > >      10-byte NOP.
> > > >
> > > >      After a previous unoptimization bytes 5..9 may still contain the
> > > >      old call instruction, which remains valid for threads already there.
> > > >
> > > >   2) Rewrite the LEA tail and call displacement:
> > > >       cc [8d 64 24 80 e8 d0 d1 d2 d3]
> > > >
> > > >      From offset 0 this traps on the uprobe INT3.  Bytes 1..9 are not
> > > >      executable entry points while byte 0 is trapped.
> > > >
> > > >   3) Publish the first LEA byte:
> > > >       [48] 8d 64 24 80 e8 d0 d1 d2 d3
> > > >
> > > >      From offset 0 this is:
> > > >         lea -0x80(%rsp), %rsp
> > > >         call <uprobe-trampoline>
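
A minimal sketch of how the 10 bytes of the final optimized state could be
assembled. encode_optimized() is a hypothetical helper for illustration, not
the function used in the patch; the LEA bytes are the ones listed above, and
the only added fact is the standard x86 rule that the rel32 of a CALL is
relative to the address of the instruction following it:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LEA_INSN_SIZE  5	/* 48 8d 64 24 80: lea -0x80(%rsp), %rsp */
#define CALL_INSN_SIZE 5	/* e8 followed by a 32-bit displacement */
#define OPT_INSN_SIZE  (LEA_INSN_SIZE + CALL_INSN_SIZE)

/*
 * Hypothetical helper: build the optimized sequence for a probe at vaddr
 * that calls the trampoline at tramp.
 */
void encode_optimized(uint8_t buf[OPT_INSN_SIZE], uint64_t vaddr,
		      uint64_t tramp)
{
	static const uint8_t lea[LEA_INSN_SIZE] = {
		0x48, 0x8d, 0x64, 0x24, 0x80
	};
	/* The CALL ends at vaddr + 10, and rel32 counts from there. */
	uint32_t rel = (uint32_t)(tramp - (vaddr + OPT_INSN_SIZE));

	memcpy(buf, lea, LEA_INSN_SIZE);
	buf[LEA_INSN_SIZE] = 0xe8;
	/* Store the displacement little-endian, independent of the host. */
	buf[LEA_INSN_SIZE + 1] = rel & 0xff;
	buf[LEA_INSN_SIZE + 2] = (rel >> 8) & 0xff;
	buf[LEA_INSN_SIZE + 3] = (rel >> 16) & 0xff;
	buf[LEA_INSN_SIZE + 4] = (rel >> 24) & 0xff;
}
```

In the sequence described above, byte 0 would be written last, after bytes
1..9 are in place behind the INT3; this sketch only shows the final encoding.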
> > > >
> > > > Unoptimize path (int3_update_unoptimize):
> > > >
> > > >   1) Initial optimized state:
> > > >       48 8d 64 24 80 e8 d0 d1 d2 d3
> > > >      Same as 3) above.
> > > >
> > > >   2) Trap new entries before restoring the NOP bytes:
> > > >       [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> > > >
> > > >      From offset 0 this traps. A thread that had already executed the
> > > >      LEA can still reach the intact CALL at offset 5.
> > > >
> > > >   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > >      and byte 5 as CALL.
> > > >       cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > >
> > > >      From offset 0 this still traps. Offset 5 is still the CALL for any
> > > >      thread that was already past the first LEA byte.
> > > >
> > > >   4) Publish the first byte of the original NOP:
> > > >       [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> > > >
> > > >      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > > >      displacement are now only NOP operands.  Offset 5 still decodes as
> > > >      CALL for a thread that was already there.
> > > >
> > > >      There is only a single target uprobe-trampoline for the given nop10
> > > >      instruction address, so the CALL instruction will not be changed across
> > > >      unoptimization/optimization cycles.
> > > >      Therefore, any task that is preempted at the CALL instruction is guaranteed
> > > >      to observe that CALL and not anything else.
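
The unoptimize steps can be replayed on a plain byte array to check the
invariant the description relies on, namely that bytes 5..9 hold the same
CALL in every intermediate state. This is a userspace sketch only;
unoptimize_step() and call_intact() are hypothetical names, and the real
code additionally synchronizes CPUs between writes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NOP10_SIZE  10
#define CALL_OFFSET 5

/* The 10-byte NOP from the commit message. */
static const uint8_t nop10[NOP10_SIZE] = {
	0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Apply step 2, 3 or 4 of the unoptimize path to a 10-byte buffer. */
void unoptimize_step(uint8_t insn[NOP10_SIZE], int step)
{
	switch (step) {
	case 2:	/* trap new entries */
		insn[0] = 0xcc;
		break;
	case 3:	/* restore bytes 1..4 of the NOP, byte 0 stays trapped */
		memcpy(insn + 1, nop10 + 1, CALL_OFFSET - 1);
		break;
	case 4:	/* publish the first byte of the NOP */
		insn[0] = nop10[0];
		break;
	}
}

/* Returns 1 if bytes 5..9 still match the CALL of the optimized state. */
int call_intact(const uint8_t insn[NOP10_SIZE],
		const uint8_t optimized[NOP10_SIZE])
{
	return memcmp(insn + CALL_OFFSET, optimized + CALL_OFFSET,
		      NOP10_SIZE - CALL_OFFSET) == 0;
}
```

Starting from 48 8d 64 24 80 e8 d0 d1 d2 d3 and applying steps 2 to 4 ends
at 66 2e 0f 1f 84 e8 d0 d1 d2 d3, with call_intact() holding after each step.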
> > > >
> > > > Note as explained in [2] we need to use following nop10:
> > > >        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> > > > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> > > >
> > > > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > > > attribute in is_prefix_bad function.
> > > >
> > > > Also changing the uprobe syscall error when called out of uprobe
> > > > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> > > >
> > > > The optimized uprobe performance stays the same:
> > > >
> > > >         uprobe-nop     :    3.129 ± 0.013M/s
> > > >         uprobe-push    :    3.045 ± 0.006M/s
> > > >         uprobe-ret     :    1.095 ± 0.004M/s
> > > >   -->   uprobe-nop10   :    7.170 ± 0.020M/s
> > > >         uretprobe-nop  :    2.143 ± 0.021M/s
> > > >         uretprobe-push :    2.090 ± 0.000M/s
> > > >         uretprobe-ret  :    0.942 ± 0.000M/s
> > > >   -->   uretprobe-nop10:    3.381 ± 0.003M/s
> > > >         usdt-nop       :    3.245 ± 0.004M/s
> > > >   -->   usdt-nop10     :    7.256 ± 0.023M/s
> > > >
> > > > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > > > Assisted-by: Codex:GPT-5.5
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > >  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > > >  1 file changed, 190 insertions(+), 65 deletions(-)
> > > >
> > >
> > > [...]
> > >
> > > > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > > >         smp_text_poke_sync_each_cpu();
> > > >
> > > >         /*
> > > > -        * Write first byte.
> > > > +        * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > +        *    and byte 5 as CALL:
> > > > +        *    cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > +        */
> > > > +       ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > > > +       err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > > > +                          LEA_INSN_SIZE - 1, verify_insn,
> > > > +                          true /* is_register */, false /* do_update_ref_ctr */,
> > >
> > > tbh, it's quite subtle and non-obvious why is_register should be set
> > > to true first two times (and especially that is_register and
> > > do_update_ref_ctr are implicitly connected), not sure how to make it
> > > cleaner, but maybe leave a short comment explaining this twice
> > > register, once unregister sequence?
> >
> > ok, I came up with comment below
> >
> > thanks,
> > jirka
> >
> >
> > ---
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index de544516ea70..92449f34c005 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> >         int err;
> >
> >         /*
> > +        * Note the first two uprobe_write calls use is_register=true, because they
> > +        * are intermediate patching states while the probe is still active.
> 
> this doesn't really explain why is_register=true is the right one. It
> actually doesn't matter as long as do_update_ref_ctr=true, isn't that
> right? So maybe just to avoid a bit of confusion let's pass
> is_register=false and do_update_ref_ctr=false, and in the comment
> explain as you said that it's intermediate update and we don't want to
> update refctr just yet until the very last step?

apart from refctr update there's also different way the concerned
page is managed, IIUC:

with is_register=true we force to get exclusive anonymous page for
the update (or pin the existing one)

with is_register=false we try to zap the private anonymous page and
return the mapping to the original page

there are several comments on this in uprobe_write/__uprobe_write

how about the update below

jirka


---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index de544516ea70..09f5ff71227c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1011,6 +1011,16 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
 	int err;
 
 	/*
+	 * Note the first two uprobe_write calls use is_register=true, because they
+	 * are intermediate patching states while the probe is still active, so
+	 * we force the exclusive anonymous page for the update.
+	 * Also we use do_update_ref_ctr=false because refctr was already updated by
+	 * the initial int3 install.
+	 *
+	 * The last uprobe_write to nop10 instruction is called with is_register=false
+	 * and do_update_ref_ctr=true to trigger the refctr update and to instruct
+	 * uprobe_write to zap the anonymous page if it now matches the file page.
+	 *
 	 * 1) Initial optimized state:
 	 *    48 8d 64 24 80 e8 d0 d1 d2 d3
 	 *


* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Masami Hiramatsu @ 2026-06-10  8:17 UTC (permalink / raw)
  To: Markus Schneider-Pargmann (The Capable Hub)
  Cc: Steven Rostedt, Mathieu Desnoyers, Heiko Carstens, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260428-topic-fprobe-packed-v7-1-v1-1-9abc9b866b4c@baylibre.com>

Hi Markus,

Thanks for pinging me.

On Tue, 28 Apr 2026 10:30:29 +0200
"Markus Schneider-Pargmann (The Capable Hub)" <msp@baylibre.com> wrote:

> fp pointer and unsigned long have the same size on all relevant
> architectures that build Linux. Furthermore this struct is only used in
> architectures that do not set ARCH_DEFINE_ENCODE_FPROBE_HEADER which is
> set only for 64bit architectures (apart from LoongArch).
> 
> Both fields are aligned on these architectures so the struct with
> __packed and without it are the same.
> 
> Remove the __packed as it is unnecessary.
> 
> Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")

NOTE: This is not a fix, just a cleanup or minor update. Or do you have
any problem with this __packed attribute?

Unless there is a problem (or any concern), I would like to keep this
as it is.

Thank you,

> Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> ---
>  kernel/trace/fprobe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index cc49ebd2a773..21751dcdb7b9 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
>  struct __fprobe_header {
>  	struct fprobe *fp;
>  	unsigned long size_words;
> -} __packed;
> +};
>  
>  #define FPROBE_HEADER_SIZE_IN_LONG	SIZE_IN_LONG(sizeof(struct __fprobe_header))
>  
> 
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260427-topic-fprobe-packed-v7-1-f44f9bbdedf6
> 
> Best regards,
> --  
> Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH] task_work: add tracepoints for task_work callbacks
From: Peter Zijlstra @ 2026-06-10  8:00 UTC (permalink / raw)
  To: Imran Khan
  Cc: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260610041408.2461637-1-imran.f.khan@oracle.com>

On Wed, Jun 10, 2026 at 12:14:08PM +0800, Imran Khan wrote:
> task_work tracepoints can be enabled by:
> 
>     echo 1 > /sys/kernel/tracing/events/task_work/enable
> 
> and trace logs would look like:
> 
>     ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a508 func=____fput notify=TWA_RESUME
>     ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a508 ret=0
>     ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a5c8 func=____fput notify=TWA_RESUME
>     ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a5c8 ret=0
>     ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a688 func=____fput notify=TWA_RESUME
>     ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a688 ret=0
>     ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a748 func=____fput notify=TWA_RESUME
>     ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a748 ret=0
>     ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a748 func=____fput
>     ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a748 func=____fput
>     ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a688 func=____fput
>     ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a688 func=____fput
>     ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a5c8 func=____fput
>     ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a5c8 func=____fput
>     ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a508 func=____fput
>     ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a508 func=____fput
> 
> formatted as:
>     target_comm=<comm of target task>
>     target_pid=<pid of target task>
>     work=<callback_head *>
>     func=<callback_head->func>
>     notify=<way to notify the target task>
>     comm=<comm of current task executing func>
>     pid=<pid of current task executing func>
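
For anyone post-processing such logs, a small userspace sketch of pulling
one key=value field out of a line in the format above. trace_field() is a
hypothetical helper, not part of the patch or of any tracing tool; it
assumes fields are separated by single spaces, as in the sample output:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of "key=" from a trace line into out.
 * Returns 0 on success, -1 if the key is absent or the arguments are unusable.
 */
int trace_field(const char *line, const char *key, char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = line;

	if (klen == 0 || outlen == 0)
		return -1;

	while ((p = strstr(p, key)) != NULL) {
		/* Match whole keys only: "pid" must not hit "target_pid". */
		int at_start = (p == line) || (p[-1] == ' ');

		if (at_start && p[klen] == '=') {
			const char *val = p + klen + 1;
			size_t vlen = strcspn(val, " \n");

			if (vlen >= outlen)
				vlen = outlen - 1;	/* truncate to fit */
			memcpy(out, val, vlen);
			out[vlen] = '\0';
			return 0;
		}
		p++;
	}
	return -1;
}
```

Matching the work= value of a task_work_add_request line against the one in
a later task_work_run_start line is one way to pair a queued callback with
its execution.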

And not a single justification for all this nonsense :-( So much ugly
and no gain...


* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Markus Schneider-Pargmann @ 2026-06-10  7:17 UTC (permalink / raw)
  To: Markus Schneider-Pargmann (The Capable Hub), Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Heiko Carstens
  Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <20260428-topic-fprobe-packed-v7-1-v1-1-9abc9b866b4c@baylibre.com>


Hi,

On Tue Apr 28, 2026 at 10:30 AM CEST, Markus Schneider-Pargmann (The Capable Hub) wrote:
> fp pointer and unsigned long have the same size on all relevant
> architectures that build Linux. Furthermore this struct is only used in
> architectures that do not set ARCH_DEFINE_ENCODE_FPROBE_HEADER which is
> set only for 64bit architectures (apart from LoongArch).
>
> Both fields are aligned on these architectures so the struct with
> __packed and without it are the same.
>
> Remove the __packed as it is unnecessary.

Friendly ping on this.

Best
Markus

>
> Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
> Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> ---
>  kernel/trace/fprobe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index cc49ebd2a773..21751dcdb7b9 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
>  struct __fprobe_header {
>  	struct fprobe *fp;
>  	unsigned long size_words;
> -} __packed;
> +};
>  
>  #define FPROBE_HEADER_SIZE_IN_LONG	SIZE_IN_LONG(sizeof(struct __fprobe_header))
>  
>
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260427-topic-fprobe-packed-v7-1-f44f9bbdedf6
>
> Best regards,
> --  
> Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>




* Re: [PATCH 0/2] arm64: ftrace: support DIRECT_CALLS without CALL_OPS
From: Clayton Craft @ 2026-06-10  4:42 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic), Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Catalin Marinas, Will Deacon, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt
  Cc: linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
	Florent Revest, Puranjay Mohan, Xu Kuohai
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-0-4a46f266697f@linux.dev>

On Mon Jun 8, 2026 at 10:19 PM PDT, Jose Fernandez (Anthropic) wrote:
> On arm64, HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is currently selected
> only when DYNAMIC_FTRACE_WITH_CALL_OPS is available. CALL_OPS, in
> turn, is mutually exclusive with kCFI: the pre-function NOPs it needs
> would change the offset of the pre-function type hash (see
> baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")),
> and the compiler support needed to reconcile the two does not exist
> yet.
>
> The result is that a CONFIG_CFI=y arm64 kernel has no
> ftrace direct calls at all, so register_fentry() fails with -ENOTSUPP
> and no BPF trampoline can attach: fentry/fexit, fmod_ret and BPF LSM
> programs are all unavailable. Deployments that want both kCFI
> hardening and BPF-based security monitoring currently have to give
> one of them up. systemd's bpf-restrict-fs feature hits this today:
> https://lore.kernel.org/all/20250610232418.GA3544567@ax162/
>
> CALL_OPS is an optimization for direct calls, not a dependency.
> In-BL-range trampolines are reached by a direct branch without
> consulting the ops pointer, and out-of-range trampolines already
> fall back to ftrace_caller, where the DIRECT_CALLS machinery
> (call_direct_funcs() storing the trampoline in ftrace_regs, the
> ftrace_caller tail-call) is gated on DIRECT_CALLS alone. s390 and
> loongarch ship HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS this way,
> without having CALL_OPS at all.
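
As background for "in-BL-range": the arm64 BL instruction encodes a signed
26-bit offset in units of 4 bytes, so it reaches targets within 128 MiB of
the call site in either direction. A minimal sketch of such a range check
follows; in_bl_range() is a hypothetical name used for illustration and is
not the helper the kernel uses:

```c
#include <assert.h>
#include <stdint.h>

/* BL holds a signed 26-bit word offset: reach is [-128 MiB, +128 MiB). */
#define BL_RANGE_BYTES (INT64_C(128) << 20)

/* Returns 1 if a BL at pc can branch directly to target, else 0. */
int in_bl_range(uint64_t pc, uint64_t target)
{
	int64_t offset = (int64_t)(target - pc);

	/* A64 instructions are 4-byte aligned, so the offset must be too. */
	if (offset & 3)
		return 0;

	return offset >= -BL_RANGE_BYTES && offset < BL_RANGE_BYTES;
}
```

A trampoline that fails this kind of check is the out-of-range case that,
per the description above, falls back to ftrace_caller.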
>
> Patch 1 prepares ftrace_modify_call() to build without CALL_OPS by
> widening its #ifdef and using the existing ftrace_rec_update_ops()
> wrapper (no functional change for current configurations). Patch 2
> drops the CALL_OPS requirement from the DIRECT_CALLS select.
>
> Configurations that keep CALL_OPS (clang !CFI, and GCC without
> CC_OPTIMIZE_FOR_SIZE) are unchanged. We verified this: in an arm64
> clang build, every object file is byte-identical before and after
> the series except ftrace.o itself, and its disassembly is identical.
> CFI builds (and GCC -Os builds) gain working direct calls, with
> out-of-range attachments taking the ftrace_caller dispatch path
> instead of the per-callsite fast path.
>
> We tested on a 6.18.y-based kernel and on this base with clang
> kCFI builds (CONFIG_CFI=y, enforcing) under qemu (TCG, and KVM on an
> arm64 host) and on GB200-based arm64 hardware: fentry/fexit, fmod_ret
> and BPF LSM programs load, attach and execute; the ftrace-direct
> sample modules (including both modify samples, exercising
> ftrace_modify_call()) run cleanly; no CFI violations observed. The
> fentry_test, fexit_test, fentry_fexit, fexit_sleep, fexit_stress,
> modify_return, tracing_struct, lsm and trampoline_count selftests and
> the ftrace direct-call selftests (test.d/direct) pass on the new
> configuration with results identical to a CALL_OPS kernel built from
> the same tree, and a broader test_progs sweep showed no differences
> attributable to this series. Without the series, all of the above
> fail at attach time with -ENOTSUPP.
>
> riscv has the same gap (its DIRECT_CALLS select also requires
> CALL_OPS, and its CALL_OPS is likewise !CFI); if this approach is
> acceptable for arm64 we can follow up there.
>
> ---
> Jose Fernandez (Anthropic) (2):
>       arm64: ftrace: prepare ftrace_modify_call() for use without CALL_OPS
>       arm64: ftrace: allow DIRECT_CALLS without CALL_OPS
>
>  arch/arm64/Kconfig         | 2 +-
>  arch/arm64/kernel/ftrace.c | 5 +++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260607-arm64-ftrace-direct-calls-152230ef7077
>
> Best regards,
> --
> Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>

Thanks for fixing this! I gave it a try on an aarch64 laptop (MacBook M2)
using kernel 7.0.11 with CFI=y && DEBUG_INFO_BTF=y, built with clang. As far as
I can tell it works great: I am finally able to run systemd with BPF support
(e.g., unprivileged nspawn works now).

Tested-by: Clayton Craft <craftyguy@postmarketos.org>


^ permalink raw reply

* [PATCH] task_work: add tracepoints for task_work callbacks
From: Imran Khan @ 2026-06-10  4:14 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, peterz
  Cc: linux-trace-kernel, linux-kernel

task_work tracepoints can be enabled by:

    echo 1 > /sys/kernel/tracing/events/task_work/enable

and trace logs would look like:

    ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a508 func=____fput notify=TWA_RESUME
    ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a508 ret=0
    ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a5c8 func=____fput notify=TWA_RESUME
    ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a5c8 ret=0
    ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a688 func=____fput notify=TWA_RESUME
    ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a688 ret=0
    ... task_work_add_request: target_comm=ls target_pid=227 work=ffff95d20641a748 func=____fput notify=TWA_RESUME
    ... task_work_add_done: target_comm=ls target_pid=227 work=ffff95d20641a748 ret=0
    ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a748 func=____fput
    ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a748 func=____fput
    ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a688 func=____fput
    ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a688 func=____fput
    ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a5c8 func=____fput
    ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a5c8 func=____fput
    ... task_work_run_start: comm=ls pid=227 work=ffff95d20641a508 func=____fput
    ... task_work_run_end: comm=ls pid=227 work=ffff95d20641a508 func=____fput

where the fields are:
    target_comm=<comm of target task>
    target_pid=<pid of target task>
    work=<callback_head *>
    func=<callback_head->func>
    notify=<way to notify the target task>
    comm=<comm of current task executing func>
    pid=<pid of current task executing func>

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
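For readers following the log above: the run events appear in the reverse
of the add order, which is what a singly linked list pushed at the head
produces. The sketch below is a userspace model in C11 atomics, not the
kernel code; struct work_item, queue_add(), queue_run() and record_run()
are hypothetical names. It shows the two points the patch depends on:
copying the callback pointer before the compare-exchange publishes the
entry, and walking the detached list starting from the newest entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for struct callback_head; not the kernel definition. */
struct work_item {
	struct work_item *next;
	void (*func)(struct work_item *);
};

typedef void (*work_fn)(struct work_item *);

/* Hypothetical per-task queue head, modelling task->task_works. */
static _Atomic(struct work_item *) queue_head;

/* Order in which callbacks ran, recorded for inspection. */
static struct work_item *run_order[8];
static size_t run_count;

/* Example callback: remember which item ran. */
static void record_run(struct work_item *w)
{
	assert(run_count < 8);
	run_order[run_count++] = w;
}

/*
 * Push @w at the head of the list. The callback pointer is copied out
 * first, because once the compare-exchange succeeds a consumer may run
 * and release @w. The snapshot is returned so a caller can log it
 * without touching @w again.
 */
static work_fn queue_add(struct work_item *w)
{
	work_fn func = w->func;	/* snapshot before publishing */
	struct work_item *head = atomic_load(&queue_head);

	do {
		w->next = head;
	} while (!atomic_compare_exchange_weak(&queue_head, &head, w));

	return func;
}

/* Detach the whole list and run it from the head: newest entry first. */
static void queue_run(void)
{
	struct work_item *w =
		atomic_exchange(&queue_head, (struct work_item *)NULL);

	while (w) {
		struct work_item *next = w->next;	/* read before func() */
		work_fn func = w->func;

		func(w);
		w = next;
	}
}
```

Calling queue_add() on three items and then queue_run() leaves them in
run_order[] newest first, matching the ordering seen in the log.
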
 include/trace/events/task_work.h | 129 +++++++++++++++++++++++++++++++
 kernel/task_work.c               |  32 +++++++-
 2 files changed, 159 insertions(+), 2 deletions(-)
 create mode 100644 include/trace/events/task_work.h

diff --git a/include/trace/events/task_work.h b/include/trace/events/task_work.h
new file mode 100644
index 0000000000000..e43ffd607e7ec
--- /dev/null
+++ b/include/trace/events/task_work.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM task_work
+
+#if !defined(_TRACE_TASK_WORK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TASK_WORK_H
+
+#include <linux/tracepoint.h>
+#include <linux/task_work.h>
+
+TRACE_DEFINE_ENUM(TWA_NONE);
+TRACE_DEFINE_ENUM(TWA_RESUME);
+TRACE_DEFINE_ENUM(TWA_SIGNAL);
+TRACE_DEFINE_ENUM(TWA_SIGNAL_NO_IPI);
+TRACE_DEFINE_ENUM(TWA_NMI_CURRENT);
+
+#define show_task_work_notify_mode(notify)			\
+	__print_symbolic(notify,				\
+		{ TWA_NONE,           "TWA_NONE" },		\
+		{ TWA_RESUME,         "TWA_RESUME" },		\
+		{ TWA_SIGNAL,         "TWA_SIGNAL" },		\
+		{ TWA_SIGNAL_NO_IPI,  "TWA_SIGNAL_NO_IPI" },	\
+		{ TWA_NMI_CURRENT,    "TWA_NMI_CURRENT" })
+
+/*
+ * task_work_add() is split into two events:
+ *
+ *   task_work_add_request - fires before the cmpxchg loop that enqueues
+ *                           @work, so it is always recorded before any
+ *                           run_start for the same @work.
+ *   task_work:add_done    - fires after the cmpxchg loop terminates,
+ *                           carrying the final ret value.
+ */
+TRACE_EVENT(task_work_add_request,
+
+	TP_PROTO(struct task_struct *task,
+		 struct callback_head *work,
+		 task_work_func_t func,
+		 enum task_work_notify_mode notify),
+
+	TP_ARGS(task, work, func, notify),
+
+	TP_STRUCT__entry(
+		__field(pid_t,		pid)
+		__array(char,		comm, TASK_COMM_LEN)
+		__field(void *,		work)
+		__field(void *,		func)
+		__field(int,		notify)
+	),
+
+	TP_fast_assign(
+		__entry->pid		= task->pid;
+		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__entry->work		= work;
+		__entry->func		= func;
+		__entry->notify		= notify;
+	),
+
+	TP_printk("target_comm=%s target_pid=%d work=%p func=%ps notify=%s",
+		__entry->comm, __entry->pid, __entry->work,
+		__entry->func, show_task_work_notify_mode(__entry->notify))
+);
+
+TRACE_EVENT(task_work_add_done,
+
+	TP_PROTO(struct task_struct *task,
+		 struct callback_head *work,
+		 int ret),
+
+	TP_ARGS(task, work, ret),
+
+	TP_STRUCT__entry(
+		__field(pid_t,		pid)
+		__array(char,		comm, TASK_COMM_LEN)
+		__field(void *,		work)
+		__field(int,		ret)
+	),
+
+	TP_fast_assign(
+		__entry->pid		= task->pid;
+		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__entry->work		= work;
+		__entry->ret		= ret;
+	),
+
+	TP_printk("target_comm=%s target_pid=%d work=%p ret=%d",
+		__entry->comm, __entry->pid, __entry->work, __entry->ret)
+);
+
+DECLARE_EVENT_CLASS(task_work_run_template,
+
+	TP_PROTO(struct task_struct *task,
+		 struct callback_head *work,
+		 task_work_func_t func),
+
+	TP_ARGS(task, work, func),
+
+	TP_STRUCT__entry(
+		__field(pid_t,		pid)
+		__array(char,		comm, TASK_COMM_LEN)
+		__field(void *,		work)
+		__field(void *,		func)
+	),
+
+	TP_fast_assign(
+		__entry->pid		= task->pid;
+		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__entry->work		= work;
+		__entry->func		= func;
+	),
+
+	TP_printk("comm=%s pid=%d work=%p func=%ps",
+		__entry->comm, __entry->pid, __entry->work, __entry->func)
+);
+
+DEFINE_EVENT(task_work_run_template, task_work_run_start,
+	TP_PROTO(struct task_struct *task, struct callback_head *work,
+		 task_work_func_t func),
+	TP_ARGS(task, work, func));
+
+DEFINE_EVENT(task_work_run_template, task_work_run_end,
+	TP_PROTO(struct task_struct *task, struct callback_head *work,
+		 task_work_func_t func),
+	TP_ARGS(task, work, func));
+
+#endif /* _TRACE_TASK_WORK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0f7519f8e7c93..ed04a8c7116de 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -4,6 +4,9 @@
 #include <linux/task_work.h>
 #include <linux/resume_user_mode.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/task_work.h>
+
 static struct callback_head work_exited; /* all we need is ->next == NULL */
 
 #ifdef CONFIG_IRQ_WORK
@@ -60,6 +63,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		  enum task_work_notify_mode notify)
 {
 	struct callback_head *head;
+	task_work_func_t func;
 
 	if (notify == TWA_NMI_CURRENT) {
 		if (WARN_ON_ONCE(task != current))
@@ -70,10 +74,25 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		kasan_record_aux_stack(work);
 	}
 
+	/*
+	 * Snapshot work->func before the cmpxchg below publishes @work.
+	 * After publish, a concurrent task_work_run() on @task may invoke
+	 * the callback and free @work, after which dereferencing work->func
+	 * to fill the tracepoint payload would be a use-after-free.
+	 */
+	func = work->func;
+
+	/*
+	 * Emit add_request BEFORE the cmpxchg loop.
+	 * Tracing here guarantees add_request is seen before any possible
+	 * run_start.
+	 */
+	trace_task_work_add_request(task, work, func, notify);
+
 	head = READ_ONCE(task->task_works);
 	do {
 		if (unlikely(head == &work_exited))
-			return -ESRCH;
+			goto out_esrch;
 		work->next = head;
 	} while (!try_cmpxchg(&task->task_works, &head, work));
 
@@ -100,7 +119,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		break;
 	}
 
+	trace_task_work_add_done(task, work, 0);
 	return 0;
+
+out_esrch:
+	trace_task_work_add_done(task, work, -ESRCH);
+	return -ESRCH;
 }
 
 /**
@@ -229,8 +253,12 @@ void task_work_run(void)
 		raw_spin_unlock_irq(&task->pi_lock);
 
 		do {
+			task_work_func_t func = work->func;
+
 			next = work->next;
-			work->func(work);
+			trace_task_work_run_start(task, work, func);
+			func(work);
+			trace_task_work_run_end(task, work, func);
 			work = next;
 			cond_resched();
 		} while (work);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: sashiko-bot @ 2026-06-10  1:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Petr Mladek, Peter Zijlstra
  Cc: Tengda Wu, Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <20260610013658.1837963-1-wutengda@huaweicloud.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Removing `task_is_running()` from `rethook_find_ret_addr()` exposes stack unwinders to use-after-free, wild pointer dereferences, and list corruption when unwinding running tasks, because `rethook_node` recycling and unwinding lack adequate RCU synchronization.
--

commit 83e1fac9c64377191cc2b41ef9da781c46f1e95a
Author: Tengda Wu <wutengda@huaweicloud.com>

rethook: Remove the running task check in rethook_find_ret_addr()

This commit removes the task_is_running() check from rethook_find_ret_addr().
The commit message states the check is unnecessary because unwinders hold RCU
and rethook_node structures are RCU-freed. Callers are expected to provide
their own safe context.

> The iteration is already safe from crashes because
> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
> even if the iteration goes off the rails and returns invalid information,
> it will not crash. Callers that require consistency must provide a safe
> context themselves.

[ ... ]

> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index 5a8bdf88999a9..1e7fdebe3cd5d 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
[ ... ]
> @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>  	if (WARN_ON_ONCE(!cur))
>  		return 0;
>  
> -	if (tsk != current && task_is_running(tsk))
> -		return 0;
> -
>  	do {
>  		ret = __rethook_find_ret_addr(tsk, cur);

[Severity: High]
This is a pre-existing issue, but does removing this check entirely expose
the stack unwinders to a use-after-free and list corruption?

The commit message mentions that rethook_node structures are RCU-freed, but
looking at rethook_recycle(), the nodes appear to be immediately pushed back
to the lockless objpool without an RCU grace period:

kernel/trace/rethook.c:rethook_recycle() {
    ...
    objpool_push(node, &node->rethook->pool);
}

Additionally, while unwind_next_frame() does hold RCU for a single frame
iteration, stack unwinders like x86 ORC drop the read lock between frames:

arch/x86/kernel/unwind_orc.c:unwind_next_frame() {
    ...
    guard(rcu)();
    ...
}

Because the unwinder retains the list cursor (*cur) across frame iterations
while RCU is dropped, could unwinding a running task allow the target task
to recycle the node?

If the node is recycled and instantly reallocated to another task, the next
call to __rethook_find_ret_addr() would execute:

kernel/trace/rethook.c:__rethook_find_ret_addr() {
    ...
    node = node->next;

on a stale pointer. Could removing this check turn a theoretical race into
an easily triggered use-after-free when reading /proc/<pid>/stack for tasks
executing kretprobes?
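
The concern above can be illustrated with a deterministic, single-threaded
userspace model. This is only a sketch under simplifying assumptions: it
does not reproduce the kernel race, struct node and the hook_*() and
pool_*() helpers are hypothetical stand-ins for rethook_node and objpool,
and the pool memory is static so nothing is actually freed. It shows only
that a cursor saved before a node is recycled can end up describing another
task's entry.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct rethook_node; fields are illustrative. */
struct node {
	struct node *next;
	int owner;		/* which model task currently holds the node */
	unsigned long ret_addr;
};

#define POOL_SIZE 4
static struct node pool_storage[POOL_SIZE];
static struct node *pool_free[POOL_SIZE];
static int pool_top;

/* Fill the free stack with every node in the static storage. */
static void pool_init(void)
{
	pool_top = 0;
	for (int i = 0; i < POOL_SIZE; i++)
		pool_free[pool_top++] = &pool_storage[i];
}

/* Pop a node and link it at the head of @list for task @owner. */
static struct node *hook_push(struct node **list, int owner, unsigned long ret)
{
	struct node *n;

	assert(pool_top > 0);
	n = pool_free[--pool_top];
	n->owner = owner;
	n->ret_addr = ret;
	n->next = *list;
	*list = n;
	return n;
}

/* Unlink the head of @list and return it to the pool immediately. */
static void hook_recycle(struct node **list)
{
	struct node *n = *list;

	assert(n != NULL);
	*list = n->next;
	pool_free[pool_top++] = n;
}

/*
 * Returns the owner seen through a cursor that was saved while walking
 * task 1's list, after task 1 recycled the node and task 2 reused it.
 */
static int stale_cursor_owner(void)
{
	struct node *task1 = NULL, *task2 = NULL;
	struct node *cur;

	pool_init();
	hook_push(&task1, 1, 0x1000);
	cur = task1;			/* walker keeps this across frames */
	hook_recycle(&task1);		/* task 1 returns; node goes back */
	hook_push(&task2, 2, 0x2000);	/* task 2 receives the same node */

	return cur->owner;		/* cursor now describes task 2 */
}
```

In this model stale_cursor_owner() reports task 2 even though the cursor
was taken from task 1's list. Whether the real code can reach such a state
depends on the synchronization questions raised above.
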

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda@huaweicloud.com?part=1

^ permalink raw reply

* [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: Tengda Wu @ 2026-06-10  1:36 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra, Petr Mladek
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel, live-patching, Tengda Wu

The current check in rethook_find_ret_addr() prevents obtaining a return
address when the target task is marked as running. However, this condition
is both insufficient for correctness and unnecessary for its intended
purpose.

The check is inherently racy: a task can begin running on another CPU
immediately after task_is_running() returns false, potentially leading to
concurrent modification of rethook data structures while the iteration is
in progress.

Rather than trying to fix this unreliable check deep in the unwinding
path, simply remove it. The iteration is already safe from crashes because
unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
even if the iteration goes off the rails and returns invalid information,
it will not crash. Callers that require consistency must provide a safe
context themselves.

Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
---
v4: Also update the function description in the comment.
v3: https://lore.kernel.org/all/20260609084953.901576-1-wutengda@huaweicloud.com/
v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/

 kernel/trace/rethook.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 5a8bdf88999a..1e7fdebe3cd5 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -233,9 +233,10 @@ NOKPROBE_SYMBOL(__rethook_find_ret_addr);
  *
  * Find the correct return address modified by a rethook on @tsk in unsigned
  * long type.
- * The @tsk must be 'current' or a task which is not running. @frame is a hint
- * to get the currect return address - which is compared with the
- * rethook::frame field. The @cur is a loop cursor for searching the
+ * @tsk can be any task (any state). If not 'current', the result may be
+ * unreliable. Callers requiring reliability must ensure a safe context.
+ * @frame is a hint to get the correct return address - which is compared with
+ * the rethook::frame field. The @cur is a loop cursor for searching the
  * kretprobe return addresses on the @tsk. The '*@cur' should be NULL at the
  * first call, but '@cur' itself must NOT NULL.
  *
@@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;
 
-	if (tsk != current && task_is_running(tsk))
-		return 0;
-
 	do {
 		ret = __rethook_find_ret_addr(tsk, cur);
 		if (!ret)
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: Shakeel Butt @ 2026-06-10  1:21 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <d676cf78-4b0a-4d6b-b322-f64826aaca75@linux.dev>

On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
> On 6/9/26 5:07 PM, JP Kobryn wrote:
> > On 6/9/26 12:44 AM, Barry Song wrote:
> >> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
> >>>
> >>> LRU add batches can be drained before they reach capacity. This can be a
> >>> source of LRU lock contention, but it is not currently possible to
> >>> attribute these drains to callers with existing tracepoints.
> >>>
> >>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> >>> lru_add batch is drained. This allows tracing to distinguish full drains
> >>> from partial drains and attribute them to the calling stack.
> >>>
> >>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
> >>> per-CPU drain work. This captures the requester stack and target CPU for
> >>> remote drain work. The event is named as a drain-all queue event because
> >>> the queued work can be needed for batches other than lru_add.
> >>>
> >>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
> >>> ---
> >>>  include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
> >>>  mm/swap.c                      |  6 ++++-
> >>>  2 files changed, 45 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
> >>> index 171524d3526d..ea8fc46bedb0 100644
> >>> --- a/include/trace/events/pagemap.h
> >>> +++ b/include/trace/events/pagemap.h
> >>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
> >>>         TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
> >>>  );
> >>>
> >>> +TRACE_EVENT(mm_lru_add_drain,
> >>> +
> >>> +       TP_PROTO(int cpu, unsigned int nr),
> >>> +
> >>> +       TP_ARGS(cpu, nr),
> >>> +
> >>> +       TP_STRUCT__entry(
> >>> +               __field(int,            cpu     )
> >>> +               __field(unsigned int,   nr      )
> >>> +       ),
> >>> +
> >>> +       TP_fast_assign(
> >>> +               __entry->cpu    = cpu;
> >>> +               __entry->nr     = nr;
> >>> +       ),
> >>> +
> >>> +       TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
> >>> +);
> >>> +
> >>> +TRACE_EVENT(mm_lru_drain_all_queue,
> >>> +
> >>> +       TP_PROTO(int target_cpu, bool force_all_cpus),
> >>> +
> >>> +       TP_ARGS(target_cpu, force_all_cpus),
> >>> +
> >>> +       TP_STRUCT__entry(
> >>> +               __field(int,    target_cpu      )
> >>> +               __field(bool,   force_all_cpus  )
> >>> +       ),
> >>> +
> >>> +       TP_fast_assign(
> >>> +               __entry->target_cpu     = target_cpu;
> >>> +               __entry->force_all_cpus = force_all_cpus;
> >>> +       ),
> >>> +
> >>> +       TP_printk("target_cpu=%d force_all_cpus=%s",
> >>> +               __entry->target_cpu,
> >>> +               __entry->force_all_cpus ? "true" : "false")
> >>> +);
> >>> +
> >>>  #endif /* _TRACE_PAGEMAP_H */
> >>>
> >>>  /* This part must be outside protection */
> >>> diff --git a/mm/swap.c b/mm/swap.c
> >>> index 588f50d8f1a8..c385b93582eb 100644
> >>> --- a/mm/swap.c
> >>> +++ b/mm/swap.c
> >>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
> >>>  {
> >>>         struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> >>>         struct folio_batch *fbatch = &fbatches->lru_add;
> >>> +       unsigned int nr_folios_add = folio_batch_count(fbatch);
> >>>
> >>> -       if (folio_batch_count(fbatch))
> >>> +       if (nr_folios_add) {
> >>>                 folio_batch_move_lru(fbatch, lru_add);
> >>> +               trace_mm_lru_add_drain(cpu, nr_folios_add);
> >>> +       }
> >>>
> >>>         fbatch = &fbatches->lru_move_tail;
> >>>         /* Disabling interrupts below acts as a compiler barrier. */
> >>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
> >>>                 if (cpu_needs_drain(cpu)) {
> >>>                         INIT_WORK(work, lru_add_drain_per_cpu);
> >>>                         queue_work_on(cpu, mm_percpu_wq, work);
> >>> +                       trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
> >>
> >> Do you need tracing on each CPU individually, or is tracing the
> >> entire __lru_add_drain_all() invocation sufficient?
> > 
> > I think the latter would be fine. The remote work will invoke the
> > mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
> > the event already has the CPU, we could see where queued drains actually
> > ran.
> 
> Actually if it's just a single invocation and the only event data is the
> force flag, a tracepoint may not even be needed. Other probes can be
> installed on function invocation and read the single argument. I can
> drop this from v2 and keep the single mm_lru_add_drain tracepoint.

No, we do want to trace the callers requesting a drain from all the CPUs. If you
trace just lru_add_drain_cpu() then you will only see that a drain was
requested for a given CPU, with no information about the requester.

Also, as Barry said, I think a single trace event for the whole
__lru_add_drain_all() is good enough.

^ permalink raw reply

* [RFC PATCH v2 7/7] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-10  0:52 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.

Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.

Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.

Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.

Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
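
As a plain-C illustration of the recovery step that the
(foo_timer_data,timer)t typecast performs, here is a minimal userspace
sketch. The container_of() macro below is a simplified equivalent of the
kernel one without its type checks, and struct model_timer,
struct model_timer_data and timer_to_data() are hypothetical names that
only mirror the shape of the sample module's data.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(), without type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct timer_list; only the embedding matters here. */
struct model_timer {
	unsigned long expires;
};

/* Mirrors the layout idea of struct foo_timer_data from the patch. */
struct model_timer_data {
	const char *name;
	struct model_timer timer;
	int counter;		/* plain int here; per-CPU in the patch */
};

/* Recover the enclosing structure from a pointer to its timer member. */
static struct model_timer_data *timer_to_data(struct model_timer *t)
{
	assert(t != NULL);
	return container_of(t, struct model_timer_data, timer);
}
```

Given a struct model_timer_data d, timer_to_data(&d.timer) yields &d, which
is the same step the dynamic probe has to perform before it can reach
->name and ->counter.
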
---
 samples/trace_events/trace-events-sample.c         |   40 +++++++++++++++-
 samples/trace_events/trace-events-sample.h         |   34 ++++++++++++-
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++++++++++++++++++++
 3 files changed, 120 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc

diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index b61766864b54..2a1f73533a38 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -93,6 +93,20 @@ static int simple_thread_fn(void *arg)
 
 static DEFINE_MUTEX(thread_mutex);
 
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+	struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+	get_cpu();
+	trace_foo_timer_fn(data);
+	(*this_cpu_ptr(data->counter))++;
+	put_cpu();
+
+	mod_timer(t, jiffies + HZ);
+}
+
 int foo_bar_reg(void)
 {
 	mutex_lock(&thread_mutex);
@@ -124,9 +138,27 @@ void foo_bar_unreg(void)
 
 static int __init trace_event_init(void)
 {
+	foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+	if (!foo_timer_data)
+		return -ENOMEM;
+
+	foo_timer_data->name = "sample_timer_counter";
+	foo_timer_data->counter = alloc_percpu(int);
+	if (!foo_timer_data->counter) {
+		kfree(foo_timer_data);
+		return -ENOMEM;
+	}
+
+	timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+	mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
 	simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
-	if (IS_ERR(simple_tsk))
-		return -1;
+	if (IS_ERR(simple_tsk)) {
+		timer_shutdown_sync(&foo_timer_data->timer);
+		free_percpu(foo_timer_data->counter);
+		kfree(foo_timer_data);
+		return PTR_ERR(simple_tsk);
+	}
 
 	return 0;
 }
@@ -139,6 +171,10 @@ static void __exit trace_event_exit(void)
 		kthread_stop(simple_tsk_fn);
 	simple_tsk_fn = NULL;
 	mutex_unlock(&thread_mutex);
+
+	timer_shutdown_sync(&foo_timer_data->timer);
+	free_percpu(foo_timer_data->counter);
+	kfree(foo_timer_data);
 }
 
 module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
  */
 
 /*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
  */
 #ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
 #define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
 static inline int __length_of(const int *list)
 {
 	int i;
@@ -270,6 +272,13 @@ enum {
 	TRACE_SAMPLE_BAR = 4,
 	TRACE_SAMPLE_ZOO = 8,
 };
+
+struct foo_timer_data {
+	const char		*name;
+	struct timer_list	timer;
+	int __percpu		*counter;
+};
+
 #endif
 
 /*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
 		  __get_rel_bitmask(bitmask),
 		  __get_rel_cpumask(cpumask))
 );
+
+TRACE_EVENT(foo_timer_fn,
+
+	TP_PROTO(struct foo_timer_data *data),
+
+	TP_ARGS(data),
+
+	TP_STRUCT__entry(
+		__string(	name,			data->name	)
+		__field(	int,			count		)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->count	= *this_cpu_ptr(data->counter);
+	),
+
+	TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
 #endif
 
 /***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..96791e120b7d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+  modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# sample_timer_cb(struct timer_list *t) is called once per second by the sample module.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+  if echo $line | grep -q "foo_timer_fn:"; then
+    NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+    COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+    if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+       MATCH=$((MATCH+1))
+    fi
+  fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+  echo "No matching events found"
+  exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace


^ permalink raw reply related

* [RFC PATCH v2 6/7] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-10  0:52 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When tracing kernel local variables, we sometimes need to read per-CPU
variables. The existing simple dereference is not enough to access them.

Thus, introduce a special this_cpu_read() dereference to read a per-CPU
variable on the current CPU (reading another CPU's copy may race with
updates on that CPU). Also add this_cpu_ptr() to get the address of the
current CPU's copy of a per-CPU variable.

Both work the same way as the kernel per-CPU macros of the same names.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Drop +CPU/+PCPU and introduce this_cpu_read() and this_cpu_ptr().
  - Support these methods with BTF typecasts.
  - Just check that the base address is not NULL instead of calling
    is_kernel_percpu_address().
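
Conceptually, the two new fetchargs resolve a per-CPU base address to the
current CPU's copy and then either return that address or load through it.
The sketch below is a userspace model of that idea only, not the
trace_probe implementation: cpu_area[], model_cpu_offset(),
model_this_cpu_ptr() and model_this_cpu_read() are hypothetical names, the
CPU number is passed in explicitly, and the NULL check mirrors the v2 note
above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

/* One private area per model CPU. */
static int cpu_area[MODEL_NR_CPUS][16];

/* Byte offset from CPU 0's area to @cpu's area (models a per-CPU offset). */
static ptrdiff_t model_cpu_offset(int cpu)
{
	return (char *)cpu_area[cpu] - (char *)cpu_area[0];
}

/* Address of @cpu's copy of the variable at @base, or NULL if @base is NULL. */
static int *model_this_cpu_ptr(int *base, int cpu)
{
	if (!base)
		return NULL;
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return (int *)((char *)base + model_cpu_offset(cpu));
}

/* Value of @cpu's copy; the caller must pass a non-NULL @base. */
static int model_this_cpu_read(int *base, int cpu)
{
	return *model_this_cpu_ptr(base, cpu);
}
```

Here a base pointer into CPU 0's area plays the role of the per-CPU
variable, so model_this_cpu_ptr(base, 2) lands on the same slot in CPU 2's
area and model_this_cpu_read(base, 2) loads that slot's value.
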
---
 Documentation/trace/eprobetrace.rst |    2 +
 Documentation/trace/fprobetrace.rst |    2 +
 Documentation/trace/kprobetrace.rst |    2 +
 kernel/trace/trace.c                |    1 
 kernel/trace/trace_probe.c          |  135 ++++++++++++++++++++++++-----------
 kernel/trace/trace_probe.h          |    2 +
 kernel/trace/trace_probe_tmpl.h     |   30 ++++++--
 7 files changed, 125 insertions(+), 49 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index dcf92d5b4175..6ba70327c1de 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -40,6 +40,8 @@ Synopsis of eprobe_events
   $comm		: Fetch current task comm.
   $current	: Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+  this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..3439bc9bd351 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,8 @@ Synopsis of fprobe-events
   $comm         : Fetch current task comm.
   $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+  this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+  this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
   \IMM          : Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..9ae330eb0a52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,8 @@ Synopsis of kprobe_events
   $comm		: Fetch current task comm.
   $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+  this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e185a006cb08..1d5d6e46dc4d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4332,6 +4332,7 @@ static const char readme_msg[] =
 	"\t           $stack<index>, $stack, $retval, $comm, $current\n"
 #endif
 	"\t           +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+	"\t           this_cpu_read(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
 	"\t     kernel return probes support: $retval, $arg<N>, $comm\n"
 	"\t     type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
 	"\t           b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4bdccd9bd7d1..37ada81b7d46 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -349,6 +349,77 @@ static int parse_trace_event(char *arg, struct fetch_insn *code,
 	return -EINVAL;
 }
 
+/* this_cpu_* parser */
+#define THIS_CPU_PTR_PREFIX "this_cpu_ptr("
+#define THIS_CPU_READ_PREFIX "this_cpu_read("
+#define THIS_CPU_PTR_LEN (sizeof(THIS_CPU_PTR_PREFIX) - 1)
+#define THIS_CPU_READ_LEN (sizeof(THIS_CPU_READ_PREFIX) - 1)
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+		struct fetch_insn **pcode, struct fetch_insn *end,
+		struct traceprobe_parse_context *ctx);
+
+/* handle dereference nested call */
+static inline int handle_dereference(char *arg, struct fetch_insn **pcode,
+	struct fetch_insn *end, struct traceprobe_parse_context *ctx,
+	int deref, long offset)
+{
+	const struct fetch_type *type = find_fetch_type(NULL, ctx->flags);
+	struct fetch_insn *code = *pcode;
+	int cur_offs = ctx->offset;
+	char *tmp;
+	int ret;
+
+	tmp = strrchr(arg, ')');
+	if (!tmp) {
+		trace_probe_log_err(ctx->offset + strlen(arg),
+					DEREF_OPEN_BRACE);
+		return -EINVAL;
+	}
+
+	*tmp = '\0';
+	ret = parse_probe_arg(arg, type, &code, end, ctx);
+	if (ret)
+		return ret;
+	ctx->offset = cur_offs;
+	if (code->op == FETCH_OP_COMM || code->op == FETCH_OP_DATA) {
+		trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
+		return -EINVAL;
+	}
+	code++;
+	if (code == end) {
+		trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+		return -EINVAL;
+	}
+	*pcode = code;
+
+	code->op = deref;
+	code->offset = offset;
+	/* Reset the last type if used */
+	ctx->last_type = NULL;
+	return 0;
+}
+
+static int parse_this_cpu(char *arg, struct fetch_insn **pcode,
+			  struct fetch_insn *end,
+			  struct traceprobe_parse_context *ctx)
+{
+	int deref;
+
+	if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX)) {
+		arg += THIS_CPU_PTR_LEN;
+		ctx->offset += THIS_CPU_PTR_LEN;
+		deref = FETCH_OP_CPU_PTR;
+	} else if (str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+		arg += THIS_CPU_READ_LEN;
+		ctx->offset += THIS_CPU_READ_LEN;
+		deref = FETCH_OP_DEREF_CPU;
+	} else
+		return -EINVAL;
+	return handle_dereference(arg, pcode, end, ctx, deref, 0);
+}
+
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 
 static u32 btf_type_int(const struct btf_type *t)
@@ -928,11 +999,6 @@ static char *find_matched_close_paren(char *s)
 	return NULL;
 }
 
-static int
-parse_probe_arg(char *arg, const struct fetch_type *type,
-		struct fetch_insn **pcode, struct fetch_insn *end,
-		struct traceprobe_parse_context *ctx);
-
 static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			   struct fetch_insn *end,
 			   struct traceprobe_parse_context *ctx)
@@ -958,7 +1024,8 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	*tmp++ = '\0';
 
 	/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
-	if (*tmp == '(') {
+	if (*tmp == '(' || str_has_prefix(tmp, THIS_CPU_PTR_PREFIX) ||
+	    str_has_prefix(tmp, THIS_CPU_READ_PREFIX)) {
 		char *close = find_matched_close_paren(tmp);
 
 		ctx->offset += tmp - arg;
@@ -978,12 +1045,18 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
 			return -E2BIG;
 		}
-		*close = '\0';
 
-		ctx->offset += 1;	/* for the '(' */
-		/* We need to parse the nested one */
-		ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
-				pcode, end, ctx);
+		if (*tmp == '(') {
+			/* Extract the inner argument */
+			*close = '\0';
+			ctx->offset += 1;/* for the '(' */
+			/* Parse the nested one */
+			ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+					pcode, end, ctx);
+		} else {
+			/* this_cpu_* will be parsed in parse_this_cpu() */
+			ret = parse_this_cpu(tmp, pcode, end, ctx);
+		}
 		if (ret < 0)
 			return ret;
 		ctx->nested_level--;
@@ -1448,36 +1521,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 		}
 		ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
 		arg = tmp + 1;
-		tmp = strrchr(arg, ')');
-		if (!tmp) {
-			trace_probe_log_err(ctx->offset + strlen(arg),
-					    DEREF_OPEN_BRACE);
-			return -EINVAL;
-		} else {
-			const struct fetch_type *t2 = find_fetch_type(NULL, ctx->flags);
-			int cur_offs = ctx->offset;
-
-			*tmp = '\0';
-			ret = parse_probe_arg(arg, t2, &code, end, ctx);
-			if (ret)
-				break;
-			ctx->offset = cur_offs;
-			if (code->op == FETCH_OP_COMM ||
-			    code->op == FETCH_OP_DATA) {
-				trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
-				return -EINVAL;
-			}
-			if (++code == end) {
-				trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
-				return -EINVAL;
-			}
-			*pcode = code;
-
-			code->op = deref;
-			code->offset = offset;
-			/* Reset the last type if used */
-			ctx->last_type = NULL;
-		}
+		ret = handle_dereference(arg, pcode, end, ctx, deref, offset);
+		if (ret < 0)
+			return ret;
 		break;
 	case '\\':	/* Immediate value */
 		if (arg[1] == '"') {	/* Immediate string */
@@ -1498,15 +1544,18 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 		ret = handle_typecast(arg, pcode, end, ctx);
 		break;
 	default:
-		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
+		if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX) ||
+		    str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+			ret = parse_this_cpu(arg, pcode, end, ctx);
+		} else if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
 			    !tparg_is_function_return(ctx->flags)) {
 				trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
 				return -EINVAL;
 			}
 			ret = parse_btf_arg(arg, pcode, end, ctx);
-			break;
 		}
+		break;
 	}
 	if (!ret && code->op == FETCH_OP_NOP) {
 		/* Parsed, but do not find fetch method */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 62645e847bd1..33cec2b19041 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -100,6 +100,8 @@ enum fetch_op {
 	// Stage 2 (dereference) op
 	FETCH_OP_DEREF,		/* Dereference: .offset */
 	FETCH_OP_UDEREF,	/* User-space Dereference: .offset */
+	FETCH_OP_DEREF_CPU,	/* Per-CPU Dereference for this CPU */
+	FETCH_OP_CPU_PTR,	/* Per-CPU pointer for this CPU */
 	// Stage 3 (store) ops
 	FETCH_OP_ST_RAW,	/* Raw: .size */
 	FETCH_OP_ST_MEM,	/* Mem: .offset, .size */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f630930288d2..581aa38c66af 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,43 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
 	struct fetch_insn *s3 = NULL;
 	int total = 0, ret = 0, i = 0;
 	u32 loc = 0;
-	unsigned long lval = val;
+	unsigned long lval, llval = val;
 
 stage2:
 	/* 2nd stage: dereference memory if needed */
 	do {
-		if (code->op == FETCH_OP_DEREF) {
-			lval = val;
+		lval = val;
+		switch (code->op) {
+		case FETCH_OP_DEREF:
 			ret = probe_mem_read(&val, (void *)val + code->offset,
 					     sizeof(val));
-		} else if (code->op == FETCH_OP_UDEREF) {
-			lval = val;
+			break;
+		case FETCH_OP_UDEREF:
 			ret = probe_mem_read_user(&val,
 				 (void *)val + code->offset, sizeof(val));
-		} else
 			break;
+		case FETCH_OP_DEREF_CPU:
+		case FETCH_OP_CPU_PTR:
+			if (unlikely(!val)) {
+				ret = -EFAULT;
+				break;
+			}
+			val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+			if (code->op == FETCH_OP_DEREF_CPU)
+				ret = probe_mem_read(&val, (void *)val, sizeof(val));
+			else
+				ret = 0;
+			break;
+		default:
+			lval = llval;
+			goto out;
+		}
 		if (ret)
 			return ret;
+		llval = lval;
 		code++;
 	} while (1);
+out:
 
 	s3 = code;
 stage3:


^ permalink raw reply related

* [RFC PATCH v2 5/7] tracing/probes: Add $current variable support
From: Masami Hiramatsu (Google) @ 2026-06-10  0:52 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since BTF can be used to cast a value to a structure pointer type,
it is useful to add support for a "$current" special variable to
fetchargs.

Users can define a fetcharg that accesses properties of the current
task_struct using BTF info, e.g.:

  $current->cpus_ptr
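
As a rough illustration of how the parser treats the text following
"$", the user-space sketch below mirrors the checks this patch adds
to parse_probe_vars(). The demo_* names and the enum are hypothetical
and exist only for this example; the real code works on fetch_insn
and the parse context instead of returning an enum.

```c
#include <string.h>

/* Hypothetical result codes, for illustration only. */
enum demo_result {
	DEMO_PLAIN_CURRENT,	/* "$current" alone: raw task_struct address */
	DEMO_BTF_FIELD,		/* "$current->field": handed to the BTF parser */
	DEMO_INVALID,		/* anything else, e.g. "$currentX" */
	DEMO_NOT_CURRENT,	/* some other $variable */
};

/*
 * @arg is the text after the leading '$', as in parse_probe_vars().
 * @btf_enabled stands in for IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS).
 */
static enum demo_result demo_classify(const char *arg, int btf_enabled)
{
	static const char name[] = "current";
	size_t len = sizeof(name) - 1;

	if (strncmp(arg, name, len) != 0)
		return DEMO_NOT_CURRENT;
	arg += len;

	/* A following '-' (start of "->") is only valid with BTF support. */
	if (*arg == '-' && btf_enabled)
		return DEMO_BTF_FIELD;
	if (*arg != '\0')
		return DEMO_INVALID;
	return DEMO_PLAIN_CURRENT;
}
```

Note that without BTF support "$current->cpus_ptr" falls through to
the invalid case, which matches the order of the checks in the diff.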

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
  Changes in v2:
   - Support parsing $current in parse_btf_arg().
   - If $current has no typecast, it is automatically cast to task_struct.
   - Report an error if $current is followed by anything other than "-".
---
 Documentation/trace/eprobetrace.rst |    1 +
 Documentation/trace/fprobetrace.rst |    1 +
 Documentation/trace/kprobetrace.rst |    1 +
 kernel/trace/trace.c                |    2 +-
 kernel/trace/trace_probe.c          |   29 ++++++++++++++++++++++++++++-
 kernel/trace/trace_probe.h          |    1 +
 kernel/trace/trace_probe_tmpl.h     |    3 +++
 7 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 680e0af43d5d..dcf92d5b4175 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -38,6 +38,7 @@ Synopsis of eprobe_events
   @ADDR		: Fetch memory at ADDR (ADDR should be in kernel)
   @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
   $comm		: Fetch current task comm.
+  $current	: Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 290a9e6f7491..3392cab016b3 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -50,6 +50,7 @@ Synopsis of fprobe-events
   $argN         : Fetch the Nth function argument. (N >= 1) (\*2)
   $retval       : Fetch return value.(\*3)
   $comm         : Fetch current task comm.
+  $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
   \IMM          : Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a62707e6a9f2..81e4fe38791d 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -53,6 +53,7 @@ Synopsis of kprobe_events
   $argN		: Fetch the Nth function argument. (N >= 1) (\*1)
   $retval	: Fetch return value.(\*2)
   $comm		: Fetch current task comm.
+  $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0e36af853199..e185a006cb08 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4329,7 +4329,7 @@ static const char readme_msg[] =
 	"\t           [(structname[,field])](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
-	"\t           $stack<index>, $stack, $retval, $comm,\n"
+	"\t           $stack<index>, $stack, $retval, $comm, $current\n"
 #endif
 	"\t           +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
 	"\t     kernel return probes support: $retval, $arg<N>, $comm\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 726be9782775..4bdccd9bd7d1 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -718,6 +718,20 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
+	if (strcmp(varname, "$current") == 0) {
+		code->op = FETCH_OP_CURRENT;
+		/* If no typecast is specified for $current, use task_struct by default */
+		if (!ctx->struct_btf) {
+			tid = bpf_find_btf_id("task_struct", BTF_KIND_STRUCT, &ctx->struct_btf);
+			if (tid < 0) {
+				trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
+				return -ENOENT;
+			}
+			ctx->last_struct = btf_type_skip_modifiers(ctx->struct_btf, tid, &tid);
+		}
+		goto found;
+	}
+
 	if (ctx->flags & TPARG_FL_TEVENT) {
 		ret = parse_trace_event(varname, code, ctx);
 		if (ret < 0) {
@@ -756,8 +770,8 @@ static int parse_btf_arg(char *varname,
 			return -ENOENT;
 		}
 	}
-	params = ctx->params;
 
+	params = ctx->params;
 	for (i = 0; i < ctx->nr_params; i++) {
 		const char *name = btf_name_by_offset(ctx->btf, params[i].name_off);
 
@@ -1246,6 +1260,19 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 		return 0;
 	}
 
+	/* $current returns the address of the current task_struct. */
+	if (str_has_prefix(arg, "current")) {
+		arg += strlen("current");
+		if (*arg == '-' && IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS))
+			return parse_btf_arg(orig_arg, pcode, end, ctx);
+
+		if (*arg != '\0')
+			goto inval;
+
+		code->op = FETCH_OP_CURRENT;
+		return 0;
+	}
+
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	len = str_has_prefix(arg, "arg");
 	if (len) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 44f113faae61..62645e847bd1 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -96,6 +96,7 @@ enum fetch_op {
 	FETCH_OP_FOFFS,		/* File offset: .immediate */
 	FETCH_OP_DATA,		/* Allocated data: .data */
 	FETCH_OP_EDATA,		/* Entry data: .offset */
+	FETCH_OP_CURRENT,	/* Current task_struct address */
 	// Stage 2 (dereference) op
 	FETCH_OP_DEREF,		/* Dereference: .offset */
 	FETCH_OP_UDEREF,	/* User-space Dereference: .offset */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f39b37fcdb3b..f630930288d2 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -112,6 +112,9 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
 	case FETCH_OP_DATA:
 		*val = (unsigned long)code->data;
 		break;
+	case FETCH_OP_CURRENT:
+		*val = (unsigned long)current;
+		break;
 	default:
 		return -EILSEQ;
 	}


^ permalink raw reply related

* [RFC PATCH v2 4/7] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-10  0:52 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a field specifier option for the typecast. This works like
container_of() macro.

    (STRUCT[,FIELD[.FIELD2...]])VAR

This is equivalent to:

    container_of(VAR, struct STRUCT, FIELD[.FIELD2...])

For example:

 echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events

This will trace tick_nohz_handler() with its tick_sched::next_tick, which
is obtained from @timer by container_of(timer, struct tick_sched, sched_timer).
So, if you enable both the fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, you will see something like:


          <idle>-0       [002] d.h1.  3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 function=tick_nohz_handler now=3777450051040
          <idle>-0       [002] d.h1.  3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4/0x140) next_tick=3777450000000
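
The offset arithmetic behind the field specifier can be sketched in
user-space C as follows. This is only an illustration under stated
assumptions: struct demo_timer and struct demo_sched are hypothetical
stand-ins for struct hrtimer and struct tick_sched, and the demo_*
helpers are not part of the patch. The point is that the dereference
offset becomes the target member's byte offset minus the byte offset
of the field named in the typecast.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct hrtimer and struct tick_sched. */
struct demo_timer {
	long expires;
};

struct demo_sched {
	int flags;
	struct demo_timer sched_timer;	/* embedded member named in the typecast */
	long next_tick;
};

/* User-space equivalent of the kernel's container_of() macro. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Mimic "(demo_sched,sched_timer)timer->next_tick": read next_tick at
 * offsetof(next_tick) - offsetof(sched_timer) from the member pointer,
 * without ever forming a pointer to the containing structure.
 */
static long demo_read_next_tick(struct demo_timer *timer)
{
	long delta = (long)offsetof(struct demo_sched, next_tick) -
		     (long)offsetof(struct demo_sched, sched_timer);

	return *(long *)((char *)timer + delta);
}
```

Both routes give the same result: going through demo_container_of()
and then reading ->next_tick, or applying the adjusted offset directly
as demo_read_next_tick() does.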


Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Use a byte offset for the typecast field offset instead of a bit offset. This fixes a negative modulo calculation.
  - Check whether a field is specified after the typecast.
  - Reject the typecast field option if it contains an arrow operator.
---
 Documentation/trace/eprobetrace.rst |    5 +
 Documentation/trace/fprobetrace.rst |    8 +-
 Documentation/trace/kprobetrace.rst |    8 +-
 kernel/trace/trace.c                |    4 -
 kernel/trace/trace_probe.c          |  178 ++++++++++++++++++++++++-----------
 kernel/trace/trace_probe.h          |    5 +
 6 files changed, 141 insertions(+), 67 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
   (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that when this is used, the FIELD name does not
-                  need to be prefixed with a '$'.
+                  need to be prefixed with a '$'. ASGN can be specified optionally.
+		  If ASGN is specified, FIELD will be cast to the same offset
+		  position as the ASGN member, rather than to the beginning of
+		  the STRUCT.
   (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
 		  also be used with another FETCHARG instead of FIELD.
 
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                   (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                   and bitfield are supported.
-  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+  (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
-                  ->MEMBER.
-  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                  ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+		  FIELD will be cast to the same offset position as the ASGN member,
+		  rather than to the beginning of the STRUCT.
+  (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
                  also be used with another FETCHARG instead of FIELD.
 
   (\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and bitfield are
                   supported.
-  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+  (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that this is available only when the probe is
-		   on function entry.
-  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+		   on function entry. ASGN can be specified optionally. If ASGN
+		   is specified, FIELD will be cast to the same offset position
+		   as the ASGN member, rather than to the beginning of the STRUCT.
+  (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
                  also be used with another FETCHARG instead of FIELD.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4f70318918c2..0e36af853199 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,8 +4325,8 @@ static const char readme_msg[] =
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
-	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
-	"\t           [(structname)](fetcharg)->field[->field|.field...],\n"
+	"\t           [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+	"\t           [(structname[,field])](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index dba73aaa8ade..726be9782775 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -574,6 +574,65 @@ static int split_next_field(char *varname, char **next_field,
 	return ret;
 }
 
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+				  struct traceprobe_parse_context *ctx)
+{
+	const struct btf_type *type = *ptype;
+	const struct btf_member *field;
+	struct btf *btf = ctx_btf(ctx);
+	char *fieldname = *pfieldname;
+	int bitoffs = 0;
+	u32 anon_offs;
+	char *next;
+	int is_ptr;
+	s32 tid;
+
+	do {
+		next = NULL;
+		is_ptr = split_next_field(fieldname, &next, ctx);
+		if (is_ptr < 0)
+			return is_ptr;
+
+		anon_offs = 0;
+		field = btf_find_struct_member(btf, type, fieldname,
+						&anon_offs);
+		if (IS_ERR(field)) {
+			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+			return PTR_ERR(field);
+		}
+		if (!field) {
+			trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+			return -ENOENT;
+		}
+		/* Add anonymous structure/union offset */
+		bitoffs += anon_offs;
+
+		/* Accumulate the bit-offsets of the dot-connected fields */
+		if (btf_type_kflag(type)) {
+			bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+			ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+		} else {
+			bitoffs += field->offset;
+			ctx->last_bitsize = 0;
+		}
+
+		type = btf_type_skip_modifiers(btf, field->type, &tid);
+		if (!type) {
+			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+			return -EINVAL;
+		}
+
+		if (next)
+			ctx->offset += next - fieldname;
+		fieldname = next;
+	} while (!is_ptr && fieldname);
+
+	*pfieldname = fieldname;
+	*ptype = type;
+
+	return bitoffs;
+}
 /*
  * Parse the field of data structure. The @type must be a pointer type
  * pointing the target data structure type.
@@ -583,16 +642,14 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 			   struct traceprobe_parse_context *ctx)
 {
 	struct fetch_insn *code = *pcode;
-	const struct btf_member *field;
-	u32 bitoffs, anon_offs;
-	bool is_struct = ctx->struct_btf != NULL;
 	struct btf *btf = ctx_btf(ctx);
-	char *next;
-	int is_ptr;
+	bool is_first_field = true;
+	int bitoffs;
 	s32 tid;
 
 	do {
-		if (!is_struct) {
+		/* For the first field of typecast, @type will be the target structure type. */
+		if (!(is_first_field && ctx->struct_btf)) {
 			/* Outer loop for solving arrow operator ('->') */
 			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
 				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -606,60 +663,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return -EINVAL;
 			}
 		}
-		/* Only the first type can skip being a pointer */
-		is_struct = false;
-
-		bitoffs = 0;
-		do {
-			/* Inner loop for solving dot operator ('.') */
-			next = NULL;
-			is_ptr = split_next_field(fieldname, &next, ctx);
-			if (is_ptr < 0)
-				return is_ptr;
-
-			anon_offs = 0;
-			field = btf_find_struct_member(btf, type, fieldname,
-						       &anon_offs);
-			if (IS_ERR(field)) {
-				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-				return PTR_ERR(field);
-			}
-			if (!field) {
-				trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
-				return -ENOENT;
-			}
-			/* Add anonymous structure/union offset */
-			bitoffs += anon_offs;
-
-			/* Accumulate the bit-offsets of the dot-connected fields */
-			if (btf_type_kflag(type)) {
-				bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
-				ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
-			} else {
-				bitoffs += field->offset;
-				ctx->last_bitsize = 0;
-			}
-
-			type = btf_type_skip_modifiers(btf, field->type, &tid);
-			if (!type) {
-				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-				return -EINVAL;
-			}
-
-			ctx->offset += next - fieldname;
-			fieldname = next;
-		} while (!is_ptr && fieldname);
 
+		bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+		if (bitoffs < 0)
+			return bitoffs;
 		if (++code == end) {
 			trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
 			return -EINVAL;
 		}
 		code->op = FETCH_OP_DEREF;	/* TODO: user deref support */
 		code->offset = bitoffs / 8;
+		if (is_first_field && ctx->struct_btf) {
+			/* The first field can be typecasted with field option. */
+			code->offset -= ctx->prefix_byteoffs;
+		}
 		*pcode = code;
 
 		ctx->last_bitoffs = bitoffs % 8;
 		ctx->last_type = type;
+		is_first_field = false;
 	} while (fieldname);
 
 	return 0;
@@ -690,6 +712,11 @@ static int parse_btf_arg(char *varname,
 				    NOSUP_DAT_ARG);
 		return -EOPNOTSUPP;
 	}
+	if (!field && ctx->struct_btf) {
+		/* Typecast without field option is not supported */
+		trace_probe_log_err(ctx->offset, TYPECAST_REQ_FIELD);
+		return -EOPNOTSUPP;
+	}
 
 	if (ctx->flags & TPARG_FL_TEVENT) {
 		ret = parse_trace_event(varname, code, ctx);
@@ -700,8 +727,7 @@ static int parse_btf_arg(char *varname,
 		/* TEVENT is only here via a typecast */
 		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
 			return -EINVAL;
-		type = ctx->last_struct;
-		goto found_type;
+		goto found;
 	}
 
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
@@ -763,7 +789,6 @@ static int parse_btf_arg(char *varname,
 		type = ctx->last_struct;
 	else
 		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
-found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -832,6 +857,45 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
 	return 0;
 }
 
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+	char *field;
+	int ret;
+
+	/* Field option - evaluated later. */
+	field = strchr(casttype, ',');
+	if (field)
+		*field++ = '\0';
+
+	ret = query_btf_struct(casttype, ctx);
+	if (ret < 0) {
+		trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+		return -EINVAL;
+	}
+
+	if (field) {
+		struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+		ctx->offset += field - casttype;
+		ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+		if (ret < 0)
+			return ret;
+		if (ret % 8) {
+			trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+			return -EINVAL;
+		}
+		if (field != NULL) {
+			trace_probe_log_err(ctx->offset + field - casttype, TYPECAST_BAD_ARROW);
+			return -EINVAL;
+		}
+		ctx->prefix_byteoffs = ret / 8;
+		/* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+		ctx->last_struct = type;
+	}
+
+	return ret;
+}
+
 /* Find the matching closing parenthesis for a given opening parenthesis. */
 static char *find_matched_close_paren(char *s)
 {
@@ -915,11 +979,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		nested = true;
 	}
 
-	ret = query_btf_struct(arg + 1, ctx);
-	if (ret < 0) {
-		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
-		return -EINVAL;
-	}
+	ctx->offset = orig_offset + 1;	/* for the '(' */
+	ret = parse_btf_casttype(arg + 1, ctx);
+	if (ret < 0)
+		return ret;
 
 	ctx->offset = orig_offset + tmp - arg;
 	/* If it is nested, tmp points to the field name. */
@@ -927,6 +990,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
 	else
 		ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ctx->prefix_byteoffs = 0;
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 982d32a5df8b..44f113faae61 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -436,6 +436,7 @@ struct traceprobe_parse_context {
 	unsigned int flags;
 	int offset;
 	int nested_level;
+	int prefix_byteoffs;	/* The byte offset of the prefix field of typecast */
 };
 
 #define TRACEPROBE_MAX_NESTED_LEVEL 3
@@ -576,7 +577,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
 	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
 	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
-	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"),
+	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"), \
+	C(TYPECAST_NOT_ALIGNED,	"Typecast field option is not byte-aligned"), \
+	C(TYPECAST_BAD_ARROW,	"Typecast field option does not support -> operator"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [RFC PATCH v2 3/7] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-10  0:51 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When we hit an open parenthesis right after the closing parenthesis of
a typecast, it means we have a nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.

For example, to cast DATA_MEMBER of the VAR structure to a STRUCT pointer
and get the MEMBER value:

   (STRUCT)(VAR->DATA_MEMBER)->MEMBER

Typecasts can also be nested:

    (STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1

Currently the max nest level is limited to 3.
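
Handling the nested form depends on finding the parenthesis that
closes the inner expression. A minimal user-space sketch of that
matching step, using plain depth counting, is shown below. The name
demo_find_close_paren() is hypothetical and this is not the helper
added by the patch; it only illustrates the idea.

```c
#include <stddef.h>

/*
 * Return a pointer to the ')' that matches the '(' at @s, or NULL if
 * @s does not start with '(' or the parentheses are unbalanced.
 */
static const char *demo_find_close_paren(const char *s)
{
	int depth = 0;

	if (*s != '(')
		return NULL;

	for (; *s; s++) {
		if (*s == '(')
			depth++;
		else if (*s == ')' && --depth == 0)
			return s;
	}
	return NULL;
}
```

For the inner part of the nested example, "((STRUCT2)arg->f2)->f1", the
match is the ')' just before "->f1", so the text between the outer
parentheses can be handed to the argument parser again.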

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Fix to skip "->" after the closing parenthesis.
---
 Documentation/trace/eprobetrace.rst |    2 +
 Documentation/trace/fprobetrace.rst |    2 +
 Documentation/trace/kprobetrace.rst |    2 +
 kernel/trace/trace.c                |    1 
 kernel/trace/trace_probe.c          |   76 ++++++++++++++++++++++++++++++++---
 kernel/trace/trace_probe.h          |    7 +++
 6 files changed, 82 insertions(+), 8 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that when this is used, the FIELD name does not
                   need to be prefixed with a '$'.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+		  also be used with another FETCHARG instead of FIELD.
 
 Types
 -----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
   (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                 also be used with another FETCHARG instead of FIELD.
 
   (\*1) This is available only when BTF is enabled.
   (\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that this is available only when the probe is
 		   on function entry.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                 also be used with another FETCHARG instead of FIELD.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
         is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa93e7b01146..4f70318918c2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4326,6 +4326,7 @@ static const char readme_msg[] =
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
+	"\t           [(structname)](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9158f1f22a62..dba73aaa8ade 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -832,10 +832,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
 	return 0;
 }
 
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+	char *p = s;
+	int count = 0;
+
+	while (*p) {
+		if (*p == '(')
+			count++;
+		else if (*p == ')') {
+			if (--count == 0)
+				return p;
+		}
+		p++;
+	}
+	return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+		struct fetch_insn **pcode, struct fetch_insn *end,
+		struct traceprobe_parse_context *ctx);
+
 static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			   struct fetch_insn *end,
 			   struct traceprobe_parse_context *ctx)
 {
+	int orig_offset = ctx->offset;
+	bool nested = false;
 	char *tmp;
 	int ret;
 
@@ -852,19 +877,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 				    DEREF_OPEN_BRACE);
 		return -EINVAL;
 	}
-	*tmp = '\0';
-	ret = query_btf_struct(arg + 1, ctx);
-	*tmp = ')';
+	*tmp++ = '\0';
+
+	/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+	if (*tmp == '(') {
+		char *close = find_matched_close_paren(tmp);
+
+		ctx->offset += tmp - arg;
+		if (!close) {
+			trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+			return -EINVAL;
+		}
+		/* We expect a field access for typecast */
+		if (close[1] != '-' || close[2] != '>') {
+			trace_probe_log_err(ctx->offset + close - tmp + 1,
+					    TYPECAST_REQ_FIELD);
+			return -EINVAL;
+		}
 
+		ctx->nested_level++;
+		if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+			trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+			return -E2BIG;
+		}
+		*close = '\0';
+
+		ctx->offset += 1;	/* for the '(' */
+		/* We need to parse the nested one */
+		ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+				pcode, end, ctx);
+		if (ret < 0)
+			return ret;
+		ctx->nested_level--;
+		clear_struct_btf(ctx);
+
+		tmp = close + 3;/* Skip "->" after closing parenthesis */
+		nested = true;
+	}
+
+	ret = query_btf_struct(arg + 1, ctx);
 	if (ret < 0) {
 		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
 		return -EINVAL;
 	}
 
-	tmp++;
-
-	ctx->offset += tmp - arg;
-	ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ctx->offset = orig_offset + tmp - arg;
+	/* If it is nested, tmp points to the field name. */
+	if (nested)
+		ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+	else
+		ret = parse_btf_arg(tmp, pcode, end, ctx);
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 883938a74aee..982d32a5df8b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -435,8 +435,11 @@ struct traceprobe_parse_context {
 	struct trace_probe *tp;
 	unsigned int flags;
 	int offset;
+	int nested_level;
 };
 
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
 extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
 				      const char *argv,
 				      struct traceprobe_parse_context *ctx);
@@ -571,7 +574,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
 	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
-	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
+	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
+	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
+	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [RFC PATCH v2 2/7] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-10  0:51 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Support the BTF typecast feature on other probe events (but only on
kernel function entry or return).

To support other probe events, we just need to use the last_struct type
when we find a function parameter in parse_btf_arg().

This also updates the <tracefs>/README file to show the struct typecast.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Fix to re-enable typecast on eprobe.
---
 Documentation/trace/fprobetrace.rst |    3 +++
 Documentation/trace/kprobetrace.rst |    4 ++++
 kernel/trace/trace.c                |    2 +-
 kernel/trace/trace_probe.c          |   14 +++++++++-----
 kernel/trace/trace_probe.h          |    5 +++++
 5 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                   (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                   and bitfield are supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER.
 
   (\*1) This is available only when BTF is enabled.
   (\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and bitfield are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER. Note that this is available only when the probe is
+		   on function entry.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
         is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..aa93e7b01146 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,7 +4325,7 @@ static const char readme_msg[] =
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
-	"\t           <argname>[->field[->field|.field...]],\n"
+	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd1caa1f9723..9158f1f22a62 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -759,7 +759,10 @@ static int parse_btf_arg(char *varname,
 	return -ENOENT;
 
 found:
-	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+	if (ctx->struct_btf)
+		type = ctx->last_struct;
+	else
+		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
 found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -836,10 +839,11 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	char *tmp;
 	int ret;
 
-	/* Currently this only works for eprobes */
-	if (!(ctx->flags & TPARG_FL_TEVENT)) {
-		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
-		return -EINVAL;
+	if (!(tparg_is_event_probe(ctx->flags) ||
+	      tparg_is_function_entry(ctx->flags) ||
+	      tparg_is_function_return(ctx->flags))) {
+		trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+		return -EOPNOTSUPP;
 	}
 
 	tmp = strchr(arg, ')');
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 15758cc11fc6..883938a74aee 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -414,6 +414,11 @@ static inline bool tparg_is_function_return(unsigned int flags)
 	return (flags & TPARG_FL_LOC_MASK) == (TPARG_FL_KERNEL | TPARG_FL_RETURN);
 }
 
+static inline bool tparg_is_event_probe(unsigned int flags)
+{
+	return !!(flags & TPARG_FL_TEVENT);
+}
+
 struct traceprobe_parse_context {
 	struct trace_event_call *event;
 	/* BTF related parameters */


^ permalink raw reply related

* [RFC PATCH v2 1/7] tracing/events: Fix to check the simple_tsk_fn creation
From: Masami Hiramatsu (Google) @ 2026-06-10  0:51 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Sashiko pointed out that this sample code does not correctly handle the
failure of thread creation, because kthread_run() can return -errno.

This removes the counter-based thread creation/stop logic and instead
just checks whether simple_tsk_fn was correctly initialized (created).

Link: https://sashiko.dev/#/patchset/178092865666.163648.10457567771536160909.stgit%40devnote2

Fixes: 9cfe06f8cd5c ("tracing/events: add trace-events-sample")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
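
Illustration (not part of the patch): a small userspace sketch of why a
plain NULL check is not enough for a pointer that may carry an encoded
-errno. The helpers below are simplified stand-ins for the kernel's
ERR_PTR()/IS_ERR_OR_NULL(), and struct task and fake_kthread_run() are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Errors are encoded in the top MAX_ERRNO addresses, as in the kernel. */
#define MAX_ERRNO 4095

/* Simplified stand-in for ERR_PTR(): encode a negative errno as a pointer. */
void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Simplified stand-in for IS_ERR_OR_NULL(). */
bool is_err_or_null(const void *ptr)
{
	return !ptr || (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

/* Hypothetical thread handle, standing in for struct task_struct. */
struct task {
	int id;
};

/* Hypothetical creator: returns an error pointer on failure, never NULL. */
struct task *fake_kthread_run(bool fail)
{
	static struct task t = { .id = 1 };

	return fail ? err_ptr(-ENOMEM) : &t;
}
```

A failed fake_kthread_run() returns a non-NULL pointer, so a bare
"if (ptr)" test would treat it as a valid thread; is_err_or_null()
rejects it.
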
 samples/trace_events/trace-events-sample.c |   16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index ecc7db237f2e..b61766864b54 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -92,12 +92,11 @@ static int simple_thread_fn(void *arg)
 }
 
 static DEFINE_MUTEX(thread_mutex);
-static int simple_thread_cnt;
 
 int foo_bar_reg(void)
 {
 	mutex_lock(&thread_mutex);
-	if (simple_thread_cnt++)
+	if (!IS_ERR_OR_NULL(simple_tsk_fn))
 		goto out;
 
 	pr_info("Starting thread for foo_bar_fn\n");
@@ -115,14 +114,11 @@ int foo_bar_reg(void)
 void foo_bar_unreg(void)
 {
 	mutex_lock(&thread_mutex);
-	if (--simple_thread_cnt)
-		goto out;
-
-	pr_info("Killing thread for foo_bar_fn\n");
-	if (simple_tsk_fn)
+	if (!IS_ERR_OR_NULL(simple_tsk_fn)) {
+		pr_info("Killing thread for foo_bar_fn\n");
 		kthread_stop(simple_tsk_fn);
-	simple_tsk_fn = NULL;
- out:
+		simple_tsk_fn = NULL;
+	}
 	mutex_unlock(&thread_mutex);
 }
 
@@ -139,7 +135,7 @@ static void __exit trace_event_exit(void)
 {
 	kthread_stop(simple_tsk);
 	mutex_lock(&thread_mutex);
-	if (simple_tsk_fn)
+	if (!IS_ERR_OR_NULL(simple_tsk_fn))
 		kthread_stop(simple_tsk_fn);
 	simple_tsk_fn = NULL;
 	mutex_unlock(&thread_mutex);


^ permalink raw reply related

* [RFC PATCH v2 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-10  0:51 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest

Hi,

Here is the 2nd version of the series introducing more typecast
features to probe events. The previous version is here:

 https://lore.kernel.org/all/178092865666.163648.10457567771536160909.stgit@devnote2/

In this version, I fixed various problems Sashiko pointed out in review
and added a fix for the sample code. I also dropped +CPU/PCPU() and
introduced this_cpu_read().


Steve introduced the BTF typecast feature for eprobe[1].
This series extends it and adds more options:

1. Expanding BTF typecast to kprobe and fprobe.
   (currently only function entry/exit)

2. Introduce a container_of-like typecast. This adds an "assigned
   member" option to the typecast.

   (STRUCT,MEMBER)VAR->ANOTHER_MEMBER

   This casts VAR to the STRUCT type, treating VAR as the address
   of STRUCT.MEMBER. In C, it is:

   container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER

3. Support nested typecast, e.g.

   (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER

   The nest level is limited to 3.

4. Add a $current variable that points to the "current" task_struct.
   This is useful with typecast, e.g.

   (task_struct)$current->pid

5. per-cpu dereference support.

   Introduce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
   access per-cpu data on the current CPU (accessing other CPUs'
   data is not stable, because it can change).

   You can access the member of per-cpu data structure using
   typecast like:

   (STRUCT)this_cpu_ptr(VAR)->MEMBER


A test script that covers part of these features is also added.
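
To illustrate the container_of-like typecast in item 2, here is a minimal
userspace sketch of the C expression it corresponds to. The macro is a
simplified version of the kernel's container_of() without its type
checking, and struct request, struct list_node and request_id_from_link()
are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of(), without the kernel's type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded member, for illustration only. */
struct list_node {
	struct list_node *next;
};

/* Hypothetical containing structure, for illustration only. */
struct request {
	int id;
	struct list_node link;
};

/*
 * Given only the address of the embedded 'link' member, recover the
 * containing struct request and read another member from it.
 */
int request_id_from_link(struct list_node *node)
{
	assert(node != NULL);
	return container_of(node, struct request, link)->id;
}
```

Under the syntax described in item 2, the same access would presumably be
written as (request,link)VAR->id, where VAR holds the address of the
embedded member.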

[1] https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/


---

Masami Hiramatsu (Google) (7):
      tracing/events: Fix to check the simple_tsk_fn creation
      tracing/probes: Support typecast for various probe events
      tracing/probes: Support nested typecast
      tracing/probes: Support field specifier option for typecast
      tracing/probes: Add $current variable support
      tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
      tracing/probes: Add a new testcase for BTF typecasts


 Documentation/trace/eprobetrace.rst                |   10 
 Documentation/trace/fprobetrace.rst                |   10 
 Documentation/trace/kprobetrace.rst                |   11 +
 kernel/trace/trace.c                               |    6 
 kernel/trace/trace_probe.c                         |  404 +++++++++++++++-----
 kernel/trace/trace_probe.h                         |   18 +
 kernel/trace/trace_probe_tmpl.h                    |   33 +-
 samples/trace_events/trace-events-sample.c         |   56 ++-
 samples/trace_events/trace-events-sample.h         |   34 ++
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 +++
 10 files changed, 509 insertions(+), 124 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH v2] mm/lruvec: trace LRU add drains
From: JP Kobryn @ 2026-06-10  0:29 UTC (permalink / raw)
  To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park
  Cc: linux-kernel, linux-trace-kernel

LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.

Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.

Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
---
v2:
  - removed mm_lru_drain_all tracepoint

v1: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
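
Illustration (not part of the patch): a userspace sketch of how the event
payload could be post-processed to separate partial drains from full ones.
It parses the "cpu=%d nr=%u" text produced by TP_printk() above. The batch
capacity of 31 is an assumption about PAGEVEC_SIZE and should be checked
against the running kernel; struct drain_event, parse_drain_event() and
is_partial_drain() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed lru_add batch capacity (PAGEVEC_SIZE); verify for your kernel. */
#define ASSUMED_BATCH_CAPACITY 31

/* Hypothetical record holding the two fields of mm_lru_add_drain. */
struct drain_event {
	int cpu;
	unsigned int nr;
};

/* Parse the "cpu=%d nr=%u" payload; returns false if a field is missing. */
bool parse_drain_event(const char *payload, struct drain_event *ev)
{
	assert(payload != NULL && ev != NULL);
	return sscanf(payload, "cpu=%d nr=%u", &ev->cpu, &ev->nr) == 2;
}

/* A drain is treated as partial if it moved fewer folios than the capacity. */
bool is_partial_drain(const struct drain_event *ev)
{
	return ev->nr < ASSUMED_BATCH_CAPACITY;
}
```

Combined with a stack trace trigger on the event, a filter like this could
help attribute early drains to their callers.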

 include/trace/events/pagemap.h | 19 +++++++++++++++++++
 mm/swap.c                      |  5 ++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..fc28745507be 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,25 @@ TRACE_EVENT(mm_lru_activate,
 	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
 );
 
+TRACE_EVENT(mm_lru_add_drain,
+
+	TP_PROTO(int cpu, unsigned int nr),
+
+	TP_ARGS(cpu, nr),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu	)
+		__field(unsigned int,	nr	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+		__entry->nr	= nr;
+	),
+
+	TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
+);
+
 #endif /* _TRACE_PAGEMAP_H */
 
 /* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..8fd6808bfc6e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 	struct folio_batch *fbatch = &fbatches->lru_add;
+	unsigned int nr_folios_add = folio_batch_count(fbatch);
 
-	if (folio_batch_count(fbatch))
+	if (nr_folios_add) {
 		folio_batch_move_lru(fbatch, lru_add);
+		trace_mm_lru_add_drain(cpu, nr_folios_add);
+	}
 
 	fbatch = &fbatches->lru_move_tail;
 	/* Disabling interrupts below acts as a compiler barrier. */
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10  0:16 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <3e55e520-b979-4b1c-874c-3b4e5ca629e2@linux.dev>

On 6/9/26 5:07 PM, JP Kobryn wrote:
> On 6/9/26 12:44 AM, Barry Song wrote:
>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>>
>>> LRU add batches can be drained before they reach capacity. This can be a
>>> source of LRU lock contention, but it is not currently possible to
>>> attribute these drains to callers with existing tracepoints.
>>>
>>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>>> lru_add batch is drained. This allows tracing to distinguish full drains
>>> from partial drains and attribute them to the calling stack.
>>>
>>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>>> per-CPU drain work. This captures the requester stack and target CPU for
>>> remote drain work. The event is named as a drain-all queue event because
>>> the queued work can be needed for batches other than lru_add.
>>>
>>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>>> ---
>>>  include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>>>  mm/swap.c                      |  6 ++++-
>>>  2 files changed, 45 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>>> index 171524d3526d..ea8fc46bedb0 100644
>>> --- a/include/trace/events/pagemap.h
>>> +++ b/include/trace/events/pagemap.h
>>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>>>         TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>>>  );
>>>
>>> +TRACE_EVENT(mm_lru_add_drain,
>>> +
>>> +       TP_PROTO(int cpu, unsigned int nr),
>>> +
>>> +       TP_ARGS(cpu, nr),
>>> +
>>> +       TP_STRUCT__entry(
>>> +               __field(int,            cpu     )
>>> +               __field(unsigned int,   nr      )
>>> +       ),
>>> +
>>> +       TP_fast_assign(
>>> +               __entry->cpu    = cpu;
>>> +               __entry->nr     = nr;
>>> +       ),
>>> +
>>> +       TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>>> +);
>>> +
>>> +TRACE_EVENT(mm_lru_drain_all_queue,
>>> +
>>> +       TP_PROTO(int target_cpu, bool force_all_cpus),
>>> +
>>> +       TP_ARGS(target_cpu, force_all_cpus),
>>> +
>>> +       TP_STRUCT__entry(
>>> +               __field(int,    target_cpu      )
>>> +               __field(bool,   force_all_cpus  )
>>> +       ),
>>> +
>>> +       TP_fast_assign(
>>> +               __entry->target_cpu     = target_cpu;
>>> +               __entry->force_all_cpus = force_all_cpus;
>>> +       ),
>>> +
>>> +       TP_printk("target_cpu=%d force_all_cpus=%s",
>>> +               __entry->target_cpu,
>>> +               __entry->force_all_cpus ? "true" : "false")
>>> +);
>>> +
>>>  #endif /* _TRACE_PAGEMAP_H */
>>>
>>>  /* This part must be outside protection */
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 588f50d8f1a8..c385b93582eb 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>>>  {
>>>         struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>>>         struct folio_batch *fbatch = &fbatches->lru_add;
>>> +       unsigned int nr_folios_add = folio_batch_count(fbatch);
>>>
>>> -       if (folio_batch_count(fbatch))
>>> +       if (nr_folios_add) {
>>>                 folio_batch_move_lru(fbatch, lru_add);
>>> +               trace_mm_lru_add_drain(cpu, nr_folios_add);
>>> +       }
>>>
>>>         fbatch = &fbatches->lru_move_tail;
>>>         /* Disabling interrupts below acts as a compiler barrier. */
>>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>>>                 if (cpu_needs_drain(cpu)) {
>>>                         INIT_WORK(work, lru_add_drain_per_cpu);
>>>                         queue_work_on(cpu, mm_percpu_wq, work);
>>> +                       trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>>
>> Do you need tracing on each CPU individually, or is tracing the
>> entire __lru_add_drain_all() invocation sufficient?
> 
> I think the latter would be fine. The remote work will invoke the
> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
> the event already has the CPU, we could see where queued drains actually
> ran.

Actually if it's just a single invocation and the only event data is the
force flag, a tracepoint may not even be needed. Other probes can be
installed on function invocation and read the single argument. I can
drop this from v2 and keep the single mm_lru_add_drain tracepoint.

^ permalink raw reply

