* [PATCH] Documentation: tracing: Add documentation about eprobes
@ 2025-07-28 21:15 Steven Rostedt
2025-07-29 1:02 ` Randy Dunlap
2025-07-29 1:26 ` Masami Hiramatsu
0 siblings, 2 replies; 6+ messages in thread
From: Steven Rostedt @ 2025-07-28 21:15 UTC (permalink / raw)
To: LKML, Linux trace kernel, linux-doc
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Andrew Morton,
Namhyung Kim, Jonathan Corbet
From: Steven Rostedt <rostedt@goodmis.org>
Eprobes was added back in 5.15, but was never documented. It became a
"secret" interface even though it has been a topic of several
presentations. For some reason, when eprobes was added, documenting it
never became a priority, until now.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Documentation/trace/eprobes.rst | 268 ++++++++++++++++++++++++++++++++
Documentation/trace/index.rst | 1 +
2 files changed, 269 insertions(+)
create mode 100644 Documentation/trace/eprobes.rst
diff --git a/Documentation/trace/eprobes.rst b/Documentation/trace/eprobes.rst
new file mode 100644
index 000000000000..c7aa7c867e9e
--- /dev/null
+++ b/Documentation/trace/eprobes.rst
@@ -0,0 +1,268 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Eprobe - Event probes
+=====================
+
+:Author: Steven Rostedt <rostedt@goodmis.org>
+
+- Written for v6.17
+
+Overview
+========
+
+Eprobes are dynamic events that are placed on existing events to eiter
+dereference a field that is a pointer, or simply to limit what fields get
+recorded in the trace event.
+
+Eprobes depend on kprobe events so to enable this feature, build your kernel
+with CONFIG_KPROBE_EVENTS=y.
+
+Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
+
+Synopsis of eprobe_events
+-------------------------
+::
+
+ e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe
+ -:[EGRP/][EEVENT] : Clear a probe
+
+ EGRP : Group name of the new event. If omitted, use "eprobes" for it.
+ EEVENT : Event name. If omitted, the event name is generated and will
+ be the same event name as the event it attached to.
+ GRP : Group name of the event to attach to.
+ EVENT : Event name of the event to attach to.
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ $FIELD : Fetch the value of the event field called FIELD.
+ @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
+ @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
+ $comm : Fetch current task comm.
+ +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ \IMM : Store an immediate value to the argument.
+ NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
+ FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
+ (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+ (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+ "string", "ustring", "symbol", "symstr" and bitfield are
+ supported.
+
+Types
+-----
+The FETCHARGS above is very similar to the kprobe events as described in
+Documentation/trace/kprobetrace.rst.
+
+The difference between eprobes and kprobes FETCHARGS is that eprobes has a
+$FIELD command that returns the content of the event field of the event
+that is attached. Eprobes do not have access to registers, stacks and function
+arguments that kprobes has.
+
+If a field argument is a pointer, it may be dereferenced just like a memory
+address using the FETCHARGS syntax.
+
+
+Attaching to dynamic events
+---------------------------
+
+Note that eprobes may attach to dynamic events as well as to normal events. It
+may attach to a kprobe event, a synthetic event or a fprobe event. This is
+useful if the type of a field needs to be changed. See Example 2 below.
+
+Usage examples
+==============
+
+Example 1
+---------
+
+The basic usage of eprobes is to limit the data that is being recorded into
+the tracing buffer. For example, a common event to trace is the sched_switch
+trace event. That has a format of::
+
+ field:unsigned short common_type; offset:0; size:2; signed:0;
+ field:unsigned char common_flags; offset:2; size:1; signed:0;
+ field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+ field:int common_pid; offset:4; size:4; signed:1;
+
+ field:char prev_comm[16]; offset:8; size:16; signed:0;
+ field:pid_t prev_pid; offset:24; size:4; signed:1;
+ field:int prev_prio; offset:28; size:4; signed:1;
+ field:long prev_state; offset:32; size:8; signed:1;
+ field:char next_comm[16]; offset:40; size:16; signed:0;
+ field:pid_t next_pid; offset:56; size:4; signed:1;
+ field:int next_prio; offset:60; size:4; signed:1;
+
+The first four fields are common to all events and can not be limited. But the
+rest of the event has 60 bytes of information. It records the names of the
+previous and next tasks being scheduled out and in, as well as their pids and
+priorities. It also records the state of the previous task. If only the pids
+of the tasks are of interest, why waste the ring buffer with all the other
+fields?
+
+An eprobe can limit what gets recorded. Note, it does not help in performance,
+as all the fields are recorded in a temporary buffer to process the eprobe.
+::
+
+ # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
+ # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 2721/2721 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0
+ bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141
+ kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085
+ <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082
+ bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141
+ kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085
+ bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0
+ sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0
+ <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90
+ <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16
+ kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0
+ rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0
+ <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16
+ rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0
+
+Note, without adding the "u32" after the prev_pid and next_pid, the values
+would default showing in hexadecimal.
+
+Example 2
+---------
+
+If syscall events are not enabled but the raw syscall are (systemcall
+events are not normal events, but are created from the raw_syscall events
+within the kernel). In order to trace the openat system call, one can create
+an event probe on top of the raw_syscall event:
+::
+
+ # cd /sys/kernel/tracing
+ # cat events/raw_syscalls/sys_enter/format
+ name: sys_enter
+ ID: 395
+ format:
+ field:unsigned short common_type; offset:0; size:2; signed:0;
+ field:unsigned char common_flags; offset:2; size:1; signed:0;
+ field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+ field:int common_pid; offset:4; size:4; signed:1;
+
+ field:long id; offset:8; size:8; signed:1;
+ field:unsigned long args[6]; offset:16; size:48; signed:0;
+
+ print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
+
+From the source code, the sys_openat() has:
+::
+
+ int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
+ {
+ return my_syscall4(__NR_openat, dirfd, path, flags, mode);
+ }
+
+The path is the second parameter, and that is what is wanted.
+::
+
+ # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
+
+This is being run on x86_64 where the word size is 8 bytes and the openat
+systemcall __NR_openat is set at 257.
+::
+
+ # echo 'nr == 257' > events/eprobes/openat/filter
+
+Now enable the event and look at the trace.
+::
+
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ cat-1298 [003] ...2. 2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+ cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+ cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+ cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+
+The filename shows "(fault)". This is likely because the filename has not been
+pulled into memory yet and currently trace events cannot fault in memory that
+is not present. When a eprobe tries to read memory that has not been faulted
+in yet, it will show the "(fault)" text.
+
+To get around this, as the kernel will likely pull in this filename and make
+it present, attaching it to a synthetic event that can pass the address of the
+filename from the entry of the event to the end of the event, this can be used
+to show the filename when the system call returns.
+
+Remove the old eprobe::
+
+ # echo 1 > events/eprobes/openat/enable
+ # echo '-:openat' >> dynamic_events
+
+This time make an eprobe where the address of the filename is saved::
+
+ # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
+
+Create a synthetic event that passes the address of the filename to the
+end of the event::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
+
+Now that the address of the filename has been passed to the end of the
+systemcall, create another eprobe to attach to the exit event to show the
+string::
+
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
+ cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
+ cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
+ cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace"
+
+Example 3
+---------
+
+If syscall trace events are available, the above would not need the first
+eprobe, but it would still need the last one::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+
+And this would produce the same result as Example 2.
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index cc1dc5a087e8..812d9a46dd94 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -37,6 +37,7 @@ the Linux kernel.
kprobetrace
fprobetrace
fprobe
+ eprobes
ring-buffer-design
Event Tracing and Analysis
--
2.47.2
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation: tracing: Add documentation about eprobes
2025-07-28 21:15 [PATCH] Documentation: tracing: Add documentation about eprobes Steven Rostedt
@ 2025-07-29 1:02 ` Randy Dunlap
2025-07-29 15:50 ` Steven Rostedt
2025-07-29 1:26 ` Masami Hiramatsu
1 sibling, 1 reply; 6+ messages in thread
From: Randy Dunlap @ 2025-07-29 1:02 UTC (permalink / raw)
To: Steven Rostedt, LKML, Linux trace kernel, linux-doc
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Andrew Morton,
Namhyung Kim, Jonathan Corbet
On 7/28/25 2:15 PM, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Eprobes was added back in 5.15, but was never documented. It became a
> "secret" interface even though it has been a topic of several
> presentations. For some reason, when eprobes was added, documenting it
> never became a priority, until now.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Documentation/trace/eprobes.rst | 268 ++++++++++++++++++++++++++++++++
> Documentation/trace/index.rst | 1 +
> 2 files changed, 269 insertions(+)
> create mode 100644 Documentation/trace/eprobes.rst
>
> diff --git a/Documentation/trace/eprobes.rst b/Documentation/trace/eprobes.rst
> new file mode 100644
> index 000000000000..c7aa7c867e9e
> --- /dev/null
> +++ b/Documentation/trace/eprobes.rst
> @@ -0,0 +1,268 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Eprobe - Event probes
> +=====================
> +
> +:Author: Steven Rostedt <rostedt@goodmis.org>
> +
> +- Written for v6.17
> +
> +Overview
> +========
> +
> +Eprobes are dynamic events that are placed on existing events to eiter
either
> +dereference a field that is a pointer, or simply to limit what fields get
(preference:) are
> +recorded in the trace event.
> +
> +Eprobes depend on kprobe events so to enable this feature, build your kernel
s/,/;/
> +with CONFIG_KPROBE_EVENTS=y.
> +
> +Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
> +
> +Synopsis of eprobe_events
> +-------------------------
> +::
> +
> + e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe
> + -:[EGRP/][EEVENT] : Clear a probe
> +
> + EGRP : Group name of the new event. If omitted, use "eprobes" for it.
> + EEVENT : Event name. If omitted, the event name is generated and will
> + be the same event name as the event it attached to.
> + GRP : Group name of the event to attach to.
> + EVENT : Event name of the event to attach to.
> +
> + FETCHARGS : Arguments. Each probe can have up to 128 args.
> + $FIELD : Fetch the value of the event field called FIELD.
> + @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
> + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
> + $comm : Fetch current task comm.
> + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
> + \IMM : Store an immediate value to the argument.
> + NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
> + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
> + (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
> + (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
> + "string", "ustring", "symbol", "symstr" and bitfield are
Should bitfield be quoted?
> + supported.
> +
> +Types
> +-----
> +The FETCHARGS above is very similar to the kprobe events as described in
> +Documentation/trace/kprobetrace.rst.
> +
> +The difference between eprobes and kprobes FETCHARGS is that eprobes has a
> +$FIELD command that returns the content of the event field of the event
> +that is attached. Eprobes do not have access to registers, stacks and function
> +arguments that kprobes has.
> +
> +If a field argument is a pointer, it may be dereferenced just like a memory
> +address using the FETCHARGS syntax.
> +
> +
> +Attaching to dynamic events
> +---------------------------
> +
> +Note that eprobes may attach to dynamic events as well as to normal events. It
Don't need "Note that".
> +may attach to a kprobe event, a synthetic event or a fprobe event. This is
I would say: an fprobe event.
> +useful if the type of a field needs to be changed. See Example 2 below.
> +
> +Usage examples
> +==============
> +
> +Example 1
> +---------
> +
> +The basic usage of eprobes is to limit the data that is being recorded into
> +the tracing buffer. For example, a common event to trace is the sched_switch
> +trace event. That has a format of::
> +
> + field:unsigned short common_type; offset:0; size:2; signed:0;
> + field:unsigned char common_flags; offset:2; size:1; signed:0;
> + field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> + field:int common_pid; offset:4; size:4; signed:1;
> +
> + field:char prev_comm[16]; offset:8; size:16; signed:0;
> + field:pid_t prev_pid; offset:24; size:4; signed:1;
> + field:int prev_prio; offset:28; size:4; signed:1;
> + field:long prev_state; offset:32; size:8; signed:1;
> + field:char next_comm[16]; offset:40; size:16; signed:0;
> + field:pid_t next_pid; offset:56; size:4; signed:1;
> + field:int next_prio; offset:60; size:4; signed:1;
> +
> +The first four fields are common to all events and can not be limited. But the
> +rest of the event has 60 bytes of information. It records the names of the
> +previous and next tasks being scheduled out and in, as well as their pids and
> +priorities. It also records the state of the previous task. If only the pids
> +of the tasks are of interest, why waste the ring buffer with all the other
> +fields?
> +
> +An eprobe can limit what gets recorded. Note, it does not help in performance,
> +as all the fields are recorded in a temporary buffer to process the eprobe.
> +::
> +
> + # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
> + # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
> + # cat /sys/kernel/tracing/trace
> +
> + # tracer: nop
> + #
> + # entries-in-buffer/entries-written: 2721/2721 #P:8
> + #
> + # _-----=> irqs-off/BH-disabled
> + # / _----=> need-resched
> + # | / _---=> hardirq/softirq
> + # || / _--=> preempt-depth
> + # ||| / _-=> migrate-disable
> + # |||| / delay
> + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> + # | | | ||||| | |
> + sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0
> + bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141
> + kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085
> + <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082
> + bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141
> + kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085
> + bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0
> + sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0
> + <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90
> + <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16
> + kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0
> + rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0
> + <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16
> + rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0
> +
> +Note, without adding the "u32" after the prev_pid and next_pid, the values
> +would default showing in hexadecimal.
> +
> +Example 2
> +---------
> +
> +If syscall events are not enabled but the raw syscall are (systemcall
syscalls are (system call
> +events are not normal events, but are created from the raw_syscall events
> +within the kernel). In order to trace the openat system call, one can create
^^ not a complete sentence.
> +an event probe on top of the raw_syscall event:
> +::
> +
> + # cd /sys/kernel/tracing
> + # cat events/raw_syscalls/sys_enter/format
> + name: sys_enter
> + ID: 395
> + format:
> + field:unsigned short common_type; offset:0; size:2; signed:0;
> + field:unsigned char common_flags; offset:2; size:1; signed:0;
> + field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> + field:int common_pid; offset:4; size:4; signed:1;
> +
> + field:long id; offset:8; size:8; signed:1;
> + field:unsigned long args[6]; offset:16; size:48; signed:0;
> +
> + print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
> +
> +From the source code, the sys_openat() has:
> +::
> +
> + int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
> + {
> + return my_syscall4(__NR_openat, dirfd, path, flags, mode);
> + }
> +
> +The path is the second parameter, and that is what is wanted.
s/wanted/want/
> +::
> +
> + # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
> +
> +This is being run on x86_64 where the word size is 8 bytes and the openat
> +systemcall __NR_openat is set at 257.
system call
> +::
> +
> + # echo 'nr == 257' > events/eprobes/openat/filter
> +
> +Now enable the event and look at the trace.
> +::
> +
> + # echo 1 > events/eprobes/openat/enable
> + # cat trace
> +
> + # tracer: nop
> + #
> + # entries-in-buffer/entries-written: 4/4 #P:8
> + #
> + # _-----=> irqs-off/BH-disabled
> + # / _----=> need-resched
> + # | / _---=> hardirq/softirq
> + # || / _--=> preempt-depth
> + # ||| / _-=> migrate-disable
> + # |||| / delay
> + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> + # | | | ||||| | |
> + cat-1298 [003] ...2. 2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
> + cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
> + cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
> + cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
> +
> +The filename shows "(fault)". This is likely because the filename has not been
> +pulled into memory yet and currently trace events cannot fault in memory that
> +is not present. When a eprobe tries to read memory that has not been faulted
an eprobe
> +in yet, it will show the "(fault)" text.
> +
> +To get around this, as the kernel will likely pull in this filename and make
> +it present, attaching it to a synthetic event that can pass the address of the
> +filename from the entry of the event to the end of the event, this can be used
> +to show the filename when the system call returns.
> +
> +Remove the old eprobe::
> +
> + # echo 1 > events/eprobes/openat/enable
> + # echo '-:openat' >> dynamic_events
> +
> +This time make an eprobe where the address of the filename is saved::
> +
> + # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
> +
> +Create a synthetic event that passes the address of the filename to the
> +end of the event::
> +
> + # echo 's:filename u64 file' >> dynamic_events
> + # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
> + # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
> +
> +Now that the address of the filename has been passed to the end of the
> +systemcall, create another eprobe to attach to the exit event to show the
system call,
> +string::
> +
> + # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
> + # echo 1 > events/eprobes/openat/enable
> + # cat trace
> +
> + # tracer: nop
> + #
> + # entries-in-buffer/entries-written: 4/4 #P:8
> + #
> + # _-----=> irqs-off/BH-disabled
> + # / _----=> need-resched
> + # | / _---=> hardirq/softirq
> + # || / _--=> preempt-depth
> + # ||| / _-=> migrate-disable
> + # |||| / delay
> + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> + # | | | ||||| | |
> + cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
> + cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
> + cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
> + cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace"
> +
> +Example 3
> +---------
> +
> +If syscall trace events are available, the above would not need the first
> +eprobe, but it would still need the last one::
> +
> + # echo 's:filename u64 file' >> dynamic_events
> + # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
> + # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
> + # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
> + # echo 1 > events/eprobes/openat/enable
> +
> +And this would produce the same result as Example 2.
Thanks for the new documentation.
--
~Randy
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation: tracing: Add documentation about eprobes
2025-07-28 21:15 [PATCH] Documentation: tracing: Add documentation about eprobes Steven Rostedt
2025-07-29 1:02 ` Randy Dunlap
@ 2025-07-29 1:26 ` Masami Hiramatsu
2025-07-29 15:54 ` Steven Rostedt
1 sibling, 1 reply; 6+ messages in thread
From: Masami Hiramatsu @ 2025-07-29 1:26 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux trace kernel, linux-doc, Masami Hiramatsu,
Mathieu Desnoyers, Mark Rutland, Andrew Morton, Namhyung Kim,
Jonathan Corbet
On Mon, 28 Jul 2025 17:15:22 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Eprobes was added back in 5.15, but was never documented. It became a
> "secret" interface even though it has been a topic of several
> presentations. For some reason, when eprobes was added, documenting it
> never became a priority, until now.
>
Thanks for the document!
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Documentation/trace/eprobes.rst | 268 ++++++++++++++++++++++++++++++++
> Documentation/trace/index.rst | 1 +
> 2 files changed, 269 insertions(+)
> create mode 100644 Documentation/trace/eprobes.rst
>
> diff --git a/Documentation/trace/eprobes.rst b/Documentation/trace/eprobes.rst
> new file mode 100644
> index 000000000000..c7aa7c867e9e
> --- /dev/null
> +++ b/Documentation/trace/eprobes.rst
BTW, can't you rename it as 'eprobetrace.rst' as same as others?
I usually name the doc of "a probe feature which provides only in-kernel
APIs" as '*probe.rst' and the doc of "a probe *event* feature which can
controlled via tracefs interface" as '*probetrace.rst'.
> @@ -0,0 +1,268 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Eprobe - Event probes
What about below title?
Eprobe - Event-based Probe Tracing
> +=====================
> +
> +:Author: Steven Rostedt <rostedt@goodmis.org>
> +
> +- Written for v6.17
> +
> +Overview
> +========
> +
> +Eprobes are dynamic events that are placed on existing events to eiter
> +dereference a field that is a pointer, or simply to limit what fields get
> +recorded in the trace event.
> +
> +Eprobes depend on kprobe events so to enable this feature, build your kernel
> +with CONFIG_KPROBE_EVENTS=y.
Is this correct? It seems that eprobe event only depends on event trigger
(by implementation. Actually we should fix the kernel/trace/Kconfig.)
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/tree/kernel/trace/trace_eprobe.c?h=trace-v6.16-rc5#n576
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation: tracing: Add documentation about eprobes
2025-07-29 1:02 ` Randy Dunlap
@ 2025-07-29 15:50 ` Steven Rostedt
2025-07-29 16:17 ` Randy Dunlap
0 siblings, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2025-07-29 15:50 UTC (permalink / raw)
To: Randy Dunlap
Cc: LKML, Linux trace kernel, linux-doc, Masami Hiramatsu,
Mathieu Desnoyers, Mark Rutland, Andrew Morton, Namhyung Kim,
Jonathan Corbet
On Mon, 28 Jul 2025 18:02:37 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:
> > +Overview
> > +========
> > +
> > +Eprobes are dynamic events that are placed on existing events to eiter
>
> either
>
> > +dereference a field that is a pointer, or simply to limit what fields get
>
> (preference:) are
>
> > +recorded in the trace event.
> > +
> > +Eprobes depend on kprobe events so to enable this feature, build your kernel
>
> s/,/;/
OK.
> > + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
> > + (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
> > + (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
> > + "string", "ustring", "symbol", "symstr" and bitfield are
>
> Should bitfield be quoted?
Hmm, probably. And we should fix kprobetrace.rst as that's where I copied it from.
> > +
> > +Attaching to dynamic events
> > +---------------------------
> > +
> > +Note that eprobes may attach to dynamic events as well as to normal events. It
>
> Don't need "Note that".
OK.
>
> > +may attach to a kprobe event, a synthetic event or a fprobe event. This is
>
> I would say: an fprobe event.
>
OK.
> > +
> > +Example 2
> > +---------
> > +
> > +If syscall events are not enabled but the raw syscall are (systemcall
>
> syscalls are (system call
>
> > +events are not normal events, but are created from the raw_syscall events
> > +within the kernel). In order to trace the openat system call, one can create
>
> ^^ not a complete sentence.
Ah, I'm thinking that "This example is for the case that syscall events
are not enabled..."
But it came out as the above. Will fix.
>
>
> > +an event probe on top of the raw_syscall event:
> > +::
> > +
> > + # cd /sys/kernel/tracing
> > + # cat events/raw_syscalls/sys_enter/format
> > + name: sys_enter
> > + ID: 395
> > + format:
> > + field:unsigned short common_type; offset:0; size:2; signed:0;
> > + field:unsigned char common_flags; offset:2; size:1; signed:0;
> > + field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> > + field:int common_pid; offset:4; size:4; signed:1;
> > +
> > + field:long id; offset:8; size:8; signed:1;
> > + field:unsigned long args[6]; offset:16; size:48; signed:0;
> > +
> > + print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
> > +
> > +From the source code, the sys_openat() has:
> > +::
> > +
> > + int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
> > + {
> > + return my_syscall4(__NR_openat, dirfd, path, flags, mode);
> > + }
> > +
> > +The path is the second parameter, and that is what is wanted.
>
> s/wanted/want/
Really? That sounds funny to me:
The path is the second parameter and that is what is want.
??
>
> > +::
> > +
> > + # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
> > +
> > +This is being run on x86_64 where the word size is 8 bytes and the openat
> > +systemcall __NR_openat is set at 257.
>
> system call
OK.
> > +The filename shows "(fault)". This is likely because the filename has not been
> > +pulled into memory yet and currently trace events cannot fault in memory that
> > +is not present. When a eprobe tries to read memory that has not been faulted
>
> an eprobe
OK.
> > +Now that the address of the filename has been passed to the end of the
> > +systemcall, create another eprobe to attach to the exit event to show the
>
> system call,
OK.
>
> Thanks for the new documentation.
>
It was a long time coming :-p
-- Steve
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation: tracing: Add documentation about eprobes
2025-07-29 1:26 ` Masami Hiramatsu
@ 2025-07-29 15:54 ` Steven Rostedt
0 siblings, 0 replies; 6+ messages in thread
From: Steven Rostedt @ 2025-07-29 15:54 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: LKML, Linux trace kernel, linux-doc, Mathieu Desnoyers,
Mark Rutland, Andrew Morton, Namhyung Kim, Jonathan Corbet
On Tue, 29 Jul 2025 10:26:36 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > diff --git a/Documentation/trace/eprobes.rst b/Documentation/trace/eprobes.rst
> > new file mode 100644
> > index 000000000000..c7aa7c867e9e
> > --- /dev/null
> > +++ b/Documentation/trace/eprobes.rst
>
> BTW, can't you rename it as 'eprobetrace.rst' as same as others?
> I usually name the doc of "a probe feature which provides only in-kernel
> APIs" as '*probe.rst' and the doc of "a probe *event* feature which can
> controlled via tracefs interface" as '*probetrace.rst'.
Sure, no problem. Will update.
>
> > @@ -0,0 +1,268 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================
> > +Eprobe - Event probes
>
> What about below title?
>
> Eprobe - Event-based Probe Tracing
OK.
>
> > +=====================
> > +
> > +:Author: Steven Rostedt <rostedt@goodmis.org>
> > +
> > +- Written for v6.17
> > +
> > +Overview
> > +========
> > +
> > +Eprobes are dynamic events that are placed on existing events to eiter
> > +dereference a field that is a pointer, or simply to limit what fields get
> > +recorded in the trace event.
> > +
> > +Eprobes depend on kprobe events so to enable this feature, build your kernel
> > +with CONFIG_KPROBE_EVENTS=y.
>
> Is this correct? It seems that eprobe event only depends on event trigger
> (by implementation. Actually we should fix the kernel/trace/Kconfig.)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/tree/kernel/trace/trace_eprobe.c?h=trace-v6.16-rc5#n576
I noticed there was no config that selected it, and it only gets
selected by CONFIG_PROBE_EVENTS which requires something else to select
it. I guess adding an EPROBE config would be useful too. Let me do that
too.
Thanks for the review.
-- Steve
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation: tracing: Add documentation about eprobes
2025-07-29 15:50 ` Steven Rostedt
@ 2025-07-29 16:17 ` Randy Dunlap
0 siblings, 0 replies; 6+ messages in thread
From: Randy Dunlap @ 2025-07-29 16:17 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux trace kernel, linux-doc, Masami Hiramatsu,
Mathieu Desnoyers, Mark Rutland, Andrew Morton, Namhyung Kim,
Jonathan Corbet
Hi,
>>> +From the source code, the sys_openat() has:
>>> +::
>>> +
>>> + int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
>>> + {
>>> + return my_syscall4(__NR_openat, dirfd, path, flags, mode);
>>> + }
>>> +
>>> +The path is the second parameter, and that is what is wanted.
>>
>> s/wanted/want/
>
> Really? That sounds funny to me:
>
> The path is the second parameter and that is what is want.
>
> ??
Yeah, I don't know how I did that. Sorry.
--
~Randy
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-07-29 16:17 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-28 21:15 [PATCH] Documentation: tracing: Add documentation about eprobes Steven Rostedt
2025-07-29 1:02 ` Randy Dunlap
2025-07-29 15:50 ` Steven Rostedt
2025-07-29 16:17 ` Randy Dunlap
2025-07-29 1:26 ` Masami Hiramatsu
2025-07-29 15:54 ` Steven Rostedt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).