From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
To: Steven Rostedt <rostedt@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
Masami Hiramatsu <mhiramat@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ian Rogers <irogers@google.com>,
Namhyung Kim <namhyung@kernel.org>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Jiri Olsa <jolsa@kernel.org>,
Douglas Raillard <douglas.raillard@arm.com>
Subject: Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
Date: Tue, 18 Nov 2025 12:08:21 +0900
Message-ID: <20251118120821.0c47ef684b53d5d9a2d6dc83@kernel.org>
In-Reply-To: <20251118002950.680329246@kernel.org>
Hi Steve,
Thanks for the great idea!
On Mon, 17 Nov 2025 19:29:50 -0500
Steven Rostedt <rostedt@kernel.org> wrote:
>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
> event_cache_misses
> event_cpu_cycles
> func-cache-misses
> func-cpu-cycles
> funcgraph-cache-misses
> funcgraph-cpu-cycles
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventually work with many more perf events than just cache-misses
> and cpu-cycles, using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
> set_event_perf, set_ftrace_perf, set_fgraph_perf
What about adding a global `trigger` action file that users can write
these "perf" actions into? It would work like the stacktrace trigger
for events. (Maybe we can move stacktrace/user-stacktrace
into it too.)
For pre-defined/software counters:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
For some hardware event sources (see /sys/bus/event_source/devices/):
# echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
# echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
If we need to set those counters for tracers and events separately,
we can add `events/trigger` and `tracer-trigger` files.
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
To disable counters, we can use '!', the same way as event triggers.
# echo '!perf:cpu_cycles' > trigger
To add multiple counters, connect them with ':'
(or we could allow appending new perf counters).
This allows users to set perf counter options for each event.
Maybe we should also eventually move the 'stacktrace'/'userstacktrace'
option flags to it.
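As a rough illustration of the string format suggested above, here is a minimal user-space sketch of how "[!]perf:<spec>[:<spec>...]" could be split up. Everything in it (`struct perf_trigger`, `parse_perf_trigger()`, `MAX_COUNTERS`) is hypothetical and not part of the patches; it only assumes a leading '!' means removal and ':' separates counter specs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_COUNTERS 8	/* arbitrary limit for this sketch */

/* Hypothetical parsed form of one line written to the trigger file. */
struct perf_trigger {
	bool remove;			/* a leading '!' was present */
	size_t nr;			/* number of counter specs found */
	const char *spec[MAX_COUNTERS];	/* pointers into the caller's buffer */
};

/*
 * Parse "[!]perf:<spec>[:<spec>...]" in place.
 * Returns 0 on success, -1 on malformed input.
 * The buffer is modified: each ':' separator becomes '\0'.
 */
static int parse_perf_trigger(char *buf, struct perf_trigger *t)
{
	char *p = buf;

	memset(t, 0, sizeof(*t));
	if (*p == '!') {
		t->remove = true;
		p++;
	}
	if (strncmp(p, "perf:", 5) != 0)
		return -1;
	p += 5;

	for (;;) {
		char *end = strchr(p, ':');

		if (end)
			*end = '\0';
		/* Reject empty specs ("perf:", "perf:a:") and overlong lists. */
		if (*p == '\0' || t->nr == MAX_COUNTERS)
			return -1;
		t->spec[t->nr++] = p;
		if (!end)
			return 0;
		p = end + 1;
	}
}
```

Specs such as "cstate_core.c3-residency" or "my_counter=pmu/config=M,config1=N" contain no ':', so they pass through as a single spec. A real implementation would also need to strip the trailing newline that echo appends.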
>
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
> is_vmalloc_addr() {
> /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
Just a style question: would this mean the first line is for function entry
and the second one is for function return?
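If that reading is right, the per-call cost is the return sample minus the entry sample. A minimal sketch of that subtraction, assuming the 56-bit counters described in the cover letter; `struct fgraph_sample`, `counter_delta()` and `sample_delta()` are hypothetical names for illustration, not anything from the patches:

```c
#include <stdint.h>

#define COUNTER_BITS 56
#define COUNTER_MASK ((1ULL << COUNTER_BITS) - 1)

/* Hypothetical pair of raw counter values printed at one point in a function. */
struct fgraph_sample {
	uint64_t cpu_cycles;
	uint64_t cache_misses;
};

/* Modulo-2^56 difference; correct as long as the counter wrapped at most once. */
static uint64_t counter_delta(uint64_t entry, uint64_t ret)
{
	return (ret - entry) & COUNTER_MASK;
}

/* Per-counter difference between the entry sample and the return sample. */
static struct fgraph_sample sample_delta(struct fgraph_sample entry,
					 struct fgraph_sample ret)
{
	struct fgraph_sample d = {
		.cpu_cycles = counter_delta(entry.cpu_cycles, ret.cpu_cycles),
		.cache_misses = counter_delta(entry.cache_misses, ret.cache_misses),
	};

	return d;
}
```

For the is_vmalloc_addr() values quoted above, this gives 3934 cpu_cycles and 1477 cache_misses. Masking the difference has the same effect as setting bit 56 in the end value when start > end.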
>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
> 2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
> 416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the value is 56 bits and
> when calculating the difference between start and end do something like:
>
> if (start > end)
> end |= 1ULL << 56;
>
> delta = end - start;
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
> cpu_cycles:1
> cache_misses:2
> [..]
Looks good to me. I think the predefined events from `perf list`
will be there and have fixed numbers.
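For tooling, decoding one recorded value could look roughly like the sketch below. It assumes the 8-bit id sits in the top bits of the 64-bit value (the cover letter does not say which end), and it uses the two example ids listed above. All names here (`perf_entry_id()`, `perf_entry_count()`, `perf_id_to_name()`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PERF_ID_SHIFT 56
#define PERF_CNT_MASK ((1ULL << PERF_ID_SHIFT) - 1)

/* ASSUMPTION: the id is in the top 8 bits, the counter in the low 56 bits. */
static unsigned int perf_entry_id(uint64_t raw)
{
	return (unsigned int)(raw >> PERF_ID_SHIFT);
}

static uint64_t perf_entry_count(uint64_t raw)
{
	return raw & PERF_CNT_MASK;
}

/* Hypothetical table a tool would build from the "name:id" lines. */
struct perf_id_name {
	unsigned int id;
	const char *name;
};

static const struct perf_id_name perf_ids[] = {
	{ 1, "cpu_cycles" },
	{ 2, "cache_misses" },
};

/* Map a recorded id back to its name; returns NULL for an unknown id. */
static const char *perf_id_to_name(unsigned int id)
{
	for (size_t i = 0; i < sizeof(perf_ids) / sizeof(perf_ids[0]); i++)
		if (perf_ids[i].id == id)
			return perf_ids[i].name;
	return NULL;
}
```

With fixed numbers for the predefined events, a table like this could be static in the tool; anything registered dynamically would still have to be read from available_perf_events at record time.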
Thank you,
>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
> # cd /sys/kernel/tracing
> # echo 1 > options/event_cpu_cycles
> # echo 1 > options/event_cache_misses
> # echo 1 > events/syscalls/enable
> # cat trace
> [..]
> bash-995 [007] ..... 98.255252: sys_write -> 0x2
> bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
> bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
> bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
> bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
> bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
> bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
> bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
> bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
> bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
> bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
> bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
> bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
>
>
> Steven Rostedt (3):
> tracing: Add perf events
> ftrace: Add perf counters to function tracing
> fgraph: Add perf counters to function graph tracer
>
> ----
> include/linux/trace_recursion.h | 5 +-
> kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 38 ++++++++
> kernel/trace/trace_entries.h | 13 +++
> kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
> kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
> kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
> kernel/trace/trace_output.c | 70 +++++++++++++++
> 8 files changed, 670 insertions(+), 12 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>