* [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
@ 2025-11-18 0:29 Steven Rostedt
2025-11-18 0:29 ` [POC][RFC][PATCH 1/3] tracing: Add perf events Steven Rostedt
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 0:29 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
This series adds a perf event to the ftrace ring buffer.
It is currently a proof of concept as I'm not happy with the interface
and I also think the recorded perf event format may be changed too.
This proof-of-concept interface (which I have no plans on using), currently
just adds 6 new trace options.
event_cache_misses
event_cpu_cycles
func-cache-misses
func-cpu-cycles
funcgraph-cache-misses
funcgraph-cpu-cycles
The first two trigger a perf event after every event, the second two trigger
a perf event after every function and the last two trigger a perf event
right after the start of a function and again at the end of the function.
As this will eventually work with many more perf events than just cache-misses
and cpu-cycles, using options is not appropriate. Especially since the
options are limited to a 64 bit bitmask, and that can easily go much higher.
I'm thinking about having a file instead that will act as a way to enable
perf events for events, function and function graph tracing.
set_event_perf, set_ftrace_perf, set_fgraph_perf
And an available_perf_events file that shows what can be written into these files
(similar to how set_ftrace_filter works). But for now, it was just easier to
implement them as options.
As for the perf event that is triggered. It currently is a dynamic array of
64 bit values. Each value is broken up into 8 bits for what type of perf
event it is, and 56 bits for the counter. It only writes a per CPU raw
counter and does not do any math. That would need to be done by any
post processing.
Since the values are for user space to do the subtraction to figure out the
difference between events, for example, the function_graph tracer may have:
is_vmalloc_addr() {
/* cpu_cycles: 5582263593 cache_misses: 2869004572 */
/* cpu_cycles: 5582267527 cache_misses: 2869006049 */
}
User space would subtract 2869006049 - 2869004572 = 1477
Then 56 bits should be plenty.
2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
416 / 4 = 104
If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
days. This tooling is not for seeing how many cycles run over 104 days.
User space tooling would just need to be aware that the value is 56 bits and
when calculating the difference between start and end do something like:
if (start > end)
end |= 1ULL << 56;
delta = end - start;
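Roughly, post-processing tooling could wrap that up like the below (just a
sketch; the macro and function names are made up for the example, and the
record layout may still change):

#include <stdint.h>

/* 8 bits of type in the top byte, 56 bits of raw counter below it */
#define TRACE_PERF_TYPE(v)      ((uint64_t)(v) >> 56)
#define TRACE_PERF_COUNTER(v)   ((uint64_t)(v) & ((1ULL << 56) - 1))

/* Wrap-corrected difference between two raw counter words */
static uint64_t trace_perf_delta(uint64_t start, uint64_t end)
{
        start = TRACE_PERF_COUNTER(start);
        end = TRACE_PERF_COUNTER(end);

        /* Assumes at most one wrap of the 56 bit counter */
        if (start > end)
                end += 1ULL << 56;

        return end - start;
}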
The next question is how to label the perf events to be in the 8 bit
portion. It could simply be a value that is registered, and listed in the
available_perf_events file.
cpu_cycles:1
cache_misses:2
[..]
And this would need to be recorded by any tooling reading the events
so that it knows how to map the events with their attached ids.
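For illustration, tooling could load that mapping with something as simple as
the below (purely hypothetical: neither the file nor its format exists yet,
and the names here are made up):

#include <stdio.h>
#include <string.h>

/* id (0-255) to event name, filled from the proposed available_perf_events */
static char perf_id_names[256][64];

static int load_perf_ids(const char *path)
{
        char name[64];
        unsigned int id;
        FILE *fp = fopen(path, "r");

        if (!fp)
                return -1;

        /* Expected format: one "name:id" pair per line */
        while (fscanf(fp, " %63[^:]:%u", name, &id) == 2) {
                if (id < 256)
                        strcpy(perf_id_names[id], name);
        }

        fclose(fp);
        return 0;
}

A real tool would want to be more defensive than this, but the point is only
that the id to name mapping has to travel with the trace data.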
But again, this is just a proof-of-concept. How this will eventually be
implemented is yet to be determined.
But to test these patches (which are based on top of my linux-next branch,
which should now be in linux-next):
# cd /sys/kernel/tracing
# echo 1 > options/event_cpu_cycles
# echo 1 > options/event_cache_misses
# echo 1 > events/syscalls/enable
# cat trace
[..]
bash-995 [007] ..... 98.255252: sys_write -> 0x2
bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
bash-995 [007] ..... 98.255369: sys_close -> 0x0
Comments welcomed.
Steven Rostedt (3):
tracing: Add perf events
ftrace: Add perf counters to function tracing
fgraph: Add perf counters to function graph tracer
----
include/linux/trace_recursion.h | 5 +-
kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 38 ++++++++
kernel/trace/trace_entries.h | 13 +++
kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
kernel/trace/trace_output.c | 70 +++++++++++++++
8 files changed, 670 insertions(+), 12 deletions(-)
* [POC][RFC][PATCH 1/3] tracing: Add perf events
2025-11-18 0:29 [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Steven Rostedt
@ 2025-11-18 0:29 ` Steven Rostedt
2025-11-18 8:35 ` Peter Zijlstra
2025-11-18 0:29 ` [POC][RFC][PATCH 2/3] ftrace: Add perf counters to function tracing Steven Rostedt
` (3 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 0:29 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Add perf events into the ftrace ring buffer. Create a new ftrace event
called a "perf_event". This event contains a dynamic array of u64 words.
Each entry holds 56 bits of the raw content of a perf PMU counter in the
word, leaving 8 bits as an identifier for what that word represents.
One may ask "what happens when the counter is greater than 56 bits". The
answer is that you really shouldn't care. The value is written for user
space to consume and do any calculations. If one wants to see the
difference between two events, they can simply subtract the previous one
from the next one. If there is a wrap over the 56 bits, then adding a
"1ULL << 56" to the second value if it is less than the first will give
the correct result.
"What happens if the difference of the counters is 1 << 55 apart?"
Let's look at CPU cycles, as they probably go up the quickest. At 4GHz,
that would be 4,000,000,000 times a second.
(1 << 55) / 4,000,000,000 = 9007199 seconds
9007199 / 60 = 150119 minutes
150119 / 60 = 2501 hours
2501 / 24 = 104 days!
This will not work if you want to see the number of cycles between two
events if those two events are 104 days apart. Do we care?
Currently only cpu cycles and cache misses are supported, but more can be
added in the future.
Two new options are added: event_cache_misses and event_cpu_cycles
# cd /sys/kernel/tracing
# echo 1 > options/event_cache_misses
# echo 1 > events/syscalls/enable
# cat trace
[..]
bash-1009 [005] ..... 566.863956: sys_write -> 0x2
bash-1009 [005] ..... 566.863973: cache_misses: 26544738
bash-1009 [005] ..... 566.864003: sys_dup2(oldfd: 0xa, newfd: 1)
bash-1009 [005] ..... 566.864004: cache_misses: 26546241
bash-1009 [005] ..... 566.864021: sys_dup2 -> 0x1
bash-1009 [005] ..... 566.864022: cache_misses: 26549598
bash-1009 [005] ..... 566.864059: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
bash-1009 [005] ..... 566.864060: cache_misses: 26558778
The option will cause the perf event to be triggered after every event.
If cpu_cycles is also enabled:
# echo 1 > options/event_cpu_cycles
# cat trace
[..]
bash-1009 [006] ..... 683.223244: sys_write -> 0x2
bash-1009 [006] ..... 683.223245: cpu_cycles: 273245 cache_misses: 40481492
bash-1009 [006] ..... 683.223262: sys_dup2(oldfd: 0xa, newfd: 1)
bash-1009 [006] ..... 683.223263: cpu_cycles: 286640 cache_misses: 40483017
bash-1009 [006] ..... 683.223278: sys_dup2 -> 0x1
bash-1009 [006] ..... 683.223279: cpu_cycles: 301412 cache_misses: 40486560
bash-1009 [006] ..... 683.223309: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
bash-1009 [006] ..... 683.223310: cpu_cycles: 335188 cache_misses: 40495672
bash-1009 [006] ..... 683.223317: sys_fcntl -> 0x1
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 113 +++++++++++++++++++++-
kernel/trace/trace.h | 28 ++++++
kernel/trace/trace_entries.h | 13 +++
kernel/trace/trace_event_perf.c | 162 ++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 70 ++++++++++++++
5 files changed, 385 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 59cd4ed8af6d..64d966a3ec8b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1110,7 +1110,6 @@ void tracing_on(void)
}
EXPORT_SYMBOL_GPL(tracing_on);
-
static __always_inline void
__buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
{
@@ -2915,6 +2914,103 @@ void trace_event_buffer_commit(struct trace_event_buffer *fbuffer)
}
EXPORT_SYMBOL_GPL(trace_event_buffer_commit);
+#ifdef CONFIG_PERF_EVENTS
+static inline void record_perf_event(struct trace_array *tr,
+ struct trace_buffer *buffer,
+ unsigned int trace_ctx)
+{
+ struct ring_buffer_event *event;
+ struct perf_event_entry *entry;
+ int entries = READ_ONCE(tr->perf_events);
+ struct trace_array_cpu *data;
+ u64 *value;
+ int size;
+ int cpu;
+
+ if (!entries)
+ return;
+
+ guard(preempt_notrace)();
+ cpu = smp_processor_id();
+
+ /* Prevent this from recursing */
+ data = per_cpu_ptr(tr->array_buffer.data, cpu);
+ if (unlikely(!data) || local_read(&data->disabled))
+ return;
+
+ if (local_inc_return(&data->disabled) != 1)
+ goto out;
+
+ size = struct_size(entry, values, entries);
+ event = trace_buffer_lock_reserve(buffer, TRACE_PERF_EVENT, size,
+ trace_ctx);
+ if (!event)
+ goto out;
+ entry = ring_buffer_event_data(event);
+ value = entry->values;
+
+ if (tr->trace_flags & TRACE_ITER(PERF_CYCLES)) {
+ *value++ = TRACE_PERF_VALUE(PERF_TRACE_CYCLES);
+ entries--;
+ }
+
+ if (entries && tr->trace_flags & TRACE_ITER(PERF_CACHE)) {
+ *value++ = TRACE_PERF_VALUE(PERF_TRACE_CACHE);
+ entries--;
+ }
+
+ /* If something changed, zero the rest */
+ if (unlikely(entries))
+ memset(value, 0, sizeof(u64) * entries);
+
+ trace_buffer_unlock_commit_nostack(buffer, event);
+ out:
+ local_dec(&data->disabled);
+}
+
+static int handle_perf_event(struct trace_array *tr, u64 mask, int enabled)
+{
+ int ret = 0;
+ int type;
+
+ switch (mask) {
+
+ case TRACE_ITER(PERF_CYCLES):
+ type = PERF_TRACE_CYCLES;
+ break;
+ case TRACE_ITER(PERF_CACHE):
+ type = PERF_TRACE_CACHE;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (enabled)
+ ret = trace_perf_event_enable(type);
+ else
+ trace_perf_event_disable(type);
+
+ if (ret < 0)
+ return ret;
+
+ if (enabled)
+ tr->perf_events++;
+ else
+ tr->perf_events--;
+
+ if (WARN_ON_ONCE(tr->perf_events < 0))
+ tr->perf_events = 0;
+
+ return 0;
+}
+#else
+static inline void record_perf_event(struct trace_array *tr,
+ struct trace_buffer *buffer,
+ unsigned int trace_ctx)
+{
+}
+#endif
+
/*
* Skip 3:
*
@@ -2932,6 +3028,8 @@ void trace_buffer_unlock_commit_regs(struct trace_array *tr,
{
__buffer_unlock_commit(buffer, event);
+ record_perf_event(tr, buffer, trace_ctx);
+
/*
* If regs is not set, then skip the necessary functions.
* Note, we can still get here via blktrace, wakeup tracer
@@ -5287,7 +5385,20 @@ int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
update_marker_trace(tr, enabled);
/* update_marker_trace updates the tr->trace_flags */
return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+ case TRACE_ITER(PERF_CACHE):
+ case TRACE_ITER(PERF_CYCLES):
+ {
+ int ret = 0;
+
+ ret = handle_perf_event(tr, mask, enabled);
+ if (ret < 0)
+ return ret;
+ break;
}
+#endif
+ } /* switch (mask) */
if (enabled)
tr->trace_flags |= mask;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 58be6d741d72..094a156b0c70 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -56,6 +56,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_PERF_EVENT,
__TRACE_LAST_TYPE,
};
@@ -363,6 +364,8 @@ struct trace_array {
int buffer_disabled;
+ int perf_events;
+
struct trace_pid_list __rcu *filtered_pids;
struct trace_pid_list __rcu *filtered_no_pids;
/*
@@ -537,6 +540,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \
IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
IF_ASSIGN(var, ent, struct timerlat_entry, TRACE_TIMERLAT);\
+ IF_ASSIGN(var, ent, struct perf_event_entry, TRACE_PERF_EVENT); \
IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
TRACE_MMIO_RW); \
@@ -1382,6 +1386,29 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define TRACE_ITER_PROF_TEXT_OFFSET_BIT -1
#endif
+#ifdef CONFIG_PERF_EVENTS
+#define PERF_MAKE_VALUE(type, val) (((type) << 56) | ((val) & ~(0xffULL << 56)))
+/* Not required, but keep consistent with include/uapi/linux/perf_event.h */
+#define PERF_TRACE_CYCLES 0ULL
+#define PERF_TRACE_CACHE 5ULL
+#define TRACE_PERF_VALUE(type) \
+ PERF_MAKE_VALUE((type), do_trace_perf_event(type))
+#define PERF_TRACE_VALUE(val) ((val) & ~(0xffULL << 56))
+#define PERF_TRACE_TYPE(val) ((val) >> 56)
+# define PERF_FLAGS \
+ C(PERF_CACHE, "event_cache_misses"), \
+ C(PERF_CYCLES, "event_cpu_cycles"),
+
+u64 do_trace_perf_event(int type);
+int trace_perf_event_enable(int type);
+void trace_perf_event_disable(int type);
+#else
+# define PERF_FLAGS
+static inline u64 do_trace_perf_event(int type) { return 0; }
+static inline int trace_perf_event_enable(int type) { return -ENOTSUPP; }
+static inline void trace_perf_event_disable(int type) { }
+#endif /* CONFIG_PERF_EVENTS */
+
/*
* trace_iterator_flags is an enumeration that defines bit
* positions into trace_flags that controls the output.
@@ -1420,6 +1447,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ PERF_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index de294ae2c5c5..ecda463a9d8e 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -456,3 +456,16 @@ FTRACE_ENTRY(timerlat, timerlat_entry,
__entry->context,
__entry->timer_latency)
);
+
+#ifdef CONFIG_PERF_EVENTS
+FTRACE_ENTRY(perf_event, perf_event_entry,
+
+ TRACE_PERF_EVENT,
+
+ F_STRUCT(
+ __dynamic_array(u64, values )
+ ),
+
+ F_printk("values: %lld\n", __entry->values[0])
+);
+#endif
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..ff864d300251 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -430,6 +430,168 @@ void perf_trace_buf_update(void *record, u16 type)
}
NOKPROBE_SYMBOL(perf_trace_buf_update);
+static void perf_callback(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ /* nop */
+}
+
+struct trace_perf_event {
+ struct perf_event *event;
+};
+
+static struct trace_perf_event __percpu *perf_cache_events;
+static struct trace_perf_event __percpu *perf_cycles_events;
+static DEFINE_MUTEX(perf_event_mutex);
+static int perf_cache_cnt;
+static int perf_cycles_cnt;
+
+static inline int set_perf_type(int type, int *ptype, int *pconfig, int **pcount,
+ struct trace_perf_event __percpu ***pevents)
+{
+ switch (type) {
+ case PERF_TRACE_CYCLES:
+ if (ptype)
+ *ptype = PERF_TYPE_HARDWARE;
+ if (pconfig)
+ *pconfig = PERF_COUNT_HW_CPU_CYCLES;
+ *pcount = &perf_cycles_cnt;
+ *pevents = &perf_cycles_events;
+ return 0;
+
+ case PERF_TRACE_CACHE:
+ if (ptype)
+ *ptype = PERF_TYPE_HW_CACHE;
+ if (pconfig)
+ *pconfig = PERF_COUNT_HW_CACHE_MISSES;
+ *pcount = &perf_cache_cnt;
+ *pevents = &perf_cache_events;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+u64 do_trace_perf_event(int type)
+{
+ struct trace_perf_event __percpu **pevents;
+ struct trace_perf_event __percpu *events;
+ struct perf_event *e;
+ int *count;
+ int cpu;
+
+ if (set_perf_type(type, NULL, NULL, &count, &pevents) < 0)
+ return 0;
+
+ if (!*count)
+ return 0;
+
+ guard(preempt)();
+
+ events = READ_ONCE(*pevents);
+ if (!events)
+ return 0;
+
+ cpu = smp_processor_id();
+
+ e = per_cpu_ptr(events, cpu)->event;
+ if (!e)
+ return 0;
+
+ e->pmu->read(e);
+ return local64_read(&e->count);
+}
+
+static void __free_trace_perf_events(struct trace_perf_event __percpu *events)
+{
+ struct perf_event *e;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ e = per_cpu_ptr(events, cpu)->event;
+ per_cpu_ptr(events, cpu)->event = NULL;
+ perf_event_release_kernel(e);
+ }
+}
+
+int trace_perf_event_enable(int type)
+{
+ struct perf_event_attr __free(kfree) *attr = NULL;
+ struct trace_perf_event __percpu **pevents;
+ struct trace_perf_event __percpu *events;
+ struct perf_event *e;
+ int *count;
+ int config;
+ int cpu;
+
+ if (set_perf_type(type, &config, &type, &count, &pevents) < 0)
+ return -EINVAL;
+
+ guard(mutex)(&perf_event_mutex);
+
+ if (*count) {
+ (*count)++;
+ return 0;
+ }
+
+ attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+ if (!attr)
+ return -ENOMEM;
+
+ events = alloc_percpu(struct trace_perf_event);
+ if (!events)
+ return -ENOMEM;
+
+ attr->type = type;
+ attr->config = config;
+ attr->size = sizeof(struct perf_event_attr);
+ attr->pinned = 1;
+
+ /* initialize in case of failure */
+ for_each_possible_cpu(cpu) {
+ per_cpu_ptr(events, cpu)->event = NULL;
+ }
+
+ for_each_online_cpu(cpu) {
+ e = perf_event_create_kernel_counter(attr, cpu, NULL,
+ perf_callback, NULL);
+ if (IS_ERR_OR_NULL(e)) {
+ __free_trace_perf_events(events);
+ return PTR_ERR(e);
+ }
+ per_cpu_ptr(events, cpu)->event = e;
+ }
+
+ WRITE_ONCE(*pevents, events);
+ (*count)++;
+
+ return 0;
+}
+
+void trace_perf_event_disable(int type)
+{
+ struct trace_perf_event __percpu **pevents;
+ struct trace_perf_event __percpu *events;
+ int *count;
+
+ if (set_perf_type(type, NULL, NULL, &count, &pevents) < 0)
+ return;
+
+ guard(mutex)(&perf_event_mutex);
+
+ if (WARN_ON_ONCE(!*count))
+ return;
+
+ if (--(*count))
+ return;
+
+ events = READ_ONCE(*pevents);
+ WRITE_ONCE(*pevents, NULL);
+
+ __free_trace_perf_events(events);
+}
+
#ifdef CONFIG_FUNCTION_TRACER
static void
perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index ebbab3e9622b..a0f21cec9eed 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1661,6 +1661,75 @@ static struct trace_event trace_timerlat_event = {
.funcs = &trace_timerlat_funcs,
};
+/* TRACE_PERF_EVENT */
+
+static enum print_line_t
+trace_perf_event_print(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct trace_entry *entry = iter->ent;
+ struct trace_seq *s = &iter->seq;
+ struct perf_event_entry *field;
+ u64 value;
+ u64 *val;
+ u64 *end;
+
+ end = (u64 *)((long)iter->ent + iter->ent_size);
+
+ trace_assign_type(field, entry);
+
+ for (val = field->values; val < end; val++) {
+ if (val != field->values)
+ trace_seq_putc(s, ' ');
+ value = PERF_TRACE_VALUE(*val);
+ switch (PERF_TRACE_TYPE(*val)) {
+ case PERF_TRACE_CYCLES:
+ trace_seq_printf(s, "cpu_cycles: %lld", value);
+ break;
+ case PERF_TRACE_CACHE:
+ trace_seq_printf(s, "cache_misses: %lld", value);
+ break;
+ default:
+ trace_seq_printf(s, "unkown(%d): %lld",
+ (int)PERF_TRACE_TYPE(*val), value);
+ }
+ }
+ trace_seq_putc(s, '\n');
+ return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_perf_event_raw(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct perf_event_entry *field;
+ struct trace_seq *s = &iter->seq;
+ u64 *val;
+ u64 *end;
+
+ end = (u64 *)((long)iter->ent + iter->ent_size);
+
+ trace_assign_type(field, iter->ent);
+
+ for (val = field->values; val < end; val++) {
+ if (val != field->values)
+ trace_seq_putc(s, ' ');
+ trace_seq_printf(s, "%lld\n", *val);
+ }
+ trace_seq_putc(s, '\n');
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_perf_event_funcs = {
+ .trace = trace_perf_event_print,
+ .raw = trace_perf_event_raw,
+};
+
+static struct trace_event trace_perf_event_event = {
+ .type = TRACE_PERF_EVENT,
+ .funcs = &trace_perf_event_funcs,
+};
+
/* TRACE_BPUTS */
static enum print_line_t
trace_bputs_print(struct trace_iterator *iter, int flags,
@@ -1878,6 +1947,7 @@ static struct trace_event *events[] __initdata = {
&trace_timerlat_event,
&trace_raw_data_event,
&trace_func_repeats_event,
+ &trace_perf_event_event,
NULL
};
--
2.51.0
* [POC][RFC][PATCH 2/3] ftrace: Add perf counters to function tracing
2025-11-18 0:29 [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Steven Rostedt
2025-11-18 0:29 ` [POC][RFC][PATCH 1/3] tracing: Add perf events Steven Rostedt
@ 2025-11-18 0:29 ` Steven Rostedt
2025-11-18 0:29 ` [POC][RFC][PATCH 3/3] fgraph: Add perf counters to function graph tracer Steven Rostedt
` (2 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 0:29 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Add option to trigger perf events to function tracing.
Two new options are added: func-cpu-cycles and func-cache-misses
# cd /sys/kernel/tracing
# echo 1 > options/func-cache-misses
# echo function > current_tracer
# cat trace
[..]
sshd-session-1014 [005] ..... 327.836708: __x64_sys_read <-do_syscall_64
sshd-session-1014 [005] ..... 327.836708: cache_misses: 741719054
sshd-session-1014 [005] ..... 327.836712: ksys_read <-do_syscall_64
sshd-session-1014 [005] ..... 327.836713: cache_misses: 741720271
sshd-session-1014 [005] ..... 327.836716: fdget_pos <-ksys_read
sshd-session-1014 [005] ..... 327.836717: cache_misses: 741721483
sshd-session-1014 [005] ..... 327.836720: vfs_read <-ksys_read
sshd-session-1014 [005] ..... 327.836721: cache_misses: 741722726
sshd-session-1014 [005] ..... 327.836724: rw_verify_area <-vfs_read
sshd-session-1014 [005] ..... 327.836725: cache_misses: 741723940
sshd-session-1014 [005] ..... 327.836728: security_file_permission <-rw_verify_area
sshd-session-1014 [005] ..... 327.836729: cache_misses: 741725151
The option will cause the perf event to be triggered after every function call.
If cpu_cycles is also enabled:
# echo 1 > options/func-cpu-cycles
# cat trace
[..]
sshd-session-1014 [005] b..1. 536.844538: preempt_count_sub <-_raw_spin_unlock
sshd-session-1014 [005] b..1. 536.844539: cpu_cycles: 1919425978 cache_misses: 3431216952
sshd-session-1014 [005] b.... 536.844545: validate_xmit_skb_list <-sch_direct_xmit
sshd-session-1014 [005] b.... 536.844545: cpu_cycles: 1919429935 cache_misses: 3431218535
sshd-session-1014 [005] b.... 536.844551: validate_xmit_skb.isra.0 <-validate_xmit_skb_list
sshd-session-1014 [005] b.... 536.844552: cpu_cycles: 1919433763 cache_misses: 3431220112
sshd-session-1014 [005] b.... 536.844557: netif_skb_features <-validate_xmit_skb.isra.0
sshd-session-1014 [005] b.... 536.844558: cpu_cycles: 1919437574 cache_misses: 3431221688
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/trace_recursion.h | 5 +-
kernel/trace/trace.c | 58 ++++++++++++---
kernel/trace/trace.h | 6 ++
kernel/trace/trace_functions.c | 124 ++++++++++++++++++++++++++++++--
4 files changed, 178 insertions(+), 15 deletions(-)
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index ae04054a1be3..c42d86d81afa 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -132,9 +132,12 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
* will think a recursion occurred, and the event will be dropped.
* Let a single instance happen via the TRANSITION_BIT to
* not drop those events.
+ *
+ * When ip is zero, the caller is purposely trying to cause
+ * recursion. Don't record it.
*/
bit = TRACE_CTX_TRANSITION + start;
- if (val & (1 << bit)) {
+ if ((val & (1 << bit)) && ip) {
do_ftrace_record_recursion(ip, pip);
return -1;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 64d966a3ec8b..42bf1c046de1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2915,21 +2915,18 @@ void trace_event_buffer_commit(struct trace_event_buffer *fbuffer)
EXPORT_SYMBOL_GPL(trace_event_buffer_commit);
#ifdef CONFIG_PERF_EVENTS
-static inline void record_perf_event(struct trace_array *tr,
- struct trace_buffer *buffer,
- unsigned int trace_ctx)
+static inline void trace_perf_event(struct trace_array *tr,
+ struct trace_buffer *buffer,
+ int entries, u64 flags,
+ unsigned int trace_ctx)
{
struct ring_buffer_event *event;
struct perf_event_entry *entry;
- int entries = READ_ONCE(tr->perf_events);
struct trace_array_cpu *data;
u64 *value;
int size;
int cpu;
- if (!entries)
- return;
-
guard(preempt_notrace)();
cpu = smp_processor_id();
@@ -2949,12 +2946,12 @@ static inline void record_perf_event(struct trace_array *tr,
entry = ring_buffer_event_data(event);
value = entry->values;
- if (tr->trace_flags & TRACE_ITER(PERF_CYCLES)) {
+ if (flags & TRACE_ITER(PERF_CYCLES)) {
*value++ = TRACE_PERF_VALUE(PERF_TRACE_CYCLES);
entries--;
}
- if (entries && tr->trace_flags & TRACE_ITER(PERF_CACHE)) {
+ if (entries && flags & TRACE_ITER(PERF_CACHE)) {
*value++ = TRACE_PERF_VALUE(PERF_TRACE_CACHE);
entries--;
}
@@ -2968,6 +2965,49 @@ static inline void record_perf_event(struct trace_array *tr,
local_dec(&data->disabled);
}
+static inline void record_perf_event(struct trace_array *tr,
+ struct trace_buffer *buffer,
+ unsigned int trace_ctx)
+{
+ int entries = READ_ONCE(tr->perf_events);
+
+ if (!entries)
+ return;
+
+ trace_perf_event(tr, buffer, entries, tr->trace_flags, trace_ctx);
+}
+
+#ifdef CONFIG_FUNCTION_TRACER
+void ftrace_perf_events(struct trace_array *tr, int perf_events,
+ u64 perf_mask, unsigned int trace_ctx)
+{
+ struct trace_buffer *buffer;
+ int bit;
+
+ /*
+ * Prevent any ftrace recursion.
+ * The ftrace_test_recursion_trylock() allows one nested loop
+ * to handle the case where an interrupt comes in and traces
+ * before the preempt_count is updated to the new context.
+ * This one instance allows that function to still be traced.
+ *
+ * Reading the perf counters will call functions that function
+ * tracing will want to trace. Prevent this one loop from happening
+ * by taking the lock again. If an interrupt comes in now,
+ * it may still be dropped, but there's really nothing that can
+ * be done about that until all those locations get fixed.
+ */
+ bit = ftrace_test_recursion_trylock(0, 0);
+
+ buffer = tr->array_buffer.buffer;
+ trace_perf_event(tr, buffer, perf_events, perf_mask, trace_ctx);
+
+ /* bit < 0 means the trylock failed and does not need to be unlocked */
+ if (bit >= 0)
+ ftrace_test_recursion_unlock(bit);
+}
+#endif
+
static int handle_perf_event(struct trace_array *tr, u64 mask, int enabled)
{
int ret = 0;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 094a156b0c70..bb764a2255c7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -365,6 +365,8 @@ struct trace_array {
int buffer_disabled;
int perf_events;
+ int ftrace_perf_events;
+ u64 ftrace_perf_mask;
struct trace_pid_list __rcu *filtered_pids;
struct trace_pid_list __rcu *filtered_no_pids;
@@ -1402,6 +1404,10 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
u64 do_trace_perf_event(int type);
int trace_perf_event_enable(int type);
void trace_perf_event_disable(int type);
+#ifdef CONFIG_FUNCTION_TRACER
+void ftrace_perf_events(struct trace_array *tr, int perf_events,
+ u64 perf_mask, unsigned int trace_ctx);
+#endif
#else
# define PERF_FLAGS
static inline u64 do_trace_perf_event(int type) { return 0; }
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index c12795c2fb39..97f46ac7ef21 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -47,8 +47,12 @@ enum {
TRACE_FUNC_OPT_NO_REPEATS = 0x2,
TRACE_FUNC_OPT_ARGS = 0x4,
- /* Update this to next highest bit. */
- TRACE_FUNC_OPT_HIGHEST_BIT = 0x8
+ /* Update this to next highest function bit. */
+ TRACE_FUNC_OPT_HIGHEST_BIT = 0x8,
+
+ /* These are just other options */
+ TRACE_FUNC_OPT_PERF_CYCLES = 0x10,
+ TRACE_FUNC_OPT_PERF_CACHE = 0x20,
};
#define TRACE_FUNC_OPT_MASK (TRACE_FUNC_OPT_HIGHEST_BIT - 1)
@@ -143,6 +147,105 @@ static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
return true;
}
+#ifdef CONFIG_PERF_EVENTS
+static inline void
+do_trace_function(struct trace_array *tr, unsigned long ip,
+ unsigned long parent_ip, unsigned int trace_ctx,
+ struct ftrace_regs *fregs)
+{
+ trace_function(tr, ip, parent_ip, trace_ctx, fregs);
+
+ if (likely(!tr->ftrace_perf_events))
+ return;
+
+ ftrace_perf_events(tr, tr->ftrace_perf_events, tr->ftrace_perf_mask, trace_ctx);
+}
+
+static bool handle_perf_event_flag(struct trace_array *tr, int bit, int set, int *err)
+{
+ u64 mask;
+ int type;
+
+ *err = 0;
+
+ switch (bit) {
+ case TRACE_FUNC_OPT_PERF_CYCLES:
+ mask = TRACE_ITER(PERF_CYCLES);
+ type = PERF_TRACE_CYCLES;
+ break;
+
+ case TRACE_FUNC_OPT_PERF_CACHE:
+ mask = TRACE_ITER(PERF_CACHE);
+ type = PERF_TRACE_CACHE;
+ break;
+
+ default:
+ return 0;
+ }
+
+ if (set)
+ *err = trace_perf_event_enable(type);
+ else
+ trace_perf_event_disable(type);
+
+ if (*err < 0)
+ return 1;
+
+ if (set) {
+ tr->ftrace_perf_events++;
+ tr->ftrace_perf_mask |= mask;
+ } else {
+ tr->ftrace_perf_mask &= ~mask;
+ tr->ftrace_perf_events--;
+ }
+ return 1;
+}
+
+static void ftrace_perf_enable(struct trace_array *tr, int bit)
+{
+ int err;
+
+ if (!(tr->current_trace_flags->val & bit))
+ return;
+
+ handle_perf_event_flag(tr, bit, 1, &err);
+ if (err < 0)
+ tr->current_trace_flags->val &= ~bit;
+}
+
+static void ftrace_perf_disable(struct trace_array *tr, int bit)
+{
+ int err;
+
+ /* Only disable if it was enabled */
+ if (!(tr->current_trace_flags->val & bit))
+ return;
+
+ handle_perf_event_flag(tr, bit, 0, &err);
+}
+
+static void ftrace_perf_init(struct trace_array *tr)
+{
+ ftrace_perf_enable(tr, TRACE_FUNC_OPT_PERF_CYCLES);
+ ftrace_perf_enable(tr, TRACE_FUNC_OPT_PERF_CACHE);
+}
+
+static void ftrace_perf_reset(struct trace_array *tr)
+{
+ ftrace_perf_disable(tr, TRACE_FUNC_OPT_PERF_CYCLES);
+ ftrace_perf_disable(tr, TRACE_FUNC_OPT_PERF_CACHE);
+}
+#else
+#define do_trace_function trace_function
+static inline bool handle_perf_event_flag(struct trace_array *tr, int bit,
+ int set, int *err)
+{
+ return 0;
+}
+static inline void ftrace_perf_init(struct trace_array *tr) { }
+static inline void ftrace_perf_reset(struct trace_array *tr) { }
+#endif /* CONFIG_PERF_EVENTS */
+
static int function_trace_init(struct trace_array *tr)
{
ftrace_func_t func;
@@ -165,6 +268,8 @@ static int function_trace_init(struct trace_array *tr)
tr->array_buffer.cpu = raw_smp_processor_id();
+ ftrace_perf_init(tr);
+
tracing_start_cmdline_record();
tracing_start_function_trace(tr);
return 0;
@@ -172,6 +277,7 @@ static int function_trace_init(struct trace_array *tr)
static void function_trace_reset(struct trace_array *tr)
{
+ ftrace_perf_reset(tr);
tracing_stop_function_trace(tr);
tracing_stop_cmdline_record();
ftrace_reset_array_ops(tr);
@@ -223,7 +329,7 @@ function_trace_call(unsigned long ip, unsigned long parent_ip,
trace_ctx = tracing_gen_ctx_dec();
- trace_function(tr, ip, parent_ip, trace_ctx, NULL);
+ do_trace_function(tr, ip, parent_ip, trace_ctx, NULL);
ftrace_test_recursion_unlock(bit);
}
@@ -245,7 +351,7 @@ function_args_trace_call(unsigned long ip, unsigned long parent_ip,
trace_ctx = tracing_gen_ctx();
- trace_function(tr, ip, parent_ip, trace_ctx, fregs);
+ do_trace_function(tr, ip, parent_ip, trace_ctx, fregs);
ftrace_test_recursion_unlock(bit);
}
@@ -372,7 +478,7 @@ function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
trace_ctx = tracing_gen_ctx_dec();
process_repeats(tr, ip, parent_ip, last_info, trace_ctx);
- trace_function(tr, ip, parent_ip, trace_ctx, NULL);
+ do_trace_function(tr, ip, parent_ip, trace_ctx, NULL);
out:
ftrace_test_recursion_unlock(bit);
@@ -428,6 +534,10 @@ static struct tracer_opt func_opts[] = {
{ TRACER_OPT(func-no-repeats, TRACE_FUNC_OPT_NO_REPEATS) },
#ifdef CONFIG_FUNCTION_TRACE_ARGS
{ TRACER_OPT(func-args, TRACE_FUNC_OPT_ARGS) },
+#endif
+#ifdef CONFIG_PERF_EVENTS
+ { TRACER_OPT(func-cpu-cycles, TRACE_FUNC_OPT_PERF_CYCLES) },
+ { TRACER_OPT(func-cache-misses, TRACE_FUNC_OPT_PERF_CACHE) },
#endif
{ } /* Always set a last empty entry */
};
@@ -457,6 +567,7 @@ func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
{
ftrace_func_t func;
u32 new_flags;
+ int err;
/* Do nothing if already set. */
if (!!set == !!(tr->current_trace_flags->val & bit))
@@ -466,6 +577,9 @@ func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
if (tr->current_trace != &function_trace)
return 0;
+ if (handle_perf_event_flag(tr, bit, set, &err))
+ return err;
+
new_flags = (tr->current_trace_flags->val & ~bit) | (set ? bit : 0);
func = select_trace_function(new_flags);
if (!func)
--
2.51.0
* [POC][RFC][PATCH 3/3] fgraph: Add perf counters to function graph tracer
2025-11-18 0:29 [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Steven Rostedt
2025-11-18 0:29 ` [POC][RFC][PATCH 1/3] tracing: Add perf events Steven Rostedt
2025-11-18 0:29 ` [POC][RFC][PATCH 2/3] ftrace: Add perf counters to function tracing Steven Rostedt
@ 2025-11-18 0:29 ` Steven Rostedt
2025-11-18 3:08 ` [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Masami Hiramatsu
2025-11-18 7:25 ` Namhyung Kim
4 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 0:29 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Add option to trigger perf events to function graph tracing.
Two new options are added: funcgraph-cpu-cycles and funcgraph-cache-misses
This adds the perf event right after the start of a function and again
just before the end of a function.
# cd /sys/kernel/tracing
# echo 1 > options/funcgraph-cache-misses
# echo vfs_read > set_graph_function
# echo function_graph > current_tracer
# cat trace
[..]
5) | vfs_read() {
5) | /* cache_misses: 822565 */
5) | rw_verify_area() {
5) | /* cache_misses: 824003 */
5) | security_file_permission() {
5) | /* cache_misses: 825440 */
5) | apparmor_file_permission() {
5) | /* cache_misses: 826875 */
5) | aa_file_perm() {
5) | /* cache_misses: 828326 */
5) | __rcu_read_lock() {
5) | /* cache_misses: 829766 */
5) | /* cache_misses: 830785 */
5) 5.116 us | }
5) | __rcu_read_unlock() {
5) | /* cache_misses: 832611 */
5) | /* cache_misses: 833632 */
5) 5.223 us | }
5) | /* cache_misses: 835043 */
5) + 25.462 us | }
5) | /* cache_misses: 836454 */
5) + 35.518 us | }
5) | bpf_lsm_file_permission() {
5) | /* cache_misses: 838276 */
5) | /* cache_misses: 839292 */
5) 4.613 us | }
5) | /* cache_misses: 840697 */
5) + 54.684 us | }
5) | /* cache_misses: 842107 */
5) + 64.449 us | }
The option will cause the perf event to be triggered at the start and end of
every function called.
If cpu_cycles is also enabled:
# echo 1 > options/funcgraph-cpu-cycles
# cat trace
[..]
3) | vfs_read() {
3) | /* cpu_cycles: 2947481793 cache_misses: 2002984031 */
3) | rw_verify_area() {
3) | /* cpu_cycles: 2947488061 cache_misses: 2002985922 */
3) | security_file_permission() {
3) | /* cpu_cycles: 2947492867 cache_misses: 2002987812 */
3) | apparmor_file_permission() {
3) | /* cpu_cycles: 2947497713 cache_misses: 2002989700 */
3) | aa_file_perm() {
3) | /* cpu_cycles: 2947502560 cache_misses: 2002991604 */
3) | __rcu_read_lock() {
3) | /* cpu_cycles: 2947507398 cache_misses: 2002993497 */
3) | /* cpu_cycles: 2947512435 cache_misses: 2002994969 */
3) 7.586 us | }
3) | __rcu_read_unlock() {
3) | /* cpu_cycles: 2947518226 cache_misses: 2002997248 */
3) | /* cpu_cycles: 2947522328 cache_misses: 2002998722 */
3) 7.211 us | }
3) | /* cpu_cycles: 2947527067 cache_misses: 2003000586 */
3) + 37.581 us | }
3) | /* cpu_cycles: 2947531727 cache_misses: 2003002450 */
3) + 52.061 us | }
3) | bpf_lsm_file_permission() {
3) | /* cpu_cycles: 2947537274 cache_misses: 2003004725 */
3) | /* cpu_cycles: 2947541104 cache_misses: 2003006194 */
3) 7.029 us | }
3) | /* cpu_cycles: 2947545762 cache_misses: 2003008052 */
3) + 80.971 us | }
3) | /* cpu_cycles: 2947550459 cache_misses: 2003009915 */
3) + 95.515 us | }
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.h | 4 +
kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++++--
2 files changed, 116 insertions(+), 5 deletions(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index bb764a2255c7..64cdb6fda3fb 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -366,7 +366,9 @@ struct trace_array {
int perf_events;
int ftrace_perf_events;
+ int fgraph_perf_events;
u64 ftrace_perf_mask;
+ u64 fgraph_perf_mask;
struct trace_pid_list __rcu *filtered_pids;
struct trace_pid_list __rcu *filtered_no_pids;
@@ -946,6 +948,8 @@ static __always_inline bool ftrace_hash_empty(struct ftrace_hash *hash)
#define TRACE_GRAPH_PRINT_RETVAL_HEX 0x1000
#define TRACE_GRAPH_PRINT_RETADDR 0x2000
#define TRACE_GRAPH_ARGS 0x4000
+#define TRACE_GRAPH_PERF_CACHE 0x8000
+#define TRACE_GRAPH_PERF_CYCLES 0x10000
#define TRACE_GRAPH_PRINT_FILL_SHIFT 28
#define TRACE_GRAPH_PRINT_FILL_MASK (0x3 << TRACE_GRAPH_PRINT_FILL_SHIFT)
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 44d5dc5031e2..e618dd12ca0c 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -22,6 +22,8 @@ static int ftrace_graph_skip_irqs;
/* Do not record function time when task is sleeping */
unsigned int fgraph_no_sleep_time;
+static struct tracer graph_trace;
+
struct fgraph_cpu_data {
pid_t last_pid;
int depth;
@@ -88,6 +90,11 @@ static struct tracer_opt trace_opts[] = {
/* Include sleep time (scheduled out) between entry and return */
{ TRACER_OPT(sleep-time, TRACE_GRAPH_SLEEP_TIME) },
+#ifdef CONFIG_PERF_EVENTS
+ { TRACER_OPT(funcgraph-cache-misses, TRACE_GRAPH_PERF_CACHE) },
+ { TRACER_OPT(funcgraph-cpu-cycles, TRACE_GRAPH_PERF_CYCLES) },
+#endif
+
{ } /* Empty entry */
};
@@ -104,6 +111,97 @@ static bool tracer_flags_is_set(struct trace_array *tr, u32 flags)
return (tr->current_trace_flags->val & flags) == flags;
}
+#ifdef CONFIG_PERF_EVENTS
+static inline void handle_perf_event(struct trace_array *tr, unsigned int trace_ctx)
+{
+ if (!tr->fgraph_perf_events)
+ return;
+ ftrace_perf_events(tr, tr->fgraph_perf_events, tr->fgraph_perf_mask, trace_ctx);
+}
+
+static int ftrace_graph_perf_event(struct trace_array *tr, int set, int bit)
+{
+ u64 mask;
+ int type;
+ int ret = 0;
+
+ /* Do nothing if the current tracer is not this tracer */
+ if (tr->current_trace != &graph_trace)
+ return 0;
+
+ switch (bit) {
+ case TRACE_GRAPH_PERF_CACHE:
+ mask = TRACE_ITER(PERF_CACHE);
+ type = PERF_TRACE_CACHE;
+ break;
+ case TRACE_GRAPH_PERF_CYCLES:
+ mask = TRACE_ITER(PERF_CYCLES);
+ type = PERF_TRACE_CYCLES;
+ break;
+ }
+
+ if (set)
+ ret = trace_perf_event_enable(type);
+ else
+ trace_perf_event_disable(type);
+
+ if (ret < 0)
+ return ret;
+
+ if (set) {
+ tr->fgraph_perf_events++;
+ tr->fgraph_perf_mask |= mask;
+ } else {
+ tr->fgraph_perf_mask &= ~mask;
+ tr->fgraph_perf_events--;
+ }
+ return 0;
+}
+
+static void ftrace_graph_perf_enable(struct trace_array *tr, int bit)
+{
+ int err;
+
+ if (!(tr->current_trace_flags->val & bit))
+ return;
+
+ err = ftrace_graph_perf_event(tr, 1, bit);
+ if (err < 0)
+ tr->current_trace_flags->val &= ~bit;
+}
+
+static void ftrace_graph_perf_disable(struct trace_array *tr, int bit)
+{
+ /* Only disable if it was enabled */
+ if (!(tr->current_trace_flags->val & bit))
+ return;
+
+ ftrace_graph_perf_event(tr, 0, bit);
+}
+
+static void fgraph_perf_init(struct trace_array *tr)
+{
+ ftrace_graph_perf_enable(tr, TRACE_GRAPH_PERF_CYCLES);
+ ftrace_graph_perf_enable(tr, TRACE_GRAPH_PERF_CACHE);
+}
+
+static void fgraph_perf_reset(struct trace_array *tr)
+{
+ ftrace_graph_perf_disable(tr, TRACE_GRAPH_PERF_CYCLES);
+ ftrace_graph_perf_disable(tr, TRACE_GRAPH_PERF_CACHE);
+}
+#else
+static inline void handle_perf_event(struct trace_array *tr, unsigned int trace_ctx)
+{
+}
+static inline void fgraph_perf_init(struct trace_array *tr)
+{
+}
+static inline void fgraph_perf_reset(struct trace_array *tr)
+{
+}
+#endif
+
/*
* DURATION column is being also used to display IRQ signs,
* following values are used by print_graph_irq and others
@@ -272,6 +370,9 @@ static int graph_entry(struct ftrace_graph_ent *trace,
ret = __graph_entry(tr, trace, trace_ctx, fregs);
}
+ if (ret)
+ handle_perf_event(tr, trace_ctx);
+
return ret;
}
@@ -324,6 +425,8 @@ void __trace_graph_return(struct trace_array *tr,
struct trace_buffer *buffer = tr->array_buffer.buffer;
struct ftrace_graph_ret_entry *entry;
+ handle_perf_event(tr, trace_ctx);
+
event = trace_buffer_lock_reserve(buffer, TRACE_GRAPH_RET,
sizeof(*entry), trace_ctx);
if (!event)
@@ -465,6 +568,8 @@ static int graph_trace_init(struct trace_array *tr)
if (!tracer_flags_is_set(tr, TRACE_GRAPH_SLEEP_TIME))
fgraph_no_sleep_time++;
+ fgraph_perf_init(tr);
+
/* Make gops functions visible before we start tracing */
smp_mb();
@@ -476,8 +581,6 @@ static int graph_trace_init(struct trace_array *tr)
return 0;
}
-static struct tracer graph_trace;
-
static int ftrace_graph_trace_args(struct trace_array *tr, int set)
{
trace_func_graph_ent_t entry;
@@ -512,6 +615,7 @@ static void graph_trace_reset(struct trace_array *tr)
if (WARN_ON_ONCE(fgraph_no_sleep_time < 0))
fgraph_no_sleep_time = 0;
+ fgraph_perf_reset(tr);
tracing_stop_cmdline_record();
unregister_ftrace_graph(tr->gops);
}
@@ -1684,9 +1788,12 @@ func_graph_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
ftrace_graph_skip_irqs = 0;
break;
- case TRACE_GRAPH_ARGS:
- return ftrace_graph_trace_args(tr, set);
- }
+#ifdef CONFIG_PERF_EVENTS
+ case TRACE_GRAPH_PERF_CACHE:
+ case TRACE_GRAPH_PERF_CYCLES:
+ return ftrace_graph_perf_event(tr, set, bit);
+#endif
+ };
return 0;
}
--
2.51.0
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 0:29 [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Steven Rostedt
` (2 preceding siblings ...)
2025-11-18 0:29 ` [POC][RFC][PATCH 3/3] fgraph: Add perf counters to function graph tracer Steven Rostedt
@ 2025-11-18 3:08 ` Masami Hiramatsu
2025-11-18 3:42 ` Steven Rostedt
2025-11-18 7:25 ` Namhyung Kim
4 siblings, 1 reply; 15+ messages in thread
From: Masami Hiramatsu @ 2025-11-18 3:08 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Ian Rogers, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
Hi Steve,
Thanks for the great idea!
On Mon, 17 Nov 2025 19:29:50 -0500
Steven Rostedt <rostedt@kernel.org> wrote:
>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
> event_cache_misses
> event_cpu_cycles
> func-cache-misses
> func-cpu-cycles
> funcgraph-cache-misses
> funcgraph-cpu-cycles
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventually work with many more perf events than just cache-misses
> and cpu-cycles, using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
> set_event_perf, set_ftrace_perf, set_fgraph_perf
What about adding a global `trigger` action file so that users can
add these "perf" actions by writing into it? It is something like
the stacktrace trigger for events. (Maybe we can move
stacktrace/user-stacktrace into it too.)
For pre-defined/software counters:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
For some hardware event sources (see /sys/bus/event_source/devices/):
# echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
If we need to set those counters for tracers and events separately,
we can add `events/trigger` and `tracer-trigger` files.
echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
To disable counters, we can use '!', the same as with event triggers.
echo !perf:cpu_cycles > trigger
To add more than 2 counters, connect them with ':'
(or we will allow appending new perf counters).
This allows users to set perf counter options for each event.
Maybe we should also move the 'stacktrace'/'userstacktrace' option
flags to it eventually.
>
> And an available_perf_events file that shows what can be written into these files
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would need to be done by any
> post processing.
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
> is_vmalloc_addr() {
> /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
Just a style question: Would this mean the first line is for function entry
and the second one is function return?
>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
> 2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
> 416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the value is 56 bits and
> when calculating the difference between start and end do something like:
>
> if (start > end)
> end |= 1ULL << 56;
>
> delta = end - start;
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
> cpu_cycles:1
> cache_misses:2
> [..]
Looks good to me. I think the pre-defined events from `perf list`
will be there and have fixed numbers.
Thank you,
>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
> # cd /sys/kernel/tracing
> # echo 1 > options/event_cpu_cycles
> # echo 1 > options/event_cache_misses
> # echo 1 > events/syscalls/enable
> # cat trace
> [..]
> bash-995 [007] ..... 98.255252: sys_write -> 0x2
> bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
> bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
> bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
> bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
> bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
> bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
> bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
> bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
> bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
> bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
> bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
> bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
>
>
> Steven Rostedt (3):
> tracing: Add perf events
> ftrace: Add perf counters to function tracing
> fgraph: Add perf counters to function graph tracer
>
> ----
> include/linux/trace_recursion.h | 5 +-
> kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 38 ++++++++
> kernel/trace/trace_entries.h | 13 +++
> kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
> kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
> kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
> kernel/trace/trace_output.c | 70 +++++++++++++++
> 8 files changed, 670 insertions(+), 12 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 3:08 ` [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Masami Hiramatsu
@ 2025-11-18 3:42 ` Steven Rostedt
2025-11-18 8:11 ` Masami Hiramatsu
0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 3:42 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Peter Zijlstra, Thomas Gleixner, Ian Rogers,
Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
On Tue, 18 Nov 2025 12:08:21 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Hi Steve,
>
> Thanks for the great idea!
Thanks!
> >
> > As this will eventually work with many more perf events than just cache-misses
> > and cpu-cycles, using options is not appropriate. Especially since the
> > options are limited to a 64 bit bitmask, and that can easily go much higher.
> > I'm thinking about having a file instead that will act as a way to enable
> > perf events for events, function and function graph tracing.
> >
> > set_event_perf, set_ftrace_perf, set_fgraph_perf
>
> What about adding a global `trigger` action file so that user can
> add these "perf" actions to write into it. It is something like
> stacktrace for events. (Maybe we can move stacktrace/user-stacktrace
> into it too)
>
> For pre-defined/software counters:
> # echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
For events, it would make more sense to put it into the events directory:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
As there is already an events/enable file.
Heck we could even add it per system:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/syscalls/trigger
>
> For some hardware event sources (see /sys/bus/event_source/devices/):
> # echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
>
> echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
We still need a way to add an identifier list. Currently, if the size of
the type identifier is one byte, then it can only support up to 256 events.
Do we need every event for this, or just a subset of events that
would be supported?
>
> If we need to set those counters for tracers and events separately,
> we can add `events/trigger` and `tracer-trigger` files.
As I mentioned, the trigger for events should be in the events directory.
We could add an ftrace_trigger that can affect both the function and
function graph tracers.
>
> echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
>
> To disable counters, we can use '!' as same as event triggers.
>
> echo !perf:cpu_cycles > trigger
Yes, it would follow the current way to disable a trigger.
>
> To add more than 2 counters, connect it with ':'.
> (or, we will allow to append new perf counters)
> This allows user to set perf counter options for each events.
>
> Maybe we also should move 'stacktrace'/'userstacktrace' option
> flags to it too eventually.
We can add them, but we may never be able to remove the existing option
flags due to backward compatibility.
> >
> > And an available_perf_events file that shows what can be written into these files
> > (similar to how set_ftrace_filter works). But for now, it was just easier to
> > implement them as options.
> >
> > As for the perf event that is triggered. It currently is a dynamic array of
> > 64 bit values. Each value is broken up into 8 bits for what type of perf
> > event it is, and 56 bits for the counter. It only writes a per CPU raw
> > counter and does not do any math. That would need to be done by any
> > post processing.
> >
> > Since the values are for user space to do the subtraction to figure out the
> > difference between events, for example, the function_graph tracer may have:
> >
> > is_vmalloc_addr() {
> > /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> > /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> > }
>
> Just a style question: Would this mean the first line is for function entry
> and the second one is function return?
Yes.
Perhaps we could add a field to the perf event to allow for annotation,
so the above could look like:
is_vmalloc_addr() {
/* --> cpu_cycles: 5582263593 cache_misses: 2869004572 */
/* <-- cpu_cycles: 5582267527 cache_misses: 2869006049 */
}
Or something similar?
> > The next question is how to label the perf events to be in the 8 bit
> > portion. It could simply be a value that is registered, and listed in the
> > available_perf_events file.
> >
> > cpu_cycles:1
> > cache_misses:2
> > [..]
>
> Looks good to me. I think pre-definied events of `perf list`
> will be there and have fixed numbers.
Thanks for looking at this,
-- Steve
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 0:29 [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Steven Rostedt
` (3 preceding siblings ...)
2025-11-18 3:08 ` [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer Masami Hiramatsu
@ 2025-11-18 7:25 ` Namhyung Kim
2025-11-18 16:24 ` Steven Rostedt
4 siblings, 1 reply; 15+ messages in thread
From: Namhyung Kim @ 2025-11-18 7:25 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Ian Rogers, Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
Hi Steve,
On Mon, Nov 17, 2025 at 07:29:50PM -0500, Steven Rostedt wrote:
>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
> event_cache_misses
> event_cpu_cycles
> func-cache-misses
> func-cpu-cycles
> funcgraph-cache-misses
> funcgraph-cpu-cycles
Unfortunately the hardware cache event is ambiguous about which cache
level it refers to, and architectures define it differently. There are
encodings to clearly define the cache levels and accesses, but the
support depends on the hardware capabilities.
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventually work with many more perf events than just cache-misses
> and cpu-cycles, using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
> set_event_perf, set_ftrace_perf, set_fgraph_perf
>
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
If you want to keep the perf events per CPU, you may need to consider CPU
migrations for the func-graph case. Otherwise userspace may not be able to
calculate the diff from the beginning correctly.
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
> is_vmalloc_addr() {
> /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
> 2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
> 416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the vale is 56 bits and
> when calculating the difference between start and end do something like:
>
> if (start > end)
> end |= 1ULL << 56;
>
> delta = end - start;
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
> cpu_cycles:1
> cach_misses:2
> [..]
>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
> # cd /sys/kernel/tracing
> # echo 1 > options/event_cpu_cycles
> # echo 1 > options/event_cache_misses
> # echo 1 > events/syscalls/enable
> # cat trace
> [..]
> bash-995 [007] ..... 98.255252: sys_write -> 0x2
> bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
> bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
> bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
> bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
> bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
> bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
> bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
> bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
> bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
> bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
> bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
> bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
Just FYI, I did a similar thing (like the fgraph case) in uftrace and
grouped two related events to produce a metric.
$ uftrace -T a@read=pmu-cycle ~/tmp/abc
# DURATION TID FUNCTION
[ 521741] | main() {
[ 521741] | a() {
[ 521741] | /* read:pmu-cycle (cycles=482 instructions=38) */
[ 521741] | b() {
[ 521741] | c() {
0.659 us [ 521741] | getpid();
1.600 us [ 521741] | } /* c */
1.780 us [ 521741] | } /* b */
[ 521741] | /* diff:pmu-cycle (cycles=+7361 instructions=+3955 IPC=0.54) */
24.485 us [ 521741] | } /* a */
34.797 us [ 521741] | } /* main */
It reads the cycles and instructions events (specified by 'pmu-cycle') at
the entry and exit of the given function ('a') and shows the diff along
with the derived IPC metric.
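(In the trace above, that works out to IPC = 3955 instructions / 7361 cycles ≈ 0.54.)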
Thanks,
Namhyung
>
>
> Steven Rostedt (3):
> tracing: Add perf events
> ftrace: Add perf counters to function tracing
> fgraph: Add perf counters to function graph tracer
>
> ----
> include/linux/trace_recursion.h | 5 +-
> kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 38 ++++++++
> kernel/trace/trace_entries.h | 13 +++
> kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
> kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
> kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
> kernel/trace/trace_output.c | 70 +++++++++++++++
> 8 files changed, 670 insertions(+), 12 deletions(-)
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 3:42 ` Steven Rostedt
@ 2025-11-18 8:11 ` Masami Hiramatsu
2025-11-18 13:53 ` Steven Rostedt
2025-11-18 16:31 ` Steven Rostedt
0 siblings, 2 replies; 15+ messages in thread
From: Masami Hiramatsu @ 2025-11-18 8:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Peter Zijlstra, Thomas Gleixner, Ian Rogers,
Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
On Mon, 17 Nov 2025 22:42:27 -0500
Steven Rostedt <rostedt@kernel.org> wrote:
> > > As this will eventual work with many more perf events than just cache-misses
> > > and cpu-cycles , using options is not appropriate. Especially since the
> > > options are limited to a 64 bit bitmask, and that can easily go much higher.
> > > I'm thinking about having a file instead that will act as a way to enable
> > > perf events for events, function and function graph tracing.
> > >
> > > set_event_perf, set_ftrace_perf, set_fgraph_perf
> >
> > What about adding a global `trigger` action file so that users can
> > add these "perf" actions by writing into it. It would be something like
> > the stacktrace trigger for events. (Maybe we can move stacktrace/user-stacktrace
> > into it too.)
> >
> > For pre-defined/software counters:
> > # echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
>
> For events, it would make more sense to put it into the events directory:
>
> # echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
>
> As there is already a events/enable
>
> Heck we could even add it per system:
>
> # echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/syscalls/trigger
Yes, this will be very useful!
>
> >
> > For some hardware event sources (see /sys/bus/event_source/devices/):
> > # echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
> >
> > echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
>
> Still need a way to add an identifier list. Currently, if the size of
> the type identifier is one byte, then it can only support up to 256 events.
Yes, so if the user adds more than that, it will return -ENOSPC.
>
> Do we need every event for this? Or just have a subset of events that
> would be supported?
For event tracing, these would likely be used to measure the delta between
paired events. For such a use case, the user may want to set it only on those
events.
>
>
> >
> > If we need to set those counters for tracers and events separately,
> > we can add `events/trigger` and `tracer-trigger` files.
>
> As I mentioned, the trigger for events should be in the events directory.
Agreed.
>
> We could add a ftrace_trigger that can affect both function and
> function graph tracer.
Got it.
>
> >
> > echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
> >
> > To disable counters, we can use '!' as same as event triggers.
> >
> > echo !perf:cpu_cycles > trigger
>
> Yes, it would follow the current way to disable a trigger.
>
> >
> > To add more than 2 counters, connect them with ':'.
> > (Or we could allow appending new perf counters.)
> > This allows the user to set perf counter options for each event.
> >
> > Maybe we also should move 'stacktrace'/'userstacktrace' option
> > flags to it too eventually.
>
> We can add them, but may never be able to remove them due to backward
> compatibility.
Ah, indeed.
>
> > >
> > > And an available_perf_events that show what can be written into these files,
> > > (similar to how set_ftrace_filter works). But for now, it was just easier to
> > > implement them as options.
> > >
> > > As for the perf event that is triggered. It currently is a dynamic array of
> > > 64 bit values. Each value is broken up into 8 bits for what type of perf
> > > event it is, and 56 bits for the counter. It only writes a per CPU raw
> > > counter and does not do any math. That would be needed to be done by any
> > > post processing.
> > >
> > > Since the values are for user space to do the subtraction to figure out the
> > > difference between events, for example, the function_graph tracer may have:
> > >
> > > is_vmalloc_addr() {
> > > /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> > > /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> > > }
> >
> > Just a style question: Would this mean the first line is for function entry
> > and the second one is function return?
>
> Yes.
>
> Perhaps we could add field to the perf event to allow for annotation,
> so the above could look like:
>
> is_vmalloc_addr() {
> /* --> cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* <-- cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
>
> Or something similar?
Yeah, it looks more readable.
Thank you!
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [POC][RFC][PATCH 1/3] tracing: Add perf events
2025-11-18 0:29 ` [POC][RFC][PATCH 1/3] tracing: Add perf events Steven Rostedt
@ 2025-11-18 8:35 ` Peter Zijlstra
2025-11-18 13:42 ` Steven Rostedt
0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2025-11-18 8:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Ian Rogers,
Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
On Mon, Nov 17, 2025 at 07:29:51PM -0500, Steven Rostedt wrote:
> +u64 do_trace_perf_event(int type)
> +{
> + struct trace_perf_event __percpu **pevents;
> + struct trace_perf_event __percpu *events;
> + struct perf_event *e;
> + int *count;
> + int cpu;
> +
> + if (set_perf_type(type, NULL, NULL, &count, &pevents) < 0)
> + return 0;
> +
> + if (!*count)
> + return 0;
> +
> + guard(preempt)();
> +
> + events = READ_ONCE(*pevents);
> + if (!events)
> + return 0;
> +
> + cpu = smp_processor_id();
> +
> + e = per_cpu_ptr(events, cpu)->event;
> + if (!e)
> + return 0;
> +
> + e->pmu->read(e);
> + return local64_read(&e->count);
> +}
NAK, wtf do you think it's okay to use internal stuff like that? And
wrongly while at it.
What you wanted to use was perf_event_read_local().
* Re: [POC][RFC][PATCH 1/3] tracing: Add perf events
2025-11-18 8:35 ` Peter Zijlstra
@ 2025-11-18 13:42 ` Steven Rostedt
2025-11-18 20:24 ` Steven Rostedt
0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 13:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Ian Rogers, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
On Tue, 18 Nov 2025 09:35:10 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> > + cpu = smp_processor_id();
> > +
> > + e = per_cpu_ptr(events, cpu)->event;
> > + if (!e)
> > + return 0;
> > +
> > + e->pmu->read(e);
> > + return local64_read(&e->count);
> > +}
>
> NAK, wtf do you think its okay to use internal stuff like that? And
> wrongly while at it.
Peter, this is a PROOF-OF-CONCEPT. It means I'm showing the concept and not
the implementation. I'm hoping the NAK is on the implementation and not the
concept.
>
> What you wanted to use was perf_event_read_local().
Great! I didn't know about that. Which is why I posted this as a
PROOF-OF-CONCEPT and not even a normal RFC, so that I could learn about the
proper way of doing this.
-- Steve
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 8:11 ` Masami Hiramatsu
@ 2025-11-18 13:53 ` Steven Rostedt
2025-11-18 13:57 ` Steven Rostedt
2025-11-18 16:31 ` Steven Rostedt
1 sibling, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 13:53 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Ian Rogers, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
On Tue, 18 Nov 2025 17:11:47 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
> >
> > Still need a way to add an identifier list. Currently, if the size of
> > the type identifier is one byte, then it can only support up to 256 events.
>
> Yes, so if user adds more than that, it will return -ENOSPC.
The issue is that the ids are defined by what is possible, not by what the
user enables.
-- Steve
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 13:53 ` Steven Rostedt
@ 2025-11-18 13:57 ` Steven Rostedt
0 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 13:57 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Ian Rogers, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
On Tue, 18 Nov 2025 08:53:24 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> > Yes, so if user adds more than that, it will return -ENOSPC.
>
> The issue is that the ids are defined by what is possible, not by what the
> user enables.
Now we could take 4 more bits from the mask and bring the raw value down to
just 52 bits. 2^51 at 4GHz is still 6 days, which is plenty more than required.
This would make the id 12 bits, or 4096 different defined events.
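A minimal sketch of what that packing could look like (hypothetical helper
names, nothing like this exists in the patches yet):

#include <linux/types.h>

#define TRACE_PERF_CNT_BITS     52
#define TRACE_PERF_CNT_MASK     ((1ULL << TRACE_PERF_CNT_BITS) - 1)

/* Pack a 12-bit event id above the low 52 bits of the raw counter */
static inline u64 trace_perf_pack(u32 id, u64 counter)
{
        return ((u64)id << TRACE_PERF_CNT_BITS) | (counter & TRACE_PERF_CNT_MASK);
}

static inline u32 trace_perf_id(u64 val)
{
        return val >> TRACE_PERF_CNT_BITS;
}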
-- Steve
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 7:25 ` Namhyung Kim
@ 2025-11-18 16:24 ` Steven Rostedt
0 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 16:24 UTC (permalink / raw)
To: Namhyung Kim
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Thomas Gleixner, Ian Rogers,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
On Mon, 17 Nov 2025 23:25:56 -0800
Namhyung Kim <namhyung@kernel.org> wrote:
> > As for the perf event that is triggered. It currently is a dynamic array of
> > 64 bit values. Each value is broken up into 8 bits for what type of perf
> > event it is, and 56 bits for the counter. It only writes a per CPU raw
> > counter and does not do any math. That would be needed to be done by any
> > post processing.
>
> If you want to keep the perf events per CPU, you may consider CPU
> migrations for the func-graph case. Otherwise userspace may not
> calculate the diff from the begining correctly.
That's easily solved by user space also adding a sched_switch perf event
trigger. ;-)
>
> Just FYI, I did the similar thing (like fgraph case) in uftrace and I
> grouped two related events to produce a metric.
>
> $ uftrace -T a@read=pmu-cycle ~/tmp/abc
> # DURATION TID FUNCTION
> [ 521741] | main() {
> [ 521741] | a() {
> [ 521741] | /* read:pmu-cycle (cycles=482 instructions=38) */
> [ 521741] | b() {
> [ 521741] | c() {
> 0.659 us [ 521741] | getpid();
> 1.600 us [ 521741] | } /* c */
> 1.780 us [ 521741] | } /* b */
> [ 521741] | /* diff:pmu-cycle (cycles=+7361 instructions=+3955 IPC=0.54) */
> 24.485 us [ 521741] | } /* a */
> 34.797 us [ 521741] | } /* main */
>
> It reads cycles and instructions events (specified by 'pmu-cycle') at
> entry and exit of the given function ('a') and shows the diff with the
> metric IPC.
I originally tried to implement this, but it became more complex than I
wanted in the kernel, as I would then need to add a hook into sched_switch,
record the perf event counter there, and keep track of it for every task.
That would require memory to be saved somewhere. I started adding it to the
function graph shadow stack and then decided it would be much easier to let
user space figure it out.
By running the function graph tracer and showing the start and end counters,
as well as the counters at the sched_switch trace event, user space could do
all the math and accounting, and the code in the kernel can remain simple.
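As a rough sketch of that user-space accounting (hypothetical structures, not
code from any existing tool), a post-processor could charge each scheduled-in
interval to the function being measured:

#include <stdint.h>

/* Per-task bookkeeping for one perf counter while a measured function runs */
struct task_counter {
        uint64_t accum; /* counts charged while the task was on a CPU */
        uint64_t last;  /* counter value when the task last went on CPU (or at entry) */
};

/* funcgraph_entry: start measuring (56-bit wrap handling omitted for brevity) */
static void on_entry(struct task_counter *tc, uint64_t counter)
{
        tc->accum = 0;
        tc->last = counter;
}

/* sched_switch out: bank what ran on this CPU so far */
static void on_switch_out(struct task_counter *tc, uint64_t counter)
{
        tc->accum += counter - tc->last;
}

/* sched_switch in (possibly on another CPU): restart from that CPU's counter */
static void on_switch_in(struct task_counter *tc, uint64_t counter)
{
        tc->last = counter;
}

/* funcgraph_exit: total counted while the task was actually running */
static uint64_t on_exit(struct task_counter *tc, uint64_t counter)
{
        return tc->accum + (counter - tc->last);
}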
-- Steve
* Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
2025-11-18 8:11 ` Masami Hiramatsu
2025-11-18 13:53 ` Steven Rostedt
@ 2025-11-18 16:31 ` Steven Rostedt
1 sibling, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 16:31 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Ian Rogers, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Douglas Raillard
On Tue, 18 Nov 2025 17:11:47 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > If we need to set those counters for tracers and events separately,
> > > we can add `events/trigger` and `tracer-trigger` files.
> >
> > As I mentioned, the trigger for events should be in the events directory.
>
> Agreed.
>
> >
> > We could add a ftrace_trigger that can affect both function and
> > function graph tracer.
>
Actually, I should add "trigger" files in the ftrace events:
events/ftrace/function/trigger
events/ftrace/funcgraph_entry/trigger
events/ftrace/funcgraph_exit/trigger
Hmm,
-- Steve
* Re: [POC][RFC][PATCH 1/3] tracing: Add perf events
2025-11-18 13:42 ` Steven Rostedt
@ 2025-11-18 20:24 ` Steven Rostedt
0 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2025-11-18 20:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Ian Rogers, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
On Tue, 18 Nov 2025 08:42:26 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > What you wanted to use was perf_event_read_local().
>
> Great! I didn't know about that. Which is why I posted this as a
> PROOF-OF-CONCEPT and not even a normal RFC, so that I could learn about the
> proper way of doing this.
I folded in this change:
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index ff864d300251..34962f80dce1 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -478,6 +478,7 @@ u64 do_trace_perf_event(int type)
struct trace_perf_event __percpu **pevents;
struct trace_perf_event __percpu *events;
struct perf_event *e;
+ u64 val;
int *count;
int cpu;
@@ -499,8 +500,10 @@ u64 do_trace_perf_event(int type)
if (!e)
return 0;
- e->pmu->read(e);
- return local64_read(&e->count);
+ if (perf_event_read_local(e, &val, NULL, NULL) < 0)
+ return 0;
+
+ return val;
}
static void __free_trace_perf_events(struct trace_perf_event __percpu *events)
Thanks!
-- Steve